Berlin Buzzwords 2025

Exploring reranking depth in modern search pipelines
2025-06-16, Frannz Salon

Semantic reranking on top of a ‘cheaper’ retrieval step is common in modern search applications. The reranking depth is the number of documents we retrieve and feed into the reranking model. We experiment with different models and datasets and present our findings, including some counterintuitive ones.


Semantic reranking on top of a ‘cheaper’ retrieval step is becoming increasingly common in modern search applications. It offers a different cost-quality profile to semantic retrieval, trading indexing-time compute for retrieval-time compute. The reranking depth is the number of documents we retrieve and feed into the reranking model in order to optimise their ordering. Intuitively, there is a “natural” trade-off between the uplift we can achieve by operating on a larger pool of candidates and the associated cost of running “expensive” semantic rerankers for longer. In this presentation we start by investigating the behaviour of different models across different scenarios and present our observations, including some counterintuitive ones. Then we attempt to explain the emergence of certain patterns, and finally we revisit the “efficiency vs effectiveness” trade-off from two different perspectives.

Here is an outline of the talk:

First, we analyse retrieval performance as a function of the reranking depth and identify three main patterns (a minimal sketch of the depth sweep behind these curves follows the list):
- Fast increase followed by saturation: the most common scenario, where a larger reranking depth improves performance until it plateaus
- Fast increase to a maximum, then decay: the first “counterintuitive” result, where reranking is beneficial up to a certain depth but performance degrades beyond it
- Steady decay: the case where the reranker actually worsens the ordering produced by the retriever; it is the least common scenario but still a counterintuitive result
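
To make the setup concrete, here is a minimal sketch of such a depth sweep (the retrieve and rerank callables, and nDCG@10 as the metric, are illustrative assumptions rather than the talk's actual harness): retrieve once at the maximum depth, rerank the top-k prefix for each depth k, and record the metric.

    import numpy as np

    def ndcg_at_10(ranked_ids, relevant_ids):
        """Binary-gain nDCG@10 for a single query."""
        gains = [1.0 if d in relevant_ids else 0.0 for d in ranked_ids[:10]]
        dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
        ideal = sum(1.0 / np.log2(i + 2)
                    for i in range(min(10, len(relevant_ids))))
        return dcg / ideal if ideal > 0 else 0.0

    def sweep_depths(retrieve, rerank, queries, qrels, depths):
        """Mean nDCG@10 at each reranking depth k."""
        scores = {k: [] for k in depths}
        for q in queries:
            candidates = retrieve(q, k=max(depths))  # cheap first stage
            for k in depths:
                top_k = rerank(q, candidates[:k])    # expensive second stage
                # documents beyond depth k keep their first-stage order
                scores[k].append(ndcg_at_10(top_k + candidates[k:], qrels[q]))
        return {k: float(np.mean(v)) for k, v in scores.items()}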

Second, we dive into these three classes and attempt to explain the observed behaviour. For the first pattern we design a curve-fitting procedure which provides a surprisingly good fit. For the other two cases we discuss some potential underlying causes of the performance decline.
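
The exact functional form behind that fit is not spelled out in this abstract; as one plausible illustration, a saturating exponential ndcg(k) = c - a * exp(-k / b) fitted with scipy reproduces the “fast increase then saturation” shape (the data points below are toy values, not our results):

    import numpy as np
    from scipy.optimize import curve_fit

    def saturating(k, a, b, c):
        # Rises quickly for small k, then plateaus at the asymptote c.
        return c - a * np.exp(-k / b)

    depths = np.array([10, 20, 50, 100, 200, 500, 1000], dtype=float)
    ndcg = np.array([0.31, 0.35, 0.40, 0.43, 0.445, 0.450, 0.451])  # toy

    (a, b, c), _ = curve_fit(saturating, depths, ndcg, p0=(0.2, 100.0, 0.45))
    # ~95% of the asymptotic gain is reached by k ≈ 3b, since exp(-3) ≈ 0.05
    print(f"asymptote ≈ {c:.3f}, ~95% of the gain by depth ≈ {3 * b:.0f}")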

Third, we connect our findings to existing work in industry and academia, and highlight some of the dataset characteristics that appear relevant to the observed results.

Fourth, we show how the interplay between the scores of positive (i.e. relevant) and negative (i.e. irrelevant) documents can explain the emergence of the patterns.
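
A toy simulation illustrates the mechanism (the Gaussian score distributions are purely an assumption for demonstration): when the reranker's scores for relevant and irrelevant documents overlap, every extra candidate is another chance for an irrelevant document to outscore a relevant one, so the best negative score creeps upward with depth.

    import numpy as np

    rng = np.random.default_rng(0)

    def p_negative_on_top(mu_pos, mu_neg, sigma, depth, trials=10_000):
        # Probability that the best of `depth` irrelevant documents
        # outscores a single relevant one under assumed Gaussian scores.
        pos = rng.normal(mu_pos, sigma, size=trials)
        neg = rng.normal(mu_neg, sigma, size=(trials, depth)).max(axis=1)
        return float((neg > pos).mean())

    for depth in (10, 100, 1000):
        print(depth, p_negative_on_top(1.0, 0.0, 0.5, depth))
    # The probability grows with depth, which is one way the
    # "increase to a maximum, then decay" pattern can emerge.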

Finally, we revisit the “efficiency vs effectiveness” trade-off. We start with a “latency-free” analysis, where we focus only on the evolution of our performance metric and examine the possibility of using a smaller reranking depth without losing much of the gains. We also show how this correlates with the recall performance of the first-stage retriever. Then we incorporate the latency cost to present a more realistic scenario and explain the trade-offs under different budget constraints.
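
Both analyses can be sketched in a few lines (the metric values and the linear latency model below are assumptions for illustration): first pick the smallest depth that retains, say, 95% of the peak metric, then repeat the choice under a per-query latency budget.

    def min_depth_for_gain(metric_by_depth, fraction=0.95):
        """Smallest depth whose metric is >= fraction of the best observed."""
        target = fraction * max(metric_by_depth.values())
        return min(k for k, m in metric_by_depth.items() if m >= target)

    def best_depth_under_budget(metric_by_depth, ms_per_doc, base_ms, budget_ms):
        """Best-scoring depth whose estimated latency fits the budget."""
        feasible = {k: m for k, m in metric_by_depth.items()
                    if base_ms + ms_per_doc * k <= budget_ms}
        return max(feasible, key=feasible.get) if feasible else None

    metric = {10: 0.31, 50: 0.40, 100: 0.43, 500: 0.45, 1000: 0.451}  # toy
    print(min_depth_for_gain(metric))                     # latency-free view
    print(best_depth_under_budget(metric, 0.8, 20, 250))  # budgeted view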

This talk is relevant to the audience because:
- Retrieval performance remains critical in modern applications such as RAG.
- It highlights the importance of domain-specific evaluation.


Tags:

Search, Data Science

Level:

Intermediate

Senior ML Engineer / NLP at Elastic