Reranking

Reranking is the second stage of a two-stage retrieval pipeline: it reorders a list of already-fetched candidates by their estimated relevance to a query. The first pass, usually a vector search or a keyword search, quickly returns dozens to hundreds of candidate documents in only a coarse order; the reranker scores them more precisely and keeps just the few best.
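The two-stage flow can be sketched in a few lines. This is a minimal illustration rather than a production design: `coarse_score`, `precise_score`, `retrieve_then_rerank`, and the parameters `top_k` and `top_n` are hypothetical names, and both scoring functions are toy stand-ins (term overlap and phrase matching) for a real first-pass retriever and a real reranking model.

```python
from typing import List


def coarse_score(query: str, doc: str) -> float:
    """Toy first-pass score: how many distinct query terms occur in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return float(len(q_terms & d_terms))


def precise_score(query: str, doc: str) -> float:
    """Toy stand-in for a reranker: how many adjacent query word pairs
    also appear adjacent in the document (a crude phrase match)."""
    q = query.lower().split()
    d = doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return float(sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams))


def retrieve_then_rerank(query: str, corpus: List[str],
                         top_k: int = 3, top_n: int = 2) -> List[str]:
    # Stage 1: cheap scoring over the whole corpus, keep top_k candidates.
    candidates = sorted(corpus, key=lambda d: coarse_score(query, d),
                        reverse=True)[:top_k]
    # Stage 2: costlier scoring over the shortlist only, keep top_n.
    reranked = sorted(candidates, key=lambda d: precise_score(query, d),
                      reverse=True)
    return reranked[:top_n]


corpus = [
    "a dog rode the train",
    "dog food reviews and how to pick a brand",
    "how to train a dog to sit",
    "train schedules for the city",
    "cat grooming basics",
]
result = retrieve_then_rerank("train a dog", corpus)
print(result)
```

In this toy corpus the first pass cannot separate "a dog rode the train" from "how to train a dog to sit", since both contain all three query terms; the second pass moves the training document to the top.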

The key difference lies in the model. Dense first-pass retrieval typically uses a bi-encoder, which embeds the query and each document separately; document embeddings can therefore be precomputed and indexed, which makes search fast but less accurate. The reranker is typically a cross-encoder: the query and a document pass through the model together, so attention operates directly between their tokens and yields a more accurate relevance score. The cost is one full model inference per query-document pair, so only a short list of candidates is reranked.
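The structural difference, and its effect on cost, can be made concrete with a toy sketch. Everything here is an assumption for illustration: `ToyBiEncoder` and `ToyCrossEncoder` are hypothetical classes, bag-of-words counts stand in for learned embeddings, and bigram matching stands in for token-level attention. Real bi-encoders do capture some word order; the toy exaggerates the gap so that it is visible.

```python
import math
from collections import Counter
from typing import List


def _tokens(text: str) -> List[str]:
    return text.lower().split()


class ToyBiEncoder:
    """Encodes one text at a time; query and document never see each other."""

    def __init__(self) -> None:
        self.calls = 0  # number of encoder invocations

    def encode(self, text: str) -> Counter:
        self.calls += 1
        return Counter(_tokens(text))  # toy "embedding": bag-of-words counts


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(count * v[term] for term, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


class ToyCrossEncoder:
    """Scores a (query, document) pair jointly, so it can use interactions
    between the two texts (here: shared adjacent word pairs)."""

    def __init__(self) -> None:
        self.calls = 0  # one invocation per query-document pair

    def score(self, query: str, doc: str) -> float:
        self.calls += 1
        q, d = _tokens(query), _tokens(doc)
        doc_bigrams = set(zip(d, d[1:]))
        return float(sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams))


docs = [
    "man bites dog",
    "dog bites man in the park",
    "cat sleeps on the sofa",
    "stock markets fall sharply",
]
query = "dog bites man"

bi = ToyBiEncoder()
doc_vecs = [bi.encode(d) for d in docs]  # offline: one call per document
q_vec = bi.encode(query)                 # online: one call per query
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
shortlist = [docs[i] for i in ranked[:2]]

cross = ToyCrossEncoder()
best = max(shortlist, key=lambda d: cross.score(query, d))
print(shortlist, best, bi.calls, cross.calls)
```

The bi-encoder side needs only one encoder call at query time, because the document vectors already exist. The cross-encoder side needs one call per candidate, which is why it is applied to the two-item shortlist rather than to all four documents.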

As of 2025–2026, reranking is a standard component of production RAG (retrieval-augmented generation) pipelines because it measurably improves retrieval accuracy at a modest latency cost when applied to a short candidate list. Commercial models (Cohere Rerank, Jina, Voyage) and open-source options (BGE rerankers, FlashRank) make it readily available.

Sources

See also