Retrieval-Augmented Generation (RAG)

A pattern that combines a search/retrieval system with a language model: the system fetches relevant documents from a knowledge source and supplies them to the model before it answers, grounding output in real data.

Retrieval-Augmented Generation (RAG) is the dominant pattern for getting accurate, up-to-date answers from a large language model. Rather than relying solely on the LLM’s frozen training data, a RAG system first retrieves relevant chunks from an external knowledge base — internal docs, web pages, customer records — and includes them in the prompt as context for the model to ground its answer.

The classic pipeline:

  1. Index time: documents are split into chunks, converted to embeddings, and stored in a vector database.
  2. Query time: the user’s question is embedded; the database returns the top-k semantically similar chunks.
  3. Generation: retrieved chunks are inserted into the LLM prompt with instructions like “Answer using only the context below.”
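The pipeline above can be sketched end to end. This is a minimal toy, not a production implementation: real systems use a learned embedding model and a vector database, whereas here a bag-of-words vector stands in for embeddings and a Python list stands in for the index; all document text and function names are illustrative.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words count vector. A real system would
    # call an embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Index time: split documents into chunks, embed, and store them.
docs = [
    "The refund policy allows returns within 30 days.",
    "Shipping takes 3-5 business days within the EU.",
    "Support is available via email around the clock.",
]
index = [(chunk, embed(chunk)) for chunk in docs]

def retrieve(query, k=2):
    # 2. Query time: embed the question, return the top-k similar chunks.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query):
    # 3. Generation: insert retrieved chunks into the prompt as context.
    context = "\n".join(retrieve(query))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_prompt("What is the refund policy for returns?"))
```

The prompt built here would then be sent to the LLM; the "Answer using only the context below" instruction is what anchors the model to the retrieved material rather than its training data.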

RAG mitigates hallucination, keeps content current without retraining, and provides citations the user can verify. It is the engine behind Perplexity, ChatGPT browsing mode, enterprise knowledge assistants, and most production LLM applications.

2025 brought variations: hybrid search (vector + keyword), rerankers (Cohere, Jina), GraphRAG (Microsoft) for complex relations, and Agentic RAG where an agent plans multi-step retrieval. The fundamentals remain: get the right context to the model, and grounded answers follow.
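Of those variations, hybrid search is the simplest to illustrate. A common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF); the sketch below assumes two precomputed result lists with made-up document IDs.

```python
def rrf(rankings, k=60):
    # Reciprocal rank fusion: each ranked list contributes 1 / (k + rank)
    # per document; documents ranked well by several retrievers rise to
    # the top. k=60 is a conventional smoothing constant.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # from the embedding index
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # from e.g. a BM25 index
print(rrf([vector_hits, keyword_hits]))
```

Here doc_b wins because both retrievers rank it highly, even though neither ranks it first; that robustness to any single retriever's blind spots is the appeal of hybrid search.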
