The Problem RAG Solves
Large language models are trained on a fixed dataset with a knowledge cutoff date. They can't access your company's internal docs, recent data, or domain-specific information. When asked about things they don't know, they either refuse or hallucinate.
Retrieval-Augmented Generation (RAG) fixes this by giving the model access to external knowledge at query time. Instead of relying solely on what it memorized during training, the model retrieves relevant documents and uses them as context to generate answers.
How RAG Works
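A RAG pipeline has three stages (the walkthrough below splits indexing into two steps, chunking and embedding, for four steps in total):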
- Indexing — Your documents are split into chunks, converted to vector embeddings, and stored in a vector database.
- Retrieval — When a user asks a question, the question is embedded and the most similar document chunks are retrieved.
- Generation — The retrieved chunks are inserted into the prompt as context, and the LLM generates an answer grounded in that context.
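The three stages can be sketched end to end without any external services. This is a toy under stated assumptions, not a production design: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, the sample chunks are made up, and the final prompt is only built, not sent to an LLM.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Indexing: embed each chunk once and keep the vector next to the text.
chunks = [
    "Users log in with a password and a one-time code.",
    "Invoices are generated on the first day of each month.",
    "Password resets are sent to the registered email address.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(question, k=2):
    """Retrieval: embed the question and return the k most similar chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, k=2):
    """Generation (input side only): put the retrieved chunks into the prompt."""
    context = "\n\n".join(retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How do users log in?")
print(prompt)
```

Swapping `toy_embed` for a real embedding model and the list scan for a vector database gives the pipeline described in the steps that follow.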
Step 1: Document Chunking
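Raw documents are usually too long to embed and retrieve as a single unit, and passing whole documents to the LLM wastes context window space, so they're split into overlapping chunks, typically 200-500 tokens each. Note that RecursiveCharacterTextSplitter measures chunk_size in characters by default, so the values below are character counts, not tokens.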
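# Newer LangChain releases ship this class in the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # maximum chunk length, in characters by default
    chunk_overlap=50,    # characters shared between neighboring chunks
    separators=["\n\n", "\n", ". ", " "]  # tried in order: paragraphs, lines, sentences, words
)

# document_text is assumed to hold the raw text of one document
chunks = splitter.split_text(document_text)
print(f"Created {len(chunks)} chunks")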
Chunks too small lose context. Chunks too large dilute relevance. Overlap ensures ideas that span chunk boundaries aren't lost. Experiment with different sizes for your data.
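To see what overlap does, here is a simplified sliding-window splitter over words. It is only an illustration: `chunk_words` is a hypothetical helper, and it does not reproduce how RecursiveCharacterTextSplitter picks split points from its separators.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(1, 21))  # 20 placeholder words: w1 ... w20
for chunk in chunk_words(text):
    print(chunk)
```

Each chunk starts with the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.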
Step 2: Embedding and Indexing
Each chunk is converted into a dense vector (embedding) that captures its semantic meaning. Similar chunks will have similar vectors.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def embed(texts):
    """Return one embedding vector per input string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [r.embedding for r in response.data]

# Embed all chunks
vectors = embed(chunks)
These vectors are stored in a vector database like Chroma, Pinecone, Weaviate, or pgvector. The database supports fast nearest-neighbor search.
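Conceptually, a vector database stores each vector alongside its text and answers "which stored vectors are closest to this one?". The sketch below does that with an exact brute-force scan. `InMemoryIndex` is a hypothetical class for illustration, not the API of any of the databases named above, and the two-dimensional vectors are made up; real databases typically use approximate indexes so they don't have to scan every vector.

```python
import heapq

class InMemoryIndex:
    """Hypothetical brute-force vector store: exact search by scanning every vector."""

    def __init__(self):
        self._rows = []  # list of (doc_id, vector, text)

    def add(self, doc_id, vector, text):
        self._rows.append((doc_id, list(vector), text))

    def query(self, vector, k=3):
        """Return the k rows closest to `vector` as (distance, doc_id, text), nearest first."""
        def sq_dist(other):
            # Squared Euclidean distance; smaller means closer.
            return sum((a - b) ** 2 for a, b in zip(vector, other))
        scored = ((sq_dist(v), doc_id, text) for doc_id, v, text in self._rows)
        return heapq.nsmallest(k, scored)

index = InMemoryIndex()
index.add("a", [1.0, 0.0], "login flow")
index.add("b", [0.0, 1.0], "billing cycle")
index.add("c", [0.9, 0.1], "password reset")

for distance, doc_id, text in index.query([1.0, 0.0], k=2):
    print(doc_id, round(distance, 3), text)
```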
Step 3: Retrieval
When a user asks a question, embed the question using the same model and find the top-k most similar chunks:
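import chromadb

# Assumes a "docs" collection was already populated with the chunks and their
# Step 2 embeddings; the storage path is a placeholder.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_collection("docs")

question = "How does authentication work?"
# Embed the question with the same model used for the chunks
results = collection.query(
    query_embeddings=embed([question]),
    n_results=5
)
relevant_chunks = results["documents"][0]  # chunks for the first (and only) query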
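Cosine similarity is a common metric: the higher the score, the more semantically similar the chunk is to the query. Chroma reports distances rather than similarities, so lower values mean closer matches, and its default metric is squared L2 unless the collection is configured for cosine.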
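For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions) and also shows the distance form, commonly defined as 1 minus the similarity.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths; ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance form: 0 means same direction, larger means less similar."""
    return 1.0 - cosine_similarity(a, b)

# Made-up vectors for illustration only.
query = [1.0, 2.0, 0.0]
close = [2.0, 4.0, 0.1]   # nearly the same direction as the query
far = [0.0, 0.0, 3.0]     # orthogonal to the query

print(cosine_similarity(query, close))
print(cosine_similarity(query, far))
print(cosine_distance(query, close) < cosine_distance(query, far))
```

Because the score depends only on direction, scaling a vector does not change its similarity to the query.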
Step 4: Generation with Context
The retrieved chunks are injected into the prompt, and the LLM generates an answer grounded in the retrieved information:
# Join the retrieved chunks into one context block
context = "\n\n".join(relevant_chunks)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer questions using only the provided context. If the context doesn't contain the answer, say so."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: How does authentication work?"}
    ]
)
answer = response.choices[0].message.content
print(answer)
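In practice the prompt assembly is often factored into a helper. The sketch below is one possible shape, not a prescribed one: `build_messages` and its `max_chars` budget are hypothetical, the sample chunks are made up, and numbering the chunks is just one simple way to let the model point back at its sources.

```python
def build_messages(question, chunks, max_chars=2000):
    """Assemble chat messages with numbered context chunks, staying under a character budget."""
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_chars:
            break  # chunks are assumed to be sorted by relevance, so drop the tail
        selected.append(entry)
        used += len(entry)
    context = "\n\n".join(selected)
    system = (
        "Answer questions using only the provided context and cite chunk "
        "numbers like [1]. If the context doesn't contain the answer, say so."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Made-up chunks for illustration only.
messages = build_messages(
    "How does authentication work?",
    ["Users log in with a password.", "Sessions expire after 30 minutes."],
)
print(messages[1]["content"])
```

The returned list can be passed as the `messages` argument of a chat completion call like the one above.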
When to Use RAG
- Internal knowledge bases: Company wikis, documentation, policies.
- Up-to-date information: News, product catalogs, research papers.
- Domain-specific Q&A: Legal, medical, financial documents.
- Reducing hallucination: Grounding answers in source documents.
RAG vs Fine-Tuning
These are complementary, not competing approaches:
- RAG is best when the knowledge changes frequently, you need source attribution, or you have a large corpus.
- Fine-tuning is best when you need to change the model's behavior, tone, or format — not just its knowledge.
- Many production systems use both: a fine-tuned model with RAG for dynamic knowledge.
Common Pitfalls
- Bad chunking: Splitting mid-sentence or mid-paragraph destroys context.
- Wrong embedding model: Use the same model for indexing and querying.
- Too few results: Retrieving only 1-2 chunks may miss relevant info.
- No reranking: The top results from vector search aren't always the most useful — adding a reranker (like Cohere Rerank) helps.
- Ignoring evaluation: Measure retrieval quality (precision@k, recall) and generation quality (faithfulness, relevance) separately.
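The retrieval metrics in the last bullet are simple to compute once you have, for each test question, the ranked chunk IDs the retriever returned and a human-labeled set of relevant chunk IDs. A minimal sketch, with made-up IDs and labels, and with precision divided by k (one common convention):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

# Made-up labels for one query: the retriever's ranking and the human-judged relevant set.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

print(precision_at_k(retrieved, relevant, k=5))
print(recall_at_k(retrieved, relevant, k=5))
```

Averaging these over a set of test questions gives a retrieval score that can be tracked separately from generation quality.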
RAG is often the most practical way to make LLMs useful on private or fast-changing data. It's not glamorous, but it works, and that's what matters in production.