RAG Fundamentals
When you call index.query(vector=embed(question), top_k=5) on a vector database and five chunks of text come back, you’ve kicked off the most-deployed AI-engineering pattern of 2026: retrieval-augmented generation, which lets a model with a knowledge cutoff six months ago answer questions about your private docs from this morning. The mechanic looks like one library call, but five separate pieces of machinery line up underneath: a chunker decided how to split the source documents, an encoder embedded those chunks, an index sorted them by cosine similarity to your query, a reranker reordered the survivors, and a generator wove the winners into its answer. And what the retriever didn’t return, the more relevant chunk sitting at rank 12, is the failure mode you’ll spend the next month chasing.
This lesson builds the full pipeline (chunk, embed, retrieve, rerank, generate) with measurements at every stage. Done well, RAG turns a generic LLM into a domain expert overnight; done poorly, it hallucinates with footnotes. The single highest-leverage stage is the cross-encoder reranker: it adds roughly 100 ms of latency and lifts Recall@5 by 15 points on standard benchmarks. Get that one stage right and most “our RAG is bad” complaints disappear.
TL;DR
- RAG = retrieval + generation. Pull the most relevant chunks for a question, stuff them in the prompt, let the model answer.
- The pipeline: chunk documents → embed → store → retrieve top-k → re-rank → stuff into prompt → generate.
- Hybrid retrieval (dense + BM25 fused via RRF) beats either alone. Dense gets meaning; BM25 gets exact keywords.
- A cross-encoder re-ranker is the highest-leverage stage. It sees the query and chunks together and re-scores. Adds 100 ms, can lift Recall@5 by 15+ points.
- Chunking is underrated. Recursive character splitting at ~400 tokens with 50-token overlap is the sane default. Semantic chunking is sometimes worth it; rarely worth it from day one.
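The walkthrough below approximates token counts with word counts; a token-accurate chunker, assuming the tiktoken package (any tokenizer with encode/decode works), is a small change:

```python
import tiktoken

def chunk_by_tokens(text, size=400, overlap=50):
    # cl100k_base is an assumption here; match the tokenizer to your embedding model.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    i = 0
    while i < len(tokens):
        chunks.append(enc.decode(tokens[i:i + size]))
        i += size - overlap
    return chunks
```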
Why this matters
A frontier model’s knowledge cutoff is many months ago. Your private documents were never in training. Without retrieval, the model can only fabricate answers about both. RAG is the most-deployed pattern in AI engineering right now because it solves a real problem with off-the-shelf parts: give the model relevant information at inference time.
Mental model
Five stages, each with a knob. Most teams build the pipeline once, then tune the knobs forever.
The five stages, in code
```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss
from rank_bm25 import BM25Okapi
import numpy as np

# ---------- 1. Chunk ----------
def chunk_recursive(text, size=400, overlap=50):
    # Sliding window over words: `size` words per chunk, `overlap` words shared
    # between neighbors so no fact is stranded on a boundary.
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        out.append(" ".join(words[i:i + size]))
        i += size - overlap
    return out

# `documents` is a list of raw source strings (this walkthrough uses arXiv abstracts).
corpus = [chunk for doc in documents for chunk in chunk_recursive(doc)]

# ---------- 2. Embed ----------
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = embedder.encode(corpus, normalize_embeddings=True)

# ---------- 3. Store ----------
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine since normalized
index.add(embeddings)

# Also build a BM25 index for hybrid search.
bm25 = BM25Okapi([c.split() for c in corpus])

# ---------- 4. Retrieve (hybrid) ----------
def retrieve(query, k=20):
    k = min(k, len(corpus))  # guard: FAISS pads with -1 when k exceeds the index size
    qv = embedder.encode([query], normalize_embeddings=True)
    _, dense_ids = index.search(qv, k)
    dense_ids = dense_ids[0].tolist()
    bm25_scores = bm25.get_scores(query.split())
    bm25_ids = np.argsort(bm25_scores)[-k:][::-1].tolist()
    # Reciprocal Rank Fusion: each ranking contributes 1 / (60 + rank).
    rrf = {}
    for rank, idx in enumerate(dense_ids):
        rrf[idx] = rrf.get(idx, 0) + 1 / (60 + rank)
    for rank, idx in enumerate(bm25_ids):
        rrf[idx] = rrf.get(idx, 0) + 1 / (60 + rank)
    fused = sorted(rrf.items(), key=lambda kv: -kv[1])[:k]
    return [corpus[i] for i, _ in fused]

# ---------- 5. Rerank ----------
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query, candidates, top_n=5):
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_n]]

# ---------- 6. Generate ----------
def answer(query):
    candidates = retrieve(query, k=20)
    top = rerank(query, candidates, top_n=5)
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(top))
    prompt = f"Use only the context to answer. Cite [n] for each claim.\n\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    return call_llm(prompt)  # call_llm: your LLM client; not defined in this lesson
```

Each stage has knobs: chunk size, chunk overlap, embedding model, vector index type, top-k, the RRF constant (60 above), reranker top-n, and the prompt template. The whole pipeline is ~80 lines and zero LangChain.
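To smoke-test the pipeline, documents and call_llm (the two names the listing leaves undefined) need bindings before it runs. A minimal stub, with contents invented for illustration:

```python
# Two toy "documents"; in the lesson proper these are arXiv abstracts.
documents = [
    "FAISS IndexFlatIP performs exact inner-product search over dense vectors.",
    "BM25 ranks documents using term frequency and inverse document frequency.",
]

def call_llm(prompt):
    # Stand-in for a real LLM client; swap in your provider's SDK here.
    print(prompt)
    return "(model answer)"

# Run the listing above after these definitions, then:
print(answer("How does BM25 score documents?"))
```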
Real numbers
On the SciFact benchmark (claim verification, smaller is harder):
| Stage | Recall@5 | p50 latency |
|---|---|---|
| Dense only | 64% | 20 ms |
| BM25 only | 58% | 5 ms |
| Hybrid + RRF | 71% | 25 ms |
| Hybrid + Rerank | 86% | 130 ms |
The reranker is doing 15 percentage points of work for ~100 ms of latency. It’s the single highest-leverage component in the stack.
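Recall@5 here means the gold chunk appears somewhere in the top five results. A sketch of the measurement, assuming a hypothetical retrieve_ids helper that returns ranked chunk indices rather than text (the listing above returns text, so you’d wrap it):

```python
def recall_at_k(eval_set, retrieve_ids, k=5):
    # eval_set: list of (query, gold_chunk_index) pairs you label yourself.
    hits = sum(gold in retrieve_ids(query)[:k] for query, gold in eval_set)
    return hits / len(eval_set)
```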
Run it in your browser
A pure-Python RAG demo using cosine similarity over hand-crafted embeddings. No external APIs.
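The embedded demo is not reproduced here, but its core mechanism fits in a few lines. A hand-rolled sketch with toy three-dimensional “embeddings” (every value below is invented for illustration):

```python
import math

# Toy vectors; read the dimensions as (animal-ness, food-ness, tech-ness).
chunks = {
    "Cats are small domesticated carnivores.": [0.9, 0.1, 0.0],
    "Pizza is a baked flatbread with toppings.": [0.1, 0.9, 0.0],
    "FAISS is a library for vector search.": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_vec = [0.8, 0.2, 0.1]  # pretend this embeds "tell me about cats"
best = max(chunks, key=lambda c: cosine(query_vec, chunks[c]))
print(best)  # -> Cats are small domesticated carnivores.
```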
Key takeaways
- Build the simplest RAG first (recursive chunking + dense retrieval + top-3) before reaching for advanced techniques.
- Hybrid + rerank is the real default for production. Plain dense retrieval has well-known failure modes on numbers, code, and exact identifiers.
- Cite your context. Always instruct the model to attribute claims to context indices. It dramatically reduces hallucinations and gives users something to verify.
- Eval honestly. Use a real eval set (50–200 questions with reference answers) and metrics like faithfulness and answer-relevance; vibes are not enough. A minimal harness is sketched after this list.
- Chunking changes everything. When RAG is bad, chunking is the most likely cause; only after fixing chunking should you tune the retriever.
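A minimal faithfulness harness, assuming an LLM-as-judge; the judge prompt, the eval_set shape, and the reuse of call_llm are illustrative choices, not a standard API:

```python
eval_set = [
    {"question": "What does RRF stand for?", "reference": "Reciprocal Rank Fusion"},
    # ...aim for 50-200 of these, drawn from real user questions.
]

def is_faithful(question, context, answer_text):
    # LLM-as-judge: flag answers that make claims the context does not support.
    verdict = call_llm(
        "Does the answer make any claim not supported by the context? "
        "Reply YES or NO.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer_text}"
    )
    return verdict.strip().upper().startswith("NO")

def eval_faithfulness(eval_set):
    faithful = 0
    for ex in eval_set:
        # Rebuild the same context that answer() retrieves internally.
        ctx = "\n\n".join(rerank(ex["question"], retrieve(ex["question"])))
        faithful += is_faithful(ex["question"], ctx, answer(ex["question"]))
    return faithful / len(eval_set)
```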
Go deeper
- Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP. The original RAG paper; foundational.
- Paper: Retrieval-Augmented Generation for LLMs: A Survey. The most current literature review; use as a reference.
- Blog: Pinecone RAG learn series. Strong intro material; ignore the implicit Pinecone bias.
- Docs: LlamaIndex documentation. The framework that wraps the most RAG idioms; even if you write from scratch, the docs are a useful design reference.
- Video: Karpathy, "Let's build GPT". Not RAG-specific, but the embedding section is unmatched.
- Repo: microsoft/graphrag. When you outgrow chunk-and-embed, this is the path forward.