
RAG Fundamentals

When you call index.query(vector=embed(question), top_k=5) on a vector database and back come five chunks of text, you’ve kicked off the most-deployed AI-engineering pattern of 2026: retrieval-augmented generation, the technique that lets a model with a knowledge cutoff six months ago answer questions about your private docs from this morning. It looks like one library call, but several pieces of machinery line up underneath: a chunker decided how to split the source documents, an encoder embedded those chunks, an index sorted them by cosine similarity to your query, and what the index didn’t return (the more relevant chunk sitting at rank 12) is the failure mode you’ll spend the next month chasing.

This lesson builds the full pipeline (chunk, embed, retrieve, rerank, generate) with measurements at every stage. Done well, RAG turns a generic LLM into a domain expert overnight; done poorly, it hallucinates with footnotes. The single highest-leverage stage is the cross-encoder reranker, which adds ~100 ms of latency and lifts Recall@5 by 15 points on standard benchmarks. Get that one right and most “our RAG is bad” complaints disappear.

TL;DR

  • RAG = retrieval + generation. Pull the most relevant chunks for a question, stuff them in the prompt, let the model answer.
  • The pipeline: chunk documents → embed → store → retrieve top-k → re-rank → stuff into prompt → generate.
  • Hybrid retrieval (dense + BM25 fused via RRF) beats either alone. Dense gets meaning; BM25 gets exact keywords.
  • A cross-encoder re-ranker is the highest-leverage stage. It sees the query and chunks together and re-scores. Adds 100 ms, can lift Recall@5 by 15+ points.
  • Chunking is underrated. Recursive character splitting at ~400 tokens with 50-token overlap is the sane default. Semantic chunking is sometimes worth it, but rarely from day one.

Mental model

Five stages, each with a knob. Most teams build the diagram once, then tune the knobs forever.

The five stages, in code

from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss
from rank_bm25 import BM25Okapi
import numpy as np

# ---------- 1. Chunk ----------
def chunk_recursive(text, size=400, overlap=50):
    # Simple overlapping word windows (a stand-in for recursive character splitting).
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        out.append(" ".join(words[i:i + size]))
        i += size - overlap
    return out

# `documents` is your list of raw source texts, loaded elsewhere.
corpus = [chunk for doc in documents for chunk in chunk_recursive(doc)]

# ---------- 2. Embed ----------
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = embedder.encode(corpus, normalize_embeddings=True)

# ---------- 3. Store ----------
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine since normalized
index.add(embeddings)

# Also build a BM25 index for hybrid search.
bm25 = BM25Okapi([c.split() for c in corpus])

# ---------- 4. Retrieve (hybrid) ----------
def retrieve(query, k=20):
    qv = embedder.encode([query], normalize_embeddings=True)
    _, dense_ids = index.search(qv, k)
    dense_ids = dense_ids[0].tolist()
    bm25_scores = bm25.get_scores(query.split())
    bm25_ids = np.argsort(bm25_scores)[-k:][::-1].tolist()
    # Reciprocal Rank Fusion: 1 / (60 + rank)
    rrf = {}
    for rank, idx in enumerate(dense_ids):
        rrf[idx] = rrf.get(idx, 0) + 1 / (60 + rank)
    for rank, idx in enumerate(bm25_ids):
        rrf[idx] = rrf.get(idx, 0) + 1 / (60 + rank)
    fused = sorted(rrf.items(), key=lambda kv: -kv[1])[:k]
    return [corpus[i] for i, _ in fused]

# ---------- 5. Rerank ----------
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query, candidates, top_n=5):
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:top_n]]

# ---------- 6. Generate ----------
def answer(query):
    candidates = retrieve(query, k=20)
    top = rerank(query, candidates, top_n=5)
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(top))
    prompt = (
        f"Use only the context to answer. Cite [n] for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    )
    return call_llm(prompt)  # call_llm wraps whatever LLM API you use

Each stage has knobs: chunk size, chunk overlap, embedding model, vector index type, top-k, RRF constant k=60, reranker top-n, prompt template. The whole pipeline is ~80 lines and zero LangChain.
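
For reference, all of those knobs fit in one small config object, which makes experiments reproducible. A minimal sketch (the dataclass and its field names are illustrative, not part of any library):

from dataclasses import dataclass

@dataclass
class RAGConfig:
    # Chunking
    chunk_size: int = 400          # target chunk length in tokens (approximated as words above)
    chunk_overlap: int = 50
    # Embedding and storage
    embedding_model: str = "BAAI/bge-small-en-v1.5"
    index_type: str = "flat_ip"    # exact inner-product search, as above
    # Retrieval
    retrieve_k: int = 20           # candidates pulled before reranking
    rrf_k: int = 60                # constant in 1 / (rrf_k + rank)
    # Reranking and generation
    reranker_model: str = "BAAI/bge-reranker-v2-m3"
    rerank_top_n: int = 5          # chunks that reach the prompt

Log the config alongside every eval run so the numbers in the next section stay comparable as you tune.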

Real numbers

On the SciFact benchmark (claim verification, smaller is harder):

Stage              Recall@5   p50 latency
Dense only         64%        20 ms
BM25 only          58%        5 ms
Hybrid + RRF       71%        25 ms
Hybrid + Rerank    86%        130 ms

The reranker is doing 15 percentage points of work for ~100 ms of latency. It’s the single highest-leverage component in the stack.
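
You can reproduce this kind of table on your own corpus with nothing more than labeled (question, relevant chunk) pairs and a Recall@k loop. A minimal sketch, assuming the retrieve and rerank functions above and a hypothetical eval_set you supply:

import time

def recall_at_k(eval_set, k=5, use_reranker=True):
    # eval_set: list of (question, set_of_relevant_chunk_texts) pairs.
    hits, latencies = 0, []
    for question, relevant in eval_set:
        start = time.perf_counter()
        candidates = retrieve(question, k=20)
        top = rerank(question, candidates, top_n=k) if use_reranker else candidates[:k]
        latencies.append(time.perf_counter() - start)
        if any(chunk in relevant for chunk in top):
            hits += 1
    p50 = sorted(latencies)[len(latencies) // 2]
    return hits / len(eval_set), p50

# Compare the two configurations from the table above on your own data:
# recall_at_k(eval_set, use_reranker=False)   # hybrid + RRF only
# recall_at_k(eval_set, use_reranker=True)    # hybrid + rerank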

Run it in your browser

A pure-Python RAG demo using cosine similarity over hand-crafted embeddings. No external APIs.

Python (editable): toy RAG so you can see the pipeline end-to-end.
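
If the editor doesn’t load, the sketch below is the same idea in plain Python: the “embeddings” are hand-rolled bag-of-words vectors over a made-up three-document corpus, and retrieval is cosine similarity, nothing more.

import math

docs = {
    "chunking": "recursive splitting cuts documents into overlapping chunks of roughly equal size",
    "embedding": "an encoder maps each chunk to a vector so similar text lands near similar text",
    "reranking": "a cross encoder scores the query together with each candidate and reorders them",
}

# Fixed vocabulary built from the corpus; query words outside it are ignored.
vocab = sorted({w for d in docs.values() for w in d.split()})

def embed(text):
    counts = [0] * len(vocab)
    for w in text.split():
        if w in vocab:
            counts[vocab.index(w)] += 1
    return counts

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_toy(query, k=2):
    qv = embed(query)
    return sorted(docs.items(), key=lambda kv: -cosine(qv, embed(kv[1])))[:k]

print(retrieve_toy("how does the cross encoder reorder each candidate"))

Swap embed for a real sentence-transformer and the rest of the loop is unchanged; that interchangeability is the point of the exercise.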

Quick check

Your RAG system retrieves the right document at rank 12, but you only feed the top 3 to the model. What stage do you fix?

Key takeaways

  1. Build the simplest RAG first (recursive chunking + dense retrieval + top-3) before reaching for advanced techniques.
  2. Hybrid + rerank is the real default for production. Plain dense retrieval has well-known failure modes on numbers, code, and exact identifiers.
  3. Cite your context. Always instruct the model to attribute claims to context indices. It dramatically reduces hallucinations and gives users something to verify.
  4. Eval honestly. Use a real eval set (50–200 questions with reference answers) and metrics like faithfulness + answer-relevance; a sketch of such a loop follows this list. Vibes are not enough.
  5. Chunking changes everything. When RAG is bad, chunking is the most likely cause; only after fixing chunking should you tune the retriever.
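
A minimal sketch of that eval loop, assuming the answer function above. Citation coverage is cheap to compute directly; faithfulness and answer-relevance need a judge (another LLM call or a human rubric), so they are left as commented stubs here:

import re

def evaluate(eval_set):
    # eval_set: list of {"question": ..., "reference": ...} dicts.
    results = []
    for item in eval_set:
        generated = answer(item["question"])
        # Citation coverage: share of sentences carrying at least one [n] marker.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", generated) if s.strip()]
        cited = sum(1 for s in sentences if re.search(r"\[\d+\]", s))
        results.append({
            "question": item["question"],
            "citation_coverage": cited / max(len(sentences), 1),
            # "faithfulness": judge(generated, retrieved_context),      # is every claim supported?
            # "answer_relevance": judge(generated, item["reference"]),  # does it answer the question?
        })
    return results

Fifty questions scored this way will tell you more than any number of demo queries.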

Go deeper
