Advanced RAG — HyDE, GraphRAG, Agentic Retrieval
Naive RAG is one cosine-similarity sort: `top_5 = index.query(vector=embed(question), top_k=5)`. Production RAG is a pipeline — sometimes a tree. An LLM rewrites the user query into a one-paragraph fake answer (the HyDE pattern) so the embedding lands in the document’s vocabulary instead of the user’s jargon. A planner decomposes one question into 2–4 sub-questions retrieved separately. A graph index returns community summaries instead of chunks for “what did the authors conclude across their five papers?” style questions. A loop lets the model issue its own searches when it doesn’t know which sub-query to ask first.
Each of these is a response to a specific failure mode of naive RAG — vocab mismatch, multi-hop, global questions, open-ended exploration. When your eval shows RAG failing, the diagnostic question is which failure mode you’re hitting: that determines which technique to reach for. Throwing all four at every query is how teams burn 8× the cost for 2 percentage points of accuracy. This lesson covers the four most-deployed techniques in 2026, the failure mode each one fixes, and when each one hurts.
TL;DR
- Naive RAG fails on multi-hop questions, vague queries, and entity-centric corpora. Symptom: high retrieval recall on facts, low task accuracy.
- HyDE generates a hypothetical answer with the LLM, embeds it, then retrieves — bridges the query/document phrasing gap.
- Query decomposition breaks one question into 2–4 sub-queries, retrieves for each, then reasons over the union. Best for multi-hop.
- GraphRAG (Microsoft, 2024) builds a knowledge graph from the corpus, then summarizes communities — beats vector search on “global” questions a chunk can’t answer alone.
- Agentic retrieval lets the model decide its own searches in a loop. Highest ceiling, highest cost, hardest to debug.
The four failure modes
Naive RAG (embed → top-k → stuff in prompt → answer) hits a ceiling fast. The failures look like:
- “What does the 2023 paper compare to the 2021 baseline?” → multi-hop
- “Tell me about token routing” → vague query, all chunks score similar
- “What did the authors conclude across their five papers?” → global question, no single chunk has the answer
- “What is mentioned in the section before the conclusion?” → structural reference
Each is a known failure mode with a known fix. The four below are the most-deployed responses in 2026 production RAG.
Mental model
These aren’t mutually exclusive — production systems often combine HyDE for vocab mismatch and decomposition for multi-hop, with the agent loop as fallback for the rest.
HyDE — bridge the query/document vocab gap
The query “How does GQA reduce memory?” is short and uses jargon. The actual document chunk says “Grouped-query attention shares key and value projections across head groups, reducing KV-cache size by the ratio of query heads to key/value groups.” Cosine sim between query and chunk is mediocre; cosine sim between a fake answer and the chunk is excellent.
```python
def hyde(query: str, llm, embedder, retriever) -> list[str]:
    # Generate a hypothetical answer, embed it, retrieve against that embedding.
    fake_answer = llm.generate(
        f"Write a one-paragraph answer to: {query}\n\n"
        f"It's OK to be wrong on details — the goal is realistic phrasing."
    )
    return retriever.retrieve(embedder.encode(fake_answer), k=10)
```

When it helps: short queries on long technical chunks, queries in a different language than the corpus, queries written by users who don’t know the field’s vocabulary.
When it hurts: the LLM hallucinates a wrong premise, biasing retrieval toward irrelevant docs. Mitigate by retrieving against both the query and the HyDE answer, fusing with RRF (reciprocal rank fusion).
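A minimal sketch of that mitigation, assuming the same injected `llm`, `embedder`, and `retriever` as above and a retriever that returns chunks (as strings or hashable IDs) in ranked order:

```python
def hyde_with_rrf(query: str, llm, embedder, retriever, k: int = 10, rrf_k: int = 60):
    fake_answer = llm.generate(f"Write a one-paragraph answer to: {query}")
    rankings = [
        retriever.retrieve(embedder.encode(query), k=k),        # literal query
        retriever.retrieve(embedder.encode(fake_answer), k=k),  # HyDE answer
    ]
    scores: dict = {}
    for ranking in rankings:
        for rank, chunk in enumerate(ranking):
            # Reciprocal rank fusion: each list votes 1 / (rrf_k + rank).
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Even if the fake answer is built on a wrong premise, chunks that the literal query also ranks highly stay near the top.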
Query decomposition
```python
import json

def decompose_and_retrieve(query: str, llm, retriever):
    plan = llm.generate(
        f"Break this question into the smallest sub-questions whose answers can be combined.\n"
        f"Question: {query}\nReturn JSON list."
    )
    sub_qs = json.loads(plan)
    contexts = []
    for q in sub_qs:
        contexts.extend(retriever.retrieve(q, k=3))
    return contexts, sub_qs
```

For “Compare DeepSeek-V3’s MLA to standard GQA on memory and quality,” decomposition produces:
- What is MLA?
- What is standard GQA?
- What’s the memory cost of MLA vs GQA?
- What’s the quality cost of MLA vs GQA?
Retrieving 3 chunks per sub-question and combining beats retrieving 12 chunks for the original. Recall@3 is much higher per sub-question than recall@12 for an over-complex query.
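The “reason over the union” step is just one more generation over the pooled contexts. A minimal sketch, assuming `retriever.retrieve` returns text chunks:

```python
def decompose_and_answer(query: str, llm, retriever) -> str:
    contexts, sub_qs = decompose_and_retrieve(query, llm, retriever)
    context_block = "\n\n".join(contexts)
    return llm.generate(
        f"Sub-questions: {sub_qs}\n"
        f"Context:\n{context_block}\n\n"
        f"Using only the context above, answer: {query}"
    )
```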
GraphRAG — for global / corpus-wide questions
Microsoft’s GraphRAG (2024) inverts the indexing strategy:
- Index time: extract (entity, relation, entity) triples from each chunk; cluster the resulting graph into communities (Leiden algorithm); have the LLM summarize each community.
- Query time: retrieve community summaries (not chunks) and answer over those.
```python
# Conceptual — see the paper for the actual pipeline
graph = extract_entities_and_relations(corpus)
communities = leiden_cluster(graph, levels=[0, 1, 2])  # multi-resolution
for c in communities:
    c.summary = llm.summarize(c.nodes_and_edges, target_words=400)

def graphrag_query(query, communities):
    relevant = [c for c in communities if relevance(query, c.summary) > THRESHOLD]
    return llm.synthesize(query, [c.summary for c in relevant])
```

Best for: corpora where the answer requires aggregating across many documents (annual reports across years, themes across a paper collection, cross-document narrative arcs). Bad for: factual lookups inside a single document — GraphRAG is overkill and the community summary loses precision.
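The `relevance()` call above is doing real work. A simple version, sketched here (not the paper’s actual scoring; assumes the same `embedder` used for chunks is in scope), scores each community summary against the query by cosine similarity:

```python
import numpy as np

def relevance(query: str, summary: str) -> float:
    # Cosine similarity between the query and one community summary.
    q = embedder.encode(query)
    s = embedder.encode(summary)
    return float(np.dot(q, s) / (np.linalg.norm(q) * np.linalg.norm(s)))
```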
Agentic retrieval
Let the model issue its own searches in a loop:
```python
import json

def agentic_retrieve(question, retriever, llm, max_steps=4):
    notes = []
    for step in range(max_steps):
        plan = llm.generate(
            f"Question: {question}\n"
            f"Notes so far: {notes}\n"
            f"Either output JSON {{search: '...'}} or {{answer: '...'}}."
        )
        plan = json.loads(plan)
        if 'answer' in plan:
            return plan['answer']
        chunks = retriever.retrieve(plan['search'], k=5)
        # summarize() is any helper that compresses chunks into a short note.
        notes.append({'search': plan['search'], 'results': summarize(chunks)})
    return llm.generate(f"Question: {question}\nNotes: {notes}\nAnswer:")
```

This is what powers most production “deep research” agents. Highest ceiling — the model can iteratively zoom in. Highest cost — N model calls per question. Hardest to evaluate — every trajectory differs.
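One practical gotcha: `json.loads` throws whenever the model wraps its JSON in prose. A common hardening, sketched here (not tied to any particular library), is a best-effort extraction of the outermost JSON object before giving up:

```python
import json

def parse_plan(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span if the model added surrounding prose.
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            return json.loads(raw[start:end + 1])
        raise
```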
Real-world numbers (HotpotQA multi-hop benchmark)
| Technique | Recall@5 | EM accuracy | Cost (relative) |
|---|---|---|---|
| Naive RAG | 38% | 28% | 1× |
| Hybrid + Rerank | 51% | 36% | 1.2× |
| HyDE | 56% | 41% | 2× |
| Query decomposition | 64% | 52% | 3× |
| Agentic (4 steps) | 72% | 58% | 8× |
| GraphRAG (when applicable) | — | 64% | varies |
(Numbers are illustrative — published benchmarks for each technique vary by corpus.)
The right pick is task-dependent. Default to hybrid + rerank (see RAG Fundamentals); reach for these only when you’ve measured failure modes that match.
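In practice that means a router in front of the retrieval stack. A minimal sketch, assuming the functions defined above plus a hypothetical `hybrid_rerank()` cheap path and an `llm` client in scope (each path returns whatever it produces; a real router would normalize that):

```python
ROUTES = {
    "simple":     lambda q: hybrid_rerank(q),                          # cheap default path
    "vocab_gap":  lambda q: hyde(q, llm, embedder, retriever),         # HyDE
    "multi_hop":  lambda q: decompose_and_retrieve(q, llm, retriever)[0],
    "open_ended": lambda q: agentic_retrieve(q, retriever, llm),       # agent loop
}

def route(query: str):
    label = llm.generate(
        f"Classify this query as one of {list(ROUTES)}. Answer with the label only.\n"
        f"Query: {query}"
    ).strip()
    return ROUTES.get(label, ROUTES["simple"])(query)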
Run it in your browser — see HyDE close the embedding gap
The mechanic: a short jargon query has poor cosine sim with a long technical chunk; a hypothetical answer in the chunk’s register matches it much better. We hard-code the fake answer (in production an LLM generates it) and watch the score jump.
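A minimal offline version of that demo, assuming `sentence-transformers` is installed (any embedding model shows the same effect; the fake answer is hard-coded):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunk = ("Grouped-query attention shares key and value projections across head "
         "groups, reducing KV-cache size by the ratio of query heads to groups.")
query = "How does GQA reduce memory?"
hyde_answer = ("GQA reduces memory by sharing key and value projections across "
               "groups of attention heads, shrinking the KV cache.")

c, q, h = model.encode([chunk, query, hyde_answer])
print("query vs chunk:", util.cos_sim(q, c).item())
print("hyde  vs chunk:", util.cos_sim(h, c).item())
```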
The takeaway: HyDE works because embedding-space proximity is shaped by vocabulary as much as by semantics. The fake answer brings the query into the chunk’s neighborhood. RRF-fusing both the query and `hyde_answer` retrievals is the production pattern — robust when the LLM hallucinates a wrong premise.
Key takeaways
- Naive RAG has 4 known failure modes. Diagnose which you’re hitting before reaching for a technique.
- HyDE for vocab mismatch, decomposition for multi-hop, GraphRAG for global questions, agents for open-ended exploration. Match the technique to the failure.
- Always layer on top of hybrid + rerank, not as a replacement. Naive retrieval still does most of the work.
- Cost grows fast. A 4-step agent costs 8× a single retrieval. Production systems route easy queries to the cheap path and reserve advanced techniques for the hard ones.
- Eval honestly. Build a 50-question test set with the failure modes labeled. Vibes-based RAG tuning is how teams ship regressions.
Go deeper
- Paper: “Precise Zero-Shot Dense Retrieval without Relevance Labels” (HyDE). The original HyDE paper. Short and clear.
- Paper: “From Local to Global: A GraphRAG Approach to Query-Focused Summarization”. The GraphRAG paper. Read it after trying naive RAG and feeling its limits.
- Paper: “Self-RAG: Learning to Retrieve, Generate, and Critique”. Trains the model to decide *when* to retrieve. Influential for agentic patterns.
- Repo: microsoft/graphrag. Reference implementation. Read `graphrag/index/` for how the indexing pipeline actually runs.
- Blog: Contextual Retrieval. A simpler alternative to GraphRAG: prepend a short LLM-generated context to each chunk before embedding. Surprisingly strong baseline.
- Docs: LlamaIndex — Advanced Retrieval. Practical recipes for each technique with working code.
- Video: Greg Kamradt — RAG From Scratch. A 12-part series. Episodes 5–9 cover the techniques here with notebook walkthroughs.