
Advanced RAG — HyDE, GraphRAG, Agentic Retrieval

Naive RAG is one cosine-similarity sort: top_5 = index.query(vector=embed(question), top_k=5). Production RAG is a pipeline — sometimes a tree. An LLM rewrites the user query into a one-paragraph fake answer (HyDE) so the embedding lands in the document’s vocabulary instead of the user’s jargon. A planner decomposes one question into 2–4 sub-questions retrieved separately. A graph index returns community summaries instead of chunks for “what did the authors conclude across their five papers?”-style questions. A loop lets the model issue its own searches when it doesn’t know which sub-query to ask first.

Each of these is a response to a specific failure mode of naive RAG — vocab mismatch, multi-hop, global questions, open-ended exploration. When your eval shows RAG failing, the diagnostic question is which failure mode you're hitting: that determines which technique to reach for. Throwing all four at every query is how teams burn 8× the cost for 2 percentage points of accuracy. This lesson covers the four most-deployed techniques in 2026, the failure each one fixes, and when each one hurts.

TL;DR

  • Naive RAG fails on multi-hop questions, vague queries, and entity-centric corpora. Symptom: high retrieval recall on facts, low task accuracy.
  • HyDE generates a hypothetical answer with the LLM, embeds it, then retrieves — bridges the query/document phrasing gap.
  • Query decomposition breaks one question into 2–4 sub-queries, retrieves for each, then reasons over the union. Best for multi-hop.
  • GraphRAG (Microsoft, 2024) builds a knowledge graph from the corpus, then summarizes communities — beats vector search on “global” questions a chunk can’t answer alone.
  • Agentic retrieval lets the model decide its own searches in a loop. Highest ceiling, highest cost, hardest to debug.

The four failure modes

Naive RAG (embed → top-k → stuff in prompt → answer) hits a ceiling fast. The failures look like:

  • “What does the 2023 paper compare to the 2021 baseline?” → multi-hop
  • “Tell me about token routing” → vague query, all chunks score similarly
  • “What did the authors conclude across their five papers?” → global question, no single chunk has the answer
  • “What is mentioned in the section before the conclusion?” → structural reference

Each is a known failure mode with a known fix. The four below are the most-deployed responses in 2026 production RAG.

Mental model

These aren’t mutually exclusive — production systems often combine HyDE for vocab mismatch and decomposition for multi-hop, with the agent loop as fallback for the rest.
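One way to make that concrete is a small router. The sketch below is illustrative, not from the lesson: the classifier prompt, the failure-mode labels, and the handler names are assumptions, and the handlers are whatever implementations you wire in.

FAILURE_MODES = ["vocab_mismatch", "multi_hop", "global", "open_ended", "none"]

def route_query(query, llm, baseline, hyde, decompose, graphrag, agent):
    # One cheap classification call decides which path the query takes.
    label = llm.generate(
        f"Which retrieval failure mode does this query most likely hit?\n"
        f"Options: {FAILURE_MODES}\nQuery: {query}\nReturn only the label."
    ).strip()
    handlers = {
        "vocab_mismatch": hyde,   # short jargon query vs long technical chunks
        "multi_hop": decompose,   # answer spans several facts
        "global": graphrag,       # corpus-wide / thematic question
        "open_ended": agent,      # exploration, unclear sub-queries
    }
    # Anything unrecognized falls through to plain hybrid + rerank.
    return handlers.get(label, baseline)(query)

Easy queries stay on the cheap baseline path; only queries labeled with a failure mode pay for the heavier technique.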

HyDE — bridge the query/document vocab gap

The query “How does GQA reduce memory?” is short and uses jargon. The actual document chunk says “Grouped-query attention shares key and value projections across head groups, reducing cache size by a factor of H/G.” Cosine sim between query and chunk is mediocre; cosine sim between fake answer and chunk is excellent.

def hyde(query: str, llm, embedder, retriever) -> list[str]:
    fake_answer = llm.generate(
        f"Write a one-paragraph answer to: {query}\n\n"
        f"It's OK to be wrong on details — the goal is realistic phrasing."
    )
    return retriever.retrieve(embedder.encode(fake_answer), k=10)

When it helps: short queries on long technical chunks, queries in a different language than the corpus, queries written by users who don’t know the field’s vocabulary.

When it hurts: the LLM hallucinates a wrong premise, biasing retrieval toward irrelevant docs. Mitigate by retrieving against both the query and the HyDE answer, fusing with RRF (reciprocal rank fusion).
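A minimal sketch of that mitigation, assuming the retriever returns ranked, hashable chunk IDs or strings and using the common RRF constant of 60 (neither detail is specified in the lesson):

def hyde_with_rrf(query, llm, embedder, retriever, k=10, rrf_k=60):
    fake_answer = llm.generate(f"Write a one-paragraph answer to: {query}")
    # Two ranked lists: one from the raw query, one from the hypothetical answer.
    runs = [
        retriever.retrieve(embedder.encode(query), k=k),
        retriever.retrieve(embedder.encode(fake_answer), k=k),
    ]
    # Reciprocal rank fusion: a doc ranked highly in either list scores well overall.
    scores = {}
    for run in runs:
        for rank, doc in enumerate(run):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:k]

If the fake answer is built on a wrong premise, the raw-query run still anchors the fused ranking.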

Query decomposition

import json

def decompose_and_retrieve(query: str, llm, retriever):
    # Ask the LLM to plan sub-questions, returned as a JSON list of strings.
    plan = llm.generate(
        f"Break this question into the smallest sub-questions whose answers can be combined.\n"
        f"Question: {query}\nReturn JSON list."
    )
    sub_qs = json.loads(plan)
    contexts = []
    for q in sub_qs:
        contexts.extend(retriever.retrieve(q, k=3))  # a few chunks per sub-question
    return contexts, sub_qs

For “Compare DeepSeek-V3’s MLA to standard GQA on memory and quality,” decomposition produces:

  1. What is MLA?
  2. What is standard GQA?
  3. What’s the memory cost of MLA vs GQA?
  4. What’s the quality cost of MLA vs GQA?

Retrieving 3 chunks per sub-question and combining beats retrieving 12 chunks for the original. Recall@3 is much higher per sub-question than recall@12 for an over-complex query.
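The synthesis step over the union is left implicit above. A hedged sketch, building on decompose_and_retrieve and assuming chunks come back as strings (the prompt wording and dedup-by-text choice are illustrative):

def decompose_answer(query, llm, retriever):
    contexts, sub_qs = decompose_and_retrieve(query, llm, retriever)
    # Drop duplicate chunks retrieved by more than one sub-question, preserving order.
    unique_contexts = list(dict.fromkeys(contexts))
    return llm.generate(
        f"Question: {query}\n"
        f"Sub-questions considered: {sub_qs}\n"
        f"Context:\n" + "\n---\n".join(unique_contexts) + "\n"
        f"Answer the original question using only the context above."
    )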

GraphRAG — for global / corpus-wide questions

Microsoft’s GraphRAG (2024) inverts the indexing strategy:

  1. Index time: extract (entity, relation, entity) triples from each chunk; cluster the resulting graph into communities (Leiden algorithm); have the LLM summarize each community.
  2. Query time: retrieve community summaries (not chunks) and answer over those.
# Conceptual — see the paper for the actual pipeline
graph = extract_entities_and_relations(corpus)
communities = leiden_cluster(graph, levels=[0, 1, 2])  # multi-resolution
for c in communities:
    c.summary = llm.summarize(c.nodes_and_edges, target_words=400)

def graphrag_query(query, communities):
    relevant = [c for c in communities if relevance(query, c.summary) > THRESHOLD]
    return llm.synthesize(query, [c.summary for c in relevant])

Best for: corpora where the answer requires aggregating across many documents (annual reports across years, themes across a paper collection, cross-document narrative arcs). Bad for: factual lookups inside a single document — GraphRAG is overkill and the community summary loses precision.
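The heavy lifting in that sketch hides inside extract_entities_and_relations. A minimal LLM-extractor version might look like the following; the prompt, the JSON shape, and the extra llm argument are assumptions, not the paper's exact schema.

import json

def extract_entities_and_relations(corpus, llm):
    # One list of (entity, relation, entity) triples per chunk; their union forms the graph,
    # e.g. ("GraphRAG", "published_by", "Microsoft"), ("MLA", "reduces", "KV cache size").
    triples = []
    for chunk in corpus:
        raw = llm.generate(
            "Extract (entity, relation, entity) triples from the text below.\n"
            "Return a JSON list of 3-element lists.\n\n" + chunk
        )
        triples.extend(tuple(t) for t in json.loads(raw))
    return triples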

Agentic retrieval

Let the model issue its own searches in a loop:

import json

def agentic_retrieve(question, retriever, llm, max_steps=4):
    notes = []
    for step in range(max_steps):
        plan = llm.generate(
            f"Question: {question}\n"
            f"Notes so far: {notes}\n"
            f"Either output JSON {{search: '...'}} or {{answer: '...'}}."
        )
        plan = json.loads(plan)
        if 'answer' in plan:
            return plan['answer']
        chunks = retriever.retrieve(plan['search'], k=5)
        notes.append({'search': plan['search'], 'results': summarize(chunks)})
    # Step budget exhausted: answer from whatever notes were gathered.
    return llm.generate(f"Question: {question}\nNotes: {notes}\nAnswer:")

This is what powers most production “deep research” agents. Highest ceiling — the model can iteratively zoom in. Highest cost — N model calls per question. Hardest to evaluate — every trajectory differs.

Real-world numbers (HotpotQA multi-hop benchmark)

Technique                  | Recall@5 | EM accuracy | Cost (relative)
Naive RAG                  | 38%      | 28%         | n/a
Hybrid + Rerank            | 51%      | 36%         | 1.2×
HyDE                       | 56%      | 41%         | n/a
Query decomposition        | 64%      | 52%         | n/a
Agentic (4 steps)          | 72%      | 58%         | n/a
GraphRAG (when applicable) | 64%      | varies      | n/a

(Numbers are illustrative — published benchmarks for each technique vary by corpus.)

The right pick is task-dependent. Default to hybrid + rerank (see RAG Fundamentals); reach for these only when you’ve measured failure modes that match.

Run it in your browser — see HyDE close the embedding gap

The mechanic: a short jargon query has poor cosine sim with a long technical chunk; a hypothetical answer in the chunk’s register matches it much better. We hard-code the fake answer (in production an LLM generates it) and watch the score jump.

Python (editable in the original page): a toy bag-of-words 'embedder' showing why HyDE retrieves better than the raw query.
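The embedded demo doesn't carry over here, so below is a self-contained sketch of the same mechanic; the chunk text, query, and hard-coded fake answer are illustrative stand-ins.

from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding": a word-count vector.
    return Counter(text.lower().replace(",", " ").replace(".", " ").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunk = ("Grouped-query attention shares key and value projections across head groups, "
         "reducing cache size by a factor of H/G.")
query = "How does GQA reduce memory?"
# In production an LLM writes this; here it is hard-coded.
fake_answer = ("GQA, grouped-query attention, shares key and value projections across "
               "head groups, so the KV cache size shrinks and memory use is reduced.")

print("query vs chunk:", round(cosine(embed(query), embed(chunk)), 3))
print("HyDE  vs chunk:", round(cosine(embed(fake_answer), embed(chunk)), 3))

The raw query shares almost no vocabulary with the chunk, so its score is near zero; the fake answer, written in the chunk's register, scores far higher.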

The takeaway: HyDE works because embedding-space proximity is shaped by vocabulary as much as by semantics. The fake answer brings the query into the chunk’s neighborhood. RRF-fusing both the query and hyde_answer retrievals is the production pattern — robust when the LLM hallucinates a wrong premise.

Quick check

Your RAG pipeline answers single-fact questions correctly but fails on 'What's the trend across these five papers?'-style questions. What's the most appropriate fix?

Key takeaways

  1. Naive RAG has 4 known failure modes. Diagnose which you’re hitting before reaching for a technique.
  2. HyDE for vocab mismatch, decomposition for multi-hop, GraphRAG for global questions, agents for open-ended exploration. Match the technique to the failure.
  3. Always layer on top of hybrid + rerank, not as a replacement. Naive retrieval still does most of the work.
  4. Cost grows fast. A 4-step agent costs 8× a single retrieval. Production systems route easy queries to the cheap path and reserve advanced techniques for the hard ones.
  5. Eval honestly. Build a 50-question test set with the failure modes labeled (a sketch follows below). Vibes-based RAG tuning is how teams ship regressions.
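A labeled test set and scorer can start this small; the questions, field names, and judge signature below are illustrative, not a prescribed schema.

from collections import defaultdict

EVAL_SET = [
    {"q": "What does the 2023 paper compare to the 2021 baseline?",  "mode": "multi_hop",      "expected": "..."},
    {"q": "What did the authors conclude across their five papers?", "mode": "global",         "expected": "..."},
    {"q": "How does GQA reduce memory?",                             "mode": "vocab_mismatch", "expected": "..."},
    # ... grow to ~50 questions, each tagged with the failure mode it probes
]

def accuracy_by_mode(pipeline, eval_set, judge):
    # judge(answer, expected) -> bool; per-mode accuracy shows which failure you're actually hitting.
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in eval_set:
        totals[ex["mode"]] += 1
        hits[ex["mode"]] += judge(pipeline(ex["q"]), ex["expected"])
    return {mode: hits[mode] / totals[mode] for mode in totals}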

Go deeper
