Advanced RAG — HyDE, GraphRAG, Agentic Retrieval
Naive RAG is one cosine-similarity sort: `top_5 = index.query(vector=embed(question), top_k=5)`. Production RAG is a pipeline — sometimes a tree. An LLM rewrites the user query into a one-paragraph fake answer (the HyDE pattern) so the embedding lands in the document’s vocabulary instead of the user’s jargon. A planner decomposes one question into 2–4 sub-questions retrieved separately. A graph index returns community summaries instead of chunks for “what did the authors conclude across their five papers?” style questions. A loop lets the model issue its own searches when it doesn’t know which sub-query to ask first.
Each of these is a response to a specific failure mode of naive RAG — vocab mismatch, multi-hop, global questions, open-ended exploration. When your eval shows RAG failing, the diagnostic question is which failure mode you’re hitting: that determines which technique to reach for. Throwing all four at every query is how teams burn 8× the cost for 2 percentage points of accuracy. This lesson covers the four most-deployed techniques in 2026, the failure mode each one fixes, and when each one hurts.
TL;DR
- Naive RAG fails on multi-hop questions, vague queries, and entity-centric corpora. Symptom: high retrieval recall on facts, low task accuracy.
- HyDE generates a hypothetical answer with the LLM, embeds it, then retrieves — bridges the query/document phrasing gap.
- Query decomposition breaks one question into 2–4 sub-queries, retrieves for each, then reasons over the union. Best for multi-hop.
- GraphRAG (Microsoft, 2024) builds a knowledge graph from the corpus, then summarizes communities — beats vector search on “global” questions a chunk can’t answer alone.
- Agentic retrieval lets the model decide its own searches in a loop. Highest ceiling, highest cost, hardest to debug.
The four failure modes
Naive RAG (embed → top-k → stuff in prompt → answer) hits a ceiling fast. The failures look like:
- “What does the 2023 paper compare to the 2021 baseline?” → multi-hop
- “Tell me about token routing” → vague query, all chunks score similar
- “What did the authors conclude across their five papers?” → global question, no single chunk has the answer
- “What is mentioned in the section before the conclusion?” → structural reference
Each is a known failure mode with a known fix. The four below are the most-deployed responses in 2026 production RAG.
Mental model
These aren’t mutually exclusive — production systems often combine HyDE for vocab mismatch and decomposition for multi-hop, with the agent loop as fallback for the rest.
HyDE — bridge the query/document vocab gap
The query “How does GQA reduce memory?” is short and uses jargon. The actual document chunk says “Grouped-query attention shares key and value projections across head groups, reducing KV-cache size by the ratio of query heads to key/value groups.” Cosine sim between query and chunk is mediocre; cosine sim between a fake answer and the chunk is excellent.
```python
def hyde(query: str, llm, embedder, retriever) -> list[str]:
    # Generate a hypothetical answer, embed it, retrieve against that embedding.
    fake_answer = llm.generate(
        f"Write a one-paragraph answer to: {query}\n\n"
        f"It's OK to be wrong on details — the goal is realistic phrasing."
    )
    return retriever.retrieve(embedder.encode(fake_answer), k=10)
```

When it helps: short queries on long technical chunks, queries in a different language than the corpus, queries written by users who don’t know the field’s vocabulary.
When it hurts: the LLM hallucinates a wrong premise, biasing retrieval toward irrelevant docs. Mitigate by retrieving against both the query and the HyDE answer, fusing with RRF (reciprocal rank fusion).
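A minimal sketch of that mitigation, assuming the same injected `llm`, `embedder`, and `retriever` as above and a retriever that returns chunks (as strings or hashable IDs) in ranked order:

```python
def hyde_with_rrf(query: str, llm, embedder, retriever, k: int = 10, rrf_k: int = 60):
    fake_answer = llm.generate(f"Write a one-paragraph answer to: {query}")
    rankings = [
        retriever.retrieve(embedder.encode(query), k=k),        # literal query
        retriever.retrieve(embedder.encode(fake_answer), k=k),  # HyDE answer
    ]
    scores: dict = {}
    for ranking in rankings:
        for rank, chunk in enumerate(ranking):
            # Reciprocal rank fusion: each list votes 1 / (rrf_k + rank).
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Even if the fake answer is built on a wrong premise, chunks that the literal query also ranks highly stay near the top.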
Query decomposition
```python
import json

def decompose_and_retrieve(query: str, llm, retriever):
    plan = llm.generate(
        f"Break this question into the smallest sub-questions whose answers can be combined.\n"
        f"Question: {query}\nReturn JSON list."
    )
    sub_qs = json.loads(plan)
    contexts = []
    for q in sub_qs:
        contexts.extend(retriever.retrieve(q, k=3))
    return contexts, sub_qs
```

For “Compare DeepSeek-V3’s MLA to standard GQA on memory and quality,” decomposition produces:
- What is MLA?
- What is standard GQA?
- What’s the memory cost of MLA vs GQA?
- What’s the quality cost of MLA vs GQA?
Retrieving 3 chunks per sub-question and combining beats retrieving 12 chunks for the original. Recall@3 is much higher per sub-question than recall@12 for an over-complex query.
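The “reason over the union” step is just one more generation over the pooled contexts. A minimal sketch, assuming `retriever.retrieve` returns text chunks:

```python
def decompose_and_answer(query: str, llm, retriever) -> str:
    contexts, sub_qs = decompose_and_retrieve(query, llm, retriever)
    context_block = "\n\n".join(contexts)
    return llm.generate(
        f"Sub-questions: {sub_qs}\n"
        f"Context:\n{context_block}\n\n"
        f"Using only the context above, answer: {query}"
    )
```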
GraphRAG — for global / corpus-wide questions
Microsoft’s GraphRAG (2024) inverts the indexing strategy:
- Index time: extract (entity, relation, entity) triples from each chunk; cluster the resulting graph into communities (Leiden algorithm); have the LLM summarize each community.
- Query time: retrieve community summaries (not chunks) and answer over those.
```python
# Conceptual — see the paper for the actual pipeline
graph = extract_entities_and_relations(corpus)
communities = leiden_cluster(graph, levels=[0, 1, 2])  # multi-resolution
for c in communities:
    c.summary = llm.summarize(c.nodes_and_edges, target_words=400)

def graphrag_query(query, communities):
    relevant = [c for c in communities if relevance(query, c.summary) > THRESHOLD]
    return llm.synthesize(query, [c.summary for c in relevant])
```

Best for: corpora where the answer requires aggregating across many documents (annual reports across years, themes across a paper collection, cross-document narrative arcs). Bad for: factual lookups inside a single document — GraphRAG is overkill and the community summary loses precision.
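The `relevance()` call above is doing real work. A simple version, sketched here (not the paper’s actual scoring; assumes the same `embedder` used for chunks is in scope), scores each community summary against the query by cosine similarity:

```python
import numpy as np

def relevance(query: str, summary: str) -> float:
    # Cosine similarity between the query and one community summary.
    q = embedder.encode(query)
    s = embedder.encode(summary)
    return float(np.dot(q, s) / (np.linalg.norm(q) * np.linalg.norm(s)))
```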
Agentic retrieval
Let the model issue its own searches in a loop:
```python
import json

def agentic_retrieve(question, retriever, llm, max_steps=4):
    notes = []
    for step in range(max_steps):
        plan = llm.generate(
            f"Question: {question}\n"
            f"Notes so far: {notes}\n"
            f"Either output JSON {{search: '...'}} or {{answer: '...'}}."
        )
        plan = json.loads(plan)
        if 'answer' in plan:
            return plan['answer']
        chunks = retriever.retrieve(plan['search'], k=5)
        # summarize() is any helper that compresses chunks into a short note.
        notes.append({'search': plan['search'], 'results': summarize(chunks)})
    return llm.generate(f"Question: {question}\nNotes: {notes}\nAnswer:")
```

This is what powers most production “deep research” agents. Highest ceiling — the model can iteratively zoom in. Highest cost — N model calls per question. Hardest to evaluate — every trajectory differs.
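One practical gotcha: `json.loads` throws whenever the model wraps its JSON in prose. A common hardening, sketched here (not tied to any particular library), is a best-effort extraction of the outermost JSON object before giving up:

```python
import json

def parse_plan(raw: str) -> dict:
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span if the model added surrounding prose.
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            return json.loads(raw[start:end + 1])
        raise
```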
Real-world numbers (HotpotQA multi-hop benchmark)
| Technique | Recall@5 | EM accuracy | Cost (relative) |
|---|---|---|---|
| Naive RAG | 38% | 28% | 1× |
| Hybrid + Rerank | 51% | 36% | 1.2× |
| HyDE | 56% | 41% | 2× |
| Query decomposition | 64% | 52% | 3× |
| Agentic (4 steps) | 72% | 58% | 8× |
| GraphRAG (when applicable) | — | 64% | varies |
(Numbers are illustrative — published benchmarks for each technique vary by corpus.)
The right pick is task-dependent. Default to hybrid + rerank (see RAG Fundamentals); reach for these only when you’ve measured failure modes that match.
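In practice that means a router in front of the retrieval stack. A minimal sketch, assuming the functions defined above plus a hypothetical `hybrid_rerank()` cheap path and an `llm` client in scope (each path returns whatever it produces; a real router would normalize that):

```python
ROUTES = {
    "simple":     lambda q: hybrid_rerank(q),                          # cheap default path
    "vocab_gap":  lambda q: hyde(q, llm, embedder, retriever),         # HyDE
    "multi_hop":  lambda q: decompose_and_retrieve(q, llm, retriever)[0],
    "open_ended": lambda q: agentic_retrieve(q, retriever, llm),       # agent loop
}

def route(query: str):
    label = llm.generate(
        f"Classify this query as one of {list(ROUTES)}. Answer with the label only.\n"
        f"Query: {query}"
    ).strip()
    return ROUTES.get(label, ROUTES["simple"])(query)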
Run it in your browser — see HyDE close the embedding gap
The mechanic: a short jargon query has poor cosine sim with a long technical chunk; a hypothetical answer in the chunk’s register matches it much better. We hard-code the fake answer (in production an LLM generates it) and watch the score jump.
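A minimal offline version of that demo, assuming `sentence-transformers` is installed (any embedding model shows the same effect; the fake answer is hard-coded):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunk = ("Grouped-query attention shares key and value projections across head "
         "groups, reducing KV-cache size by the ratio of query heads to groups.")
query = "How does GQA reduce memory?"
hyde_answer = ("GQA reduces memory by sharing key and value projections across "
               "groups of attention heads, shrinking the KV cache.")

c, q, h = model.encode([chunk, query, hyde_answer])
print("query vs chunk:", util.cos_sim(q, c).item())
print("hyde  vs chunk:", util.cos_sim(h, c).item())
```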
The takeaway: HyDE works because embedding-space proximity is shaped by vocabulary as much as by semantics. The fake answer brings the query into the chunk’s neighborhood. RRF-fusing both the query and `hyde_answer` retrievals is the production pattern — robust when the LLM hallucinates a wrong premise.
Key takeaways
- Naive RAG has 4 known failure modes. Diagnose which you’re hitting before reaching for a technique.
- HyDE for vocab mismatch, decomposition for multi-hop, GraphRAG for global questions, agents for open-ended exploration. Match the technique to the failure.
- Always layer on top of hybrid + rerank, not as a replacement. Naive retrieval still does most of the work.
- Cost grows fast. A 4-step agent costs 8× a single retrieval. Production systems route easy queries to the cheap path and reserve advanced techniques for the hard ones.
- Eval honestly. Build a 50-question test set with the failure modes labeled. Vibes-based RAG tuning is how teams ship regressions.
Go deeper
- Paper: “Precise Zero-Shot Dense Retrieval without Relevance Labels” (HyDE). The original HyDE paper. Short and clear.
- Paper: “From Local to Global: A GraphRAG Approach to Query-Focused Summarization”. The GraphRAG paper. Read it after trying naive RAG and feeling its limits.
- Paper: “Self-RAG: Learning to Retrieve, Generate, and Critique”. Trains the model to decide *when* to retrieve. Influential for agentic patterns.
- Repo: microsoft/graphrag. Reference implementation. Read `graphrag/index/` for how the indexing pipeline actually runs.
- Blog: Contextual Retrieval. A simpler alternative to GraphRAG: prepend a short LLM-generated context to each chunk before embedding. Surprisingly strong baseline.
- Docs: LlamaIndex — Advanced Retrieval. Practical recipes for each technique with working code.
- Video: Greg Kamradt — RAG From Scratch. A 12-part series. Episodes 5–9 cover the techniques here with notebook walkthroughs.