Prefix & RadixAttention
Prereqs: KV Cache Basics and PagedAttention. This lesson is what happens between requests, not within them.
When you ship a chatbot with a 4K-token system prompt, the very first user message takes about 1.2 seconds to first token on a 70B model on an H100 — that’s the prefill running 4K tokens through 80 layers. The 50,000th user message takes the same 1.2 seconds: the same KV vectors computed for the same 4K tokens, fifty thousand reruns of identical math.
Prefix caching is the trick that turns the 50,000th prefill into a hash lookup. Every block whose contents are determined by a prefix the server has already seen is served from a hash table. Same model, same numbers, TTFT drops from 1.2 s to ~30 ms. The 70B model didn’t get faster; the workload got de-duplicated.
Every modern serving stack (vLLM v1, SGLang, TensorRT-LLM, TGI) ships some form of prefix caching. The two designs you’ll encounter are vLLM’s Automatic Prefix Caching (APC), which is a hash table keyed on chained block hashes, and SGLang’s RadixAttention, which is a radix tree of every prefix the server has ever seen. APC wins on independent requests with shared prompts; the tree wins when conversations branch — agent loops with self-consistency sampling, evals running variants of one shot, multi-user chat with shared system context. Both rely on one fact about transformers: K and V at position t are deterministic functions of the prefix [0..t], so byte-identical prefixes have byte-identical KV.
TL;DR
- Most production traffic isn’t unique. Two requests share a system prompt; ten requests share a chat history; a hundred requests share a few-shot template. Prefill those shared tokens once, reuse the KV across requests.
- vLLM v0.5+ ships automatic prefix caching (APC): every block whose contents have been seen before is served from a hash table instead of recomputed.
- SGLang’s RadixAttention generalises APC to a radix tree — every prefix of every conversation is a node, every block is a tree edge, eviction is LRU on the tree. Sharing extends down the entire tree, not just the root prompt.
- Real workloads see 3–10× TTFT (time-to-first-token) speedups when prefix hit rate is high. Chatbots, agents, evals, and few-shot inference are all prefix-heavy by nature.
- Cache reuse is correct only when the prefix tokens are byte-identical (after tokenization). Tokenizer drift, hidden whitespace, RNG-seeded prompts → cache miss.
Why TTFT is the user-visible metric
Time-to-first-token (TTFT) is the user-visible latency in chat. TTFT is dominated by prefill — running every prompt token through every layer. If your system prompt is 4K tokens, every new conversation eats a 4K-token prefill before the user sees anything. Prefix caching turns that 4K prefill into a hash lookup. On a 70B model on H100, that’s the difference between ~1.2 s TTFT and ~30 ms. For agent loops that re-issue tool-augmented prompts hundreds of times per task, the impact compounds — total wall time can drop 5–10×.
This is also why “is your serving stack prefix-aware?” is a real procurement question in 2026. vLLM v1, SGLang, TensorRT-LLM, and Hugging Face TGI all support some form of it. The differences are in tree structure, eviction policy, and how aggressively they cache mid-conversation suffixes.
Mental model
Every node is a contiguous block-aligned chunk of tokens. Every edge means “extend this prefix.” When Alice sends a third message, the server walks the tree from root → system prompt → Alice’s first turn → the model’s reply → her second message, finds the longest cached prefix, and only prefills the new tail. When the cache fills, an LRU sweep evicts leaves first.
Three flavours of prefix cache
| System | Granularity | Sharing scope | Eviction | Notes |
|---|---|---|---|---|
| vLLM APC | KV block (16 tokens) | Per-block hash table | LRU | Shipped v0.5; on by default in v1. Tracks block-content hashes globally. |
| SGLang RadixAttn | Variable-length edge | Tree (any prefix) | LRU on tree | Best for branching chat; explicit handling of conversation history. |
| TensorRT-LLM | KV block | Per-block hash table | LRU + reuse score | Lower-level; integrates with NVIDIA’s batch scheduler. |
| HF TGI | Per-conversation | Pinned prefix (manual) | Manual / TTL | Simpler model; client passes a cache_id. |
The mental difference: APC is a hash table; RadixAttention is a tree. The tree wins when the same conversation forks (multi-turn chat with branching, agent loops with self-consistency, evals running variants of the same prompt).
Why a block-aligned hash works
KV blocks are typically 16 tokens. K and V at position t are projections of h_t, the hidden state output by the previous layer — and h_t was produced by self-attention over positions [0..t]. So a block’s KV contents are fully determined by the prefix of tokens up to that block.
That means the hash of a block is hash(prefix_tokens_so_far). Two requests that share the first 312 blocks bit-exactly share KV for those 312 blocks — no recomputation, no quality cost, no approximation.
```python
# vLLM's APC, in spirit (sketch: `self.cache`, `run_prefill`, and
# `prompt_tokens` stand in for the real engine plumbing)
BLOCK = 16

def block_hash(prev_block_hash: int, tokens: tuple[int, ...]) -> int:
    return hash((prev_block_hash, tokens))  # chained → prefix-determined

def chunked(tokens, size):
    for i in range(0, len(tokens), size):
        yield tuple(tokens[i : i + size])

# On prefill: walk the prompt block by block until the first miss.
existing = []
prev_hash = 0
for block_tokens in chunked(prompt_tokens, BLOCK):
    h = block_hash(prev_hash, block_tokens)
    if h in self.cache:
        existing.append(self.cache[h])  # reuse: zero compute
        prev_hash = h
    else:
        break  # first miss = recompute from here on

# Prefill only the tail, then register its blocks under their chained hashes.
missed = prompt_tokens[len(existing) * BLOCK :]
new_blocks = run_prefill(missed)
new_hashes = []
for block_tokens in chunked(missed, BLOCK):
    prev_hash = block_hash(prev_hash, block_tokens)
    new_hashes.append(prev_hash)
self.cache.update(zip(new_hashes, new_blocks))
```

The “first miss = recompute from here on” rule is critical: cache hits must be a contiguous prefix from the front. You can’t have a hit at block 4, a miss at block 5, and a hit at block 6 — the K and V at block 6 are functions of everything up to block 5.
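To see the contiguity rule concretely, here is a standalone toy run of the chained hash (block size 4 for readability; not vLLM code) on two prompts that differ by a single token:

```python
def block_hash(prev_h: int, tokens: tuple) -> int:
    return hash((prev_h, tokens))  # same chaining idea as above

def prefix_hashes(token_ids, block=4):
    """Chained hash of each block; block i's hash depends on blocks 0..i."""
    hashes, h = [], 0
    for i in range(0, len(token_ids), block):
        h = block_hash(h, tuple(token_ids[i : i + block]))
        hashes.append(h)
    return hashes

a = list(range(16))   # four 4-token blocks
b = list(range(16))
b[5] = 99             # flip one token inside block 1
matches = [x == y for x, y in zip(prefix_hashes(a), prefix_hashes(b))]
print(matches)  # [True, False, False, False]
```

Blocks 2 and 3 hold identical tokens in both prompts, yet still miss: their chained hashes inherit the divergence at block 1, which is exactly why a hit can never follow a miss.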
What hit rate looks like in practice
| Workload | Typical prefix hit rate | TTFT speedup |
|---|---|---|
| Single-shot Q&A (no shared sys prompt) | 0–5% | ~1× |
| Customer-support chatbot | 60–85% | 3–6× |
| Agent w/ ReAct loop | 80–95% | 5–10× |
| Evals (N variants of same few-shot template) | 95%+ | 10–20× |
| Coding assistant (repo-level context) | 70–90% | 4–8× |
Hit rate is the headline metric. Track it (vLLM exposes prefix_cache_hit_rate per-engine). If it’s below 50% on a chat workload, your prompt template is probably non-canonical (RNG seeds, timestamps, ID injection at the front instead of the back).
Cache-busting prompts to avoid
Anything that mutates the prefix breaks sharing. Common foot-guns:
- Timestamps near the start. Move them to the end of the system message.
- Per-user IDs in the system prompt. Move to a tool definition or the user turn.
- Random IDs (UUIDs, request IDs). Almost never belong in the LLM input.
- Tokenizer mismatch between client and server — e.g., trailing whitespace handled differently. Always tokenize server-side and pin the tokenizer version.
- Different chat templates per call. Pin the template; don’t let downstream code rebuild it.
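A cheap audit for the foot-guns above: render the template twice and measure the shared prefix. This toy check compares characters rather than token IDs, and the prompt strings are made up for illustration, but the effect is the same:

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Characters shared from the front; a stand-in for shared KV blocks."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "You are a support agent. Follow policy v3 exactly. " * 20

# Timestamp at the front: two requests minutes apart diverge inside the timestamp.
bad_a = f"[ts=1700000000] {system}"
bad_b = f"[ts=1700000473] {system}"

# Timestamp at the end: the entire system prompt stays shareable.
good_a = f"{system}[ts=1700000000]"
good_b = f"{system}[ts=1700000473]"

print(shared_prefix_len(bad_a, bad_b))    # tiny: cache miss from block 0
print(shared_prefix_len(good_a, good_b))  # shares all of `system`
```

Wiring a check like this into CI for each prompt template catches cache-busting regressions before they show up as a hit-rate drop in production.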
Run it in your browser
A working radix-tree prefix cache simulator. Insert sequences, watch shared edges, query hit rate.
A real implementation also tracks per-block reference counts, evicts via LRU on leaves, and pages KV blocks into and out of GPU memory. The skeleton above is the data-structure idea — vLLM’s APC and SGLang’s RadixAttention fill in the GPU plumbing.
Key takeaways
- Prefix caching turns repeated prompts into hash lookups. TTFT goes from “prefill cost” to “memcpy cost.”
- Hits must be contiguous from the front — a single varying byte at offset 0 invalidates the entire downstream cache.
- Hit rate is the metric. Track it. If it’s below 50% on a chat workload, the prompt template is the bug.
- vLLM v1 = automatic prefix caching by default. SGLang’s RadixAttention does the same with a tree, which wins on branching workloads.
- The savings stack. PagedAttention tames intra-request memory; prefix caching tames inter-request duplication; speculative decoding tames decode latency. They compose.
Go deeper
- Paper: *SGLang: Efficient Execution of Structured Language Model Programs*. Section 4 introduces RadixAttention; the clearest published description of a tree-structured KV cache.
- Paper: *Efficient Memory Management for LLM Serving with PagedAttention*. The vLLM paper. APC is a follow-on; this is the foundation.
- Blog: *vLLM — Automatic Prefix Caching*. Authoritative explainer with benchmarks. Read this and the v1 update.
- Blog: *SGLang v0.4 — RadixAttention with Llama 3*. Hit-rate plots on real chat workloads.
- Docs: *vLLM docs — Prefix Caching*. How to enable, monitor, and debug. Includes the metric names you want in Grafana.
- Repo: *sgl-project/sglang*. Reference implementation. `python/sglang/srt/mem_cache/radix_cache.py` is the data structure; `model_runner.py` shows how it integrates with prefill.
- Repo: *vllm-project/vllm*. See `vllm/v1/core/kv_cache_manager.py` for the v1 prefix-cache implementation.