
Prefix & RadixAttention

Prereqs: KV Cache Basics and PagedAttention. This lesson is what happens between requests, not within them.

When you ship a chatbot with a 4K-token system prompt, the very first user message takes about 1.2 seconds to first token on a 70B model on H100 — that’s the prefill running 4K tokens through 80 layers. The 50,000th user message takes the same 1.2 seconds. The same KV vectors get recomputed for the same 4K tokens, fifty thousand times over — identical math, rerun on every request.

Prefix caching is the trick that turns the 50,000th prefill into a hash lookup. Every block whose contents are determined by a prefix the server has already seen is served from a hash table. Same model, same numbers, TTFT drops from 1.2 s to ~30 ms. The 70B model didn’t get faster; the workload got de-duplicated.

Every modern serving stack (vLLM v1, SGLang, TensorRT-LLM, TGI) ships some form of it. The two designs you’ll encounter are vLLM’s Automatic Prefix Caching (APC), a hash table keyed on chained block hashes, and SGLang’s RadixAttention, a radix tree of every prefix the server has ever seen. APC wins on independent requests with shared prompts; the tree wins when conversations branch — agent loops with self-consistency sampling, evals running variants of the same few-shot prompt, multi-user chat with shared system context. Both rely on one fact about transformers: K and V at position t are deterministic functions of the prefix [0..t], so byte-identical prefixes have byte-identical KV.

TL;DR

  • Most production traffic isn’t unique. Two requests share a system prompt; ten requests share a chat history; a hundred requests share a few-shot template. Prefill those shared tokens once, reuse the KV across requests.
  • vLLM v0.5+ ships automatic prefix caching (APC): every block whose contents have been seen before is served from a hash table instead of recomputed.
  • SGLang’s RadixAttention generalises APC to a radix tree — every prefix of every conversation is a node, every block is a tree edge, eviction is LRU on the tree. Sharing extends down the entire tree, not just the root prompt.
  • Real workloads see 3–10× faster TTFT (time-to-first-token) when the prefix hit rate is high. Chatbots, agents, evals, and few-shot inference are all prefix-heavy by nature.
  • Cache reuse is correct only when the prefix tokens are byte-identical (after tokenization). Tokenizer drift, hidden whitespace, RNG-seeded prompts → cache miss.

Why TTFT is the user-visible metric

Time-to-first-token (TTFT) is the user-visible latency in chat. TTFT is dominated by prefill — running every prompt token through every layer. If your system prompt is 4K tokens, every new conversation eats a 4K-token prefill before the user sees anything. Prefix caching turns that 4K prefill into a hash lookup. On a 70B model on H100, that’s the difference between ~1.2 s TTFT and ~30 ms. For agent loops that re-issue tool-augmented prompts hundreds of times per task, the impact compounds — total wall time can drop 5–10×.
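A rough back-of-envelope check of those numbers (a sketch, not a benchmark — it assumes dense BF16 prefill at roughly 50% of peak on H100-class compute and ignores the tensor-parallel layout a 70B model actually needs):

# Back-of-envelope: why a 4K-token prefill on a 70B model costs ~1 second.
# Assumptions (mine, not measured): ~2 * params FLOPs per token, H100-class
# peak of ~1e15 BF16 FLOP/s, and ~50% achieved utilization during prefill.
params = 70e9
prompt_tokens = 4096
prefill_flops = 2 * params * prompt_tokens          # ~5.7e14 FLOPs
achieved_flops_per_s = 0.5 * 1e15                   # 50% of ~1 PFLOP/s peak
print(f"prefill ≈ {prefill_flops / achieved_flops_per_s:.2f} s")  # ≈ 1.15 s
# With the 4K prefix cached, TTFT is roughly one decode step plus the prefill
# of the short new user message — tens of milliseconds, not seconds.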

This is also why “is your serving stack prefix-aware” is a real procurement question in 2026. vLLM v1, SGLang, TensorRT-LLM, and Hugging Face TGI all support some form of prefix caching. The differences are in tree structure, eviction policy, and how aggressively they cache mid-conversation suffixes.

Mental model

Picture the KV cache as a tree of prefixes. Every node is a contiguous, block-aligned chunk of tokens; every edge means “extend this prefix.” When Alice sends a third message, the server walks the tree from root → system prompt → Alice’s first turn → reply → second message, finds the longest cached prefix, and only prefills the new tail. When the cache fills, an LRU sweep evicts leaves first.

Three flavours of prefix cache

System | Granularity | Sharing scope | Eviction | Notes
vLLM APC | KV block (16 tokens) | Per-block hash table | LRU | Shipped in v0.5; on by default in v1. Tracks block-content hashes globally.
SGLang RadixAttention | Variable-length edge | Tree (any prefix) | LRU on tree | Best for branching chat; explicit handling of conversation history.
TensorRT-LLM | KV block | Per-block hash table | LRU + reuse score | Lower-level; integrates with NVIDIA’s batch scheduler.
HF TGI | Per-conversation | Pinned prefix (manual) | Manual / TTL | Simpler model; client passes a cache_id.

The mental difference: APC is a hash table; RadixAttention is a tree. The tree wins when the same conversation forks (multi-turn chat with branching, agent loops with self-consistency, evals running variants of the same prompt).

Why a block-aligned hash works

KV blocks are typically 16 tokens. K and V at position t are projections of h_t, the hidden state output by the previous layer — and h_t was produced by self-attention over positions [0..t]. So a block’s KV contents are fully determined by the prefix of tokens up to that block.
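Written out (the per-layer notation is mine; the lesson states this only in prose — causal attention at layer ℓ sees positions 0..t, so by induction over layers everything at position t depends only on tokens x_0..x_t):

k_t^{(\ell)} = W_K^{(\ell)} h_t^{(\ell-1)}, \qquad
v_t^{(\ell)} = W_V^{(\ell)} h_t^{(\ell-1)}, \qquad
h_t^{(\ell)} = f^{(\ell)}\big(h_0^{(\ell-1)}, \dots, h_t^{(\ell-1)}\big)
\;\Longrightarrow\;
\big(k_t^{(\ell)}, v_t^{(\ell)}\big) = F^{(\ell)}(x_0, \dots, x_t)

so byte-identical token prefixes produce bit-identical K and V at every layer.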

That means the hash of a block is hash(prefix_tokens_so_far). Two requests whose first 312 blocks are bit-identical share KV for those 312 blocks — no recomputation, no quality cost, no approximation.

# vLLM's APC, in spirit (simplified: no ref-counting, no eviction)
def block_hash(prev_block_hash: int, tokens: tuple[int, ...]) -> int:
    return hash((prev_block_hash, tokens))  # chained → prefix-determined

# On prefill:
BLOCK = 16
blocks = list(chunked(prompt_tokens, BLOCK))
existing = []                                # cached KV blocks we can reuse
prev_hash = 0
for block_tokens in blocks:
    h = block_hash(prev_hash, tuple(block_tokens))
    if h in self.cache:
        existing.append(self.cache[h])       # reuse: zero compute
        prev_hash = h
    else:
        break                                # first miss = recompute from here on

# Prefill only the uncached tail, then register each new block under its chained hash.
new_blocks = run_prefill(prompt_tokens[len(existing) * BLOCK:])
new_hashes = []
for block_tokens in blocks[len(existing):]:
    prev_hash = block_hash(prev_hash, tuple(block_tokens))
    new_hashes.append(prev_hash)
self.cache.update(zip(new_hashes, new_blocks))

The “first miss = recompute from here on” rule is critical: cache hits must be a contiguous prefix from the front. You can’t have a hit at block 4, miss at block 5, hit at block 6 — the K, V at block 6 are functions of everything up to block 5.

What hit rate looks like in practice

Workload | Typical prefix hit rate | TTFT speedup
Single-shot Q&A (no shared system prompt) | 0–5% | ~1×
Customer-support chatbot | 60–85% | 3–6×
Agent with ReAct loop | 80–95% | 5–10×
Evals (N variants of the same few-shot template) | 95%+ | 10–20×
Coding assistant (repo-level context) | 70–90% | 4–8×

Hit rate is the headline metric. Track it (vLLM exposes prefix_cache_hit_rate per-engine). If it’s below 50% on a chat workload, your prompt template is probably non-canonical (RNG seeds, timestamps, ID injection at the front instead of the back).
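One way to watch it in practice (a sketch, assuming you run vLLM’s OpenAI-compatible server, which exposes Prometheus metrics at /metrics; the exact metric name varies across vLLM versions, so this just filters for anything mentioning prefix_cache):

# Scrape the running vLLM server's Prometheus endpoint and print any
# prefix-cache counters/gauges. URL and port are assumptions — adjust
# to however you launched the server.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

with urllib.request.urlopen(METRICS_URL) as resp:
    body = resp.read().decode()

for line in body.splitlines():
    if "prefix_cache" in line and not line.startswith("#"):
        print(line)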

Cache-busting prompts to avoid

Anything that mutates the prefix breaks sharing. Common foot-guns (a small before/after sketch follows this list):

  • Timestamps near the start. Move them to the end of the system message.
  • Per-user IDs in the system prompt. Move to a tool definition or the user turn.
  • Random IDs (UUIDs, request IDs). Almost never belong in the LLM input.
  • Tokenizer mismatch between client and server — e.g., trailing whitespace handled differently. Always tokenize server-side and pin the tokenizer version.
  • Different chat templates per call. Pin the template; don’t let downstream code rebuild it.
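For example, here is a minimal sketch of the first foot-gun (the prompt text and the whitespace “tokenizer” are invented for illustration; a real tokenizer changes the numbers, not the conclusion):

# Timestamp at the FRONT of the system prompt vs. at the END.
# The shared-prefix length determines how many KV blocks can be reused.
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "You are a support assistant for Acme. Follow the refund policy strictly."

bad_1 = f"[2026-01-06 09:00] {SYSTEM}".split()   # timestamp first → prefix differs
bad_2 = f"[2026-01-07 09:00] {SYSTEM}".split()

good_1 = f"{SYSTEM} Current time: 2026-01-06 09:00".split()  # timestamp last
good_2 = f"{SYSTEM} Current time: 2026-01-07 09:00".split()

print(shared_prefix_len(bad_1, bad_2))    # 0  — every block is a cache miss
print(shared_prefix_len(good_1, good_2))  # 14 — the whole system prompt stays shared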

Run it in your browser

A working radix-tree prefix cache simulator. Insert sequences, watch shared edges, query hit rate.

Python — editable. Builds a prefix tree across 6 chat sequences and shows node count and the prefix-cache hit rate.
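A minimal sketch of that simulator (my reconstruction of the idea, not the exact code from the page — it uses a plain per-token trie for clarity; a radix tree just compresses single-child chains into one edge):

# A minimal prefix-cache simulator: insert chat sequences into a trie,
# count how many tokens of each new request were already cached.
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)   # token -> Node

class PrefixCache:
    def __init__(self):
        self.root = Node()
        self.hits = 0
        self.total = 0

    def match(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            n += 1
        return n

    def insert(self, tokens):
        """Simulate a request: count hits on the cached prefix, then cache the tail."""
        self.hits += self.match(tokens)
        self.total += len(tokens)
        node = self.root
        for tok in tokens:                  # re-walk, creating nodes for the uncached tail
            node = node.children.setdefault(tok, Node())

    def num_nodes(self, node=None):
        node = node or self.root
        return 1 + sum(self.num_nodes(c) for c in node.children.values())

SYSTEM = "system: you are a helpful assistant ."
conversations = [
    f"{SYSTEM} user: hi",
    f"{SYSTEM} user: hi assistant: hello ! user: help me with refunds",
    f"{SYSTEM} user: hi assistant: hello ! user: help me with shipping",
    f"{SYSTEM} user: translate this sentence",
    f"{SYSTEM} user: translate this sentence assistant: sure user: another one",
    f"{SYSTEM} user: write a haiku",
]

cache = PrefixCache()
for convo in conversations:
    cache.insert(convo.split())

print(f"nodes in tree: {cache.num_nodes()}")
print(f"prefix-cache hit rate: {cache.hits / cache.total:.0%}")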

A real implementation also tracks per-block reference counts, evicts via LRU on leaves, and pages KV blocks into and out of GPU memory. The skeleton above is the data-structure idea — vLLM’s APC and SGLang’s RadixAttention fill in the GPU plumbing.

Quick check

Fill in the blank
A cache hit must always start from the very first token of the prompt — you cannot skip a missed block in the middle. The technical reason:
Why are KV vectors at position t a function of the entire preceding context?
Quick check
A team's chat product has a 4K-token system prompt. They're seeing 5% prefix-cache hit rate. The biggest single thing they can change?

Key takeaways

  1. Prefix caching turns repeated prompts into hash lookups. TTFT goes from “prefill cost” to “memcpy cost.”
  2. Hits must be contiguous from the front — a single varying byte at offset 0 invalidates the entire downstream cache.
  3. Hit rate is the metric. Track it. If it’s below 50% on a chat workload, the prompt template is the bug.
  4. vLLM v1 = automatic prefix caching by default. SGLang’s RadixAttention does the same with a tree, which wins on branching workloads.
  5. The savings stack. PagedAttention tames intra-request memory; prefix caching tames inter-request duplication; speculative decoding tames decode latency. They compose.

Go deeper
