
Prefix & RadixAttention

Prereqs: KV Cache Basics and PagedAttention. This lesson is what happens between requests, not within them.

When you ship a chatbot with a 4K-token system prompt, the very first user message takes about 1.2 seconds to first token on a 70B model on H100 — that’s the prefill running 4K tokens through 80 layers. The 50,000th user message takes the same 1.2 seconds. The same KV vectors get recomputed for the same 4K tokens, fifty thousand times over — identical math, rerun on every request.

Prefix caching is the trick that turns the 50,000th prefill into a hash lookup. Every block whose contents are determined by a prefix the server has already seen is served from a hash table. Same model, same numbers, TTFT drops from 1.2 s to ~30 ms. The 70B model didn’t get faster; the workload got de-duplicated.

Every modern serving stack (vLLM v1, SGLang, TensorRT-LLM, TGI) ships some form of it. The two designs you’ll encounter are vLLM’s Automatic Prefix Caching (APC), a hash table keyed on chained block hashes, and SGLang’s RadixAttention, a radix tree of every prefix the server has ever seen. APC wins on independent requests with shared prompts; the tree wins when conversations branch — agent loops with self-consistency sampling, evals running variants of the same few-shot prompt, multi-user chat with shared system context. Both rely on one fact about transformers: K and V at position t are deterministic functions of the prefix [0..t], so byte-identical prefixes have byte-identical KV.

TL;DR

  • Most production traffic isn’t unique. Two requests share a system prompt; ten requests share a chat history; a hundred requests share a few-shot template. Prefill those shared tokens once, reuse the KV across requests.
  • vLLM v0.5+ ships automatic prefix caching (APC): every block whose contents have been seen before is served from a hash table instead of recomputed.
  • SGLang’s RadixAttention generalises APC to a radix tree — every prefix of every conversation is a node, every block is a tree edge, eviction is LRU on the tree. Sharing extends down the entire tree, not just the root prompt.
  • Real workloads see 3–10× faster TTFT (time-to-first-token) when the prefix hit rate is high. Chatbots, agents, evals, and few-shot inference are all prefix-heavy by nature.
  • Cache reuse is correct only when the prefix tokens are byte-identical (after tokenization). Tokenizer drift, hidden whitespace, RNG-seeded prompts → cache miss.

Why TTFT is the user-visible metric

Time-to-first-token (TTFT) is the user-visible latency in chat. TTFT is dominated by prefill — running every prompt token through every layer. If your system prompt is 4K tokens, every new conversation eats a 4K-token prefill before the user sees anything. Prefix caching turns that 4K prefill into a hash lookup. On a 70B model on H100, that’s the difference between ~1.2 s TTFT and ~30 ms. For agent loops that re-issue tool-augmented prompts hundreds of times per task, the impact compounds — total wall time can drop 5–10×.
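A rough back-of-envelope check of those numbers (a sketch, not a benchmark — it assumes dense BF16 prefill at roughly 50% of peak on H100-class compute and ignores the tensor-parallel layout a 70B model actually needs):

# Back-of-envelope: why a 4K-token prefill on a 70B model costs ~1 second.
# Assumptions (mine, not measured): ~2 * params FLOPs per token, H100-class
# peak of ~1e15 BF16 FLOP/s, and ~50% achieved utilization during prefill.
params = 70e9
prompt_tokens = 4096
prefill_flops = 2 * params * prompt_tokens          # ~5.7e14 FLOPs
achieved_flops_per_s = 0.5 * 1e15                   # 50% of ~1 PFLOP/s peak
print(f"prefill ≈ {prefill_flops / achieved_flops_per_s:.2f} s")  # ≈ 1.15 s
# With the 4K prefix cached, TTFT is roughly one decode step plus the prefill
# of the short new user message — tens of milliseconds, not seconds.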

This is also why “is your serving stack prefix-aware” is a real procurement question in 2026. vLLM v1, SGLang, TensorRT-LLM, and Hugging Face TGI all support some form of prefix caching. The differences are in tree structure, eviction policy, and how aggressively they cache mid-conversation suffixes.

Mental model

Picture the KV cache as a tree of prefixes. Every node is a contiguous, block-aligned chunk of tokens; every edge means “extend this prefix.” When Alice sends a third message, the server walks the tree from root → system prompt → Alice’s first turn → reply → second message, finds the longest cached prefix, and only prefills the new tail. When the cache fills, an LRU sweep evicts leaves first.

Three flavours of prefix cache

System | Granularity | Sharing scope | Eviction | Notes
vLLM APC | KV block (16 tokens) | Per-block hash table | LRU | Shipped in v0.5; on by default in v1. Tracks block-content hashes globally.
SGLang RadixAttention | Variable-length edge | Tree (any prefix) | LRU on tree | Best for branching chat; explicit handling of conversation history.
TensorRT-LLM | KV block | Per-block hash table | LRU + reuse score | Lower-level; integrates with NVIDIA’s batch scheduler.
HF TGI | Per-conversation | Pinned prefix (manual) | Manual / TTL | Simpler model; client passes a cache_id.

The mental difference: APC is a hash table; RadixAttention is a tree. The tree wins when the same conversation forks (multi-turn chat with branching, agent loops with self-consistency, evals running variants of the same prompt).

Why a block-aligned hash works

KV blocks are typically 16 tokens. K and V at position t are projections of h_t, the hidden state output by the previous layer — and h_t was produced by self-attention over positions [0..t]. So a block’s KV contents are fully determined by the prefix of tokens up to that block.
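Written out (the per-layer notation is mine; the lesson states this only in prose — causal attention at layer ℓ sees positions 0..t, so by induction over layers everything at position t depends only on tokens x_0..x_t):

k_t^{(\ell)} = W_K^{(\ell)} h_t^{(\ell-1)}, \qquad
v_t^{(\ell)} = W_V^{(\ell)} h_t^{(\ell-1)}, \qquad
h_t^{(\ell)} = f^{(\ell)}\big(h_0^{(\ell-1)}, \dots, h_t^{(\ell-1)}\big)
\;\Longrightarrow\;
\big(k_t^{(\ell)}, v_t^{(\ell)}\big) = F^{(\ell)}(x_0, \dots, x_t)

so byte-identical token prefixes produce bit-identical K and V at every layer.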

That means the hash of a block is hash(prefix_tokens_so_far). Two requests whose first 312 blocks are bit-identical share KV for those 312 blocks — no recomputation, no quality cost, no approximation.

# vLLM's APC, in spirit (simplified: no ref-counting, no eviction)
def block_hash(prev_block_hash: int, tokens: tuple[int, ...]) -> int:
    return hash((prev_block_hash, tokens))  # chained → prefix-determined

# On prefill:
BLOCK = 16
blocks = list(chunked(prompt_tokens, BLOCK))
existing = []                                # cached KV blocks we can reuse
prev_hash = 0
for block_tokens in blocks:
    h = block_hash(prev_hash, tuple(block_tokens))
    if h in self.cache:
        existing.append(self.cache[h])       # reuse: zero compute
        prev_hash = h
    else:
        break                                # first miss = recompute from here on

# Prefill only the uncached tail, then register each new block under its chained hash.
new_blocks = run_prefill(prompt_tokens[len(existing) * BLOCK:])
new_hashes = []
for block_tokens in blocks[len(existing):]:
    prev_hash = block_hash(prev_hash, tuple(block_tokens))
    new_hashes.append(prev_hash)
self.cache.update(zip(new_hashes, new_blocks))

The “first miss = recompute from here on” rule is critical: cache hits must be a contiguous prefix from the front. You can’t have a hit at block 4, miss at block 5, hit at block 6 — the K, V at block 6 are functions of everything up to block 5.

What hit rate looks like in practice

Workload | Typical prefix hit rate | TTFT speedup
Single-shot Q&A (no shared system prompt) | 0–5% | ~1×
Customer-support chatbot | 60–85% | 3–6×
Agent with ReAct loop | 80–95% | 5–10×
Evals (N variants of the same few-shot template) | 95%+ | 10–20×
Coding assistant (repo-level context) | 70–90% | 4–8×

Hit rate is the headline metric. Track it (vLLM exposes prefix_cache_hit_rate per-engine). If it’s below 50% on a chat workload, your prompt template is probably non-canonical (RNG seeds, timestamps, ID injection at the front instead of the back).
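One way to watch it in practice (a sketch, assuming you run vLLM’s OpenAI-compatible server, which exposes Prometheus metrics at /metrics; the exact metric name varies across vLLM versions, so this just filters for anything mentioning prefix_cache):

# Scrape the running vLLM server's Prometheus endpoint and print any
# prefix-cache counters/gauges. URL and port are assumptions — adjust
# to however you launched the server.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

with urllib.request.urlopen(METRICS_URL) as resp:
    body = resp.read().decode()

for line in body.splitlines():
    if "prefix_cache" in line and not line.startswith("#"):
        print(line)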

Cache-busting prompts to avoid

Anything that mutates the prefix breaks sharing. Common foot-guns (a small before/after sketch follows this list):

  • Timestamps near the start. Move them to the end of the system message.
  • Per-user IDs in the system prompt. Move to a tool definition or the user turn.
  • Random IDs (UUIDs, request IDs). Almost never belong in the LLM input.
  • Tokenizer mismatch between client and server — e.g., trailing whitespace handled differently. Always tokenize server-side and pin the tokenizer version.
  • Different chat templates per call. Pin the template; don’t let downstream code rebuild it.
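For example, here is a minimal sketch of the first foot-gun (the prompt text and the whitespace “tokenizer” are invented for illustration; a real tokenizer changes the numbers, not the conclusion):

# Timestamp at the FRONT of the system prompt vs. at the END.
# The shared-prefix length determines how many KV blocks can be reused.
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "You are a support assistant for Acme. Follow the refund policy strictly."

bad_1 = f"[2026-01-06 09:00] {SYSTEM}".split()   # timestamp first → prefix differs
bad_2 = f"[2026-01-07 09:00] {SYSTEM}".split()

good_1 = f"{SYSTEM} Current time: 2026-01-06 09:00".split()  # timestamp last
good_2 = f"{SYSTEM} Current time: 2026-01-07 09:00".split()

print(shared_prefix_len(bad_1, bad_2))    # 0  — every block is a cache miss
print(shared_prefix_len(good_1, good_2))  # 14 — the whole system prompt stays shared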

Run it in your browser

A working radix-tree prefix cache simulator. Insert sequences, watch shared edges, query hit rate.

Python — editable. Builds a prefix tree across 6 chat sequences and shows node count and the prefix-cache hit rate.
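A minimal sketch of that simulator (my reconstruction of the idea, not the exact code from the page — it uses a plain per-token trie for clarity; a radix tree just compresses single-child chains into one edge):

# A minimal prefix-cache simulator: insert chat sequences into a trie,
# count how many tokens of each new request were already cached.
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)   # token -> Node

class PrefixCache:
    def __init__(self):
        self.root = Node()
        self.hits = 0
        self.total = 0

    def match(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            n += 1
        return n

    def insert(self, tokens):
        """Simulate a request: count hits on the cached prefix, then cache the tail."""
        self.hits += self.match(tokens)
        self.total += len(tokens)
        node = self.root
        for tok in tokens:                  # re-walk, creating nodes for the uncached tail
            node = node.children.setdefault(tok, Node())

    def num_nodes(self, node=None):
        node = node or self.root
        return 1 + sum(self.num_nodes(c) for c in node.children.values())

SYSTEM = "system: you are a helpful assistant ."
conversations = [
    f"{SYSTEM} user: hi",
    f"{SYSTEM} user: hi assistant: hello ! user: help me with refunds",
    f"{SYSTEM} user: hi assistant: hello ! user: help me with shipping",
    f"{SYSTEM} user: translate this sentence",
    f"{SYSTEM} user: translate this sentence assistant: sure user: another one",
    f"{SYSTEM} user: write a haiku",
]

cache = PrefixCache()
for convo in conversations:
    cache.insert(convo.split())

print(f"nodes in tree: {cache.num_nodes()}")
print(f"prefix-cache hit rate: {cache.hits / cache.total:.0%}")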

A real implementation also tracks per-block reference counts, evicts via LRU on leaves, and pages KV blocks into and out of GPU memory. The skeleton above is the data-structure idea — vLLM’s APC and SGLang’s RadixAttention fill in the GPU plumbing.

Quick check

Fill in the blank
A cache hit must always start from the very first token of the prompt — you cannot skip a missed block in the middle. The technical reason:
Why are KV vectors at position t a function of the entire preceding context?
Quick check
A team's chat product has a 4K-token system prompt. They're seeing 5% prefix-cache hit rate. The biggest single thing they can change?

Key takeaways

  1. Prefix caching turns repeated prompts into hash lookups. TTFT goes from “prefill cost” to “memcpy cost.”
  2. Hits must be contiguous from the front — a single varying byte at offset 0 invalidates the entire downstream cache.
  3. Hit rate is the metric. Track it. If it’s below 50% on a chat workload, the prompt template is the bug.
  4. vLLM v1 = automatic prefix caching by default. SGLang’s RadixAttention does the same with a tree, which wins on branching workloads.
  5. The savings stack. PagedAttention tames intra-request memory; prefix caching tames inter-request duplication; speculative decoding tames decode latency. They compose.

Go deeper
