PagedAttention (vLLM)
When you serve a model with vllm.LLM(model="meta-llama/Llama-3.1-70B") and start sending requests, you can hold ~30 concurrent sequences on a single H100. Try the same with stock HuggingFace model.generate() and you’ll OOM at five. Same hardware, same model, same KV-cache math — 6× more users. The reason is one paper from SOSP 2023 (Kwon et al.) and one data-structure trick: treating the KV cache like virtual memory.
The naive cache reserves a contiguous slab equal to max_seq_length for every sequence in the batch. Most sequences finish at 200 tokens; you reserved 8K. 60–80% of GPU memory is wasted on internal fragmentation. The fix is exactly what an operating system does for processes: cut the cache into small fixed-size blocks (16 tokens each), keep a block table per sequence pointing to wherever its blocks happened to land, and allocate dynamically from a free list. Throughput goes up 2–6× from a data-structure change.
This is the single most impactful kernel-level change in LLM serving since FlashAttention. If you operate any LLM in production — vLLM, SGLang, TGI, TensorRT-LLM — you’re operating on top of paged storage. The page table is the abstraction.
TL;DR
- The naive KV cache reserves a contiguous slab equal to the max context for each sequence. Most sequences are shorter — 60–80% of GPU memory is wasted on internal fragmentation.
- PagedAttention (Kwon et al., SOSP 2023, the vLLM paper) treats the cache like an OS treats virtual memory: small fixed-size blocks + per-sequence block tables + a block allocator.
- Result: 2–4× higher throughput than HuggingFace TGI on the same hardware. vLLM’s defaults match the paper’s design.
- vLLM v1 (late 2024) refactored the engine around chunked prefill + PagedAttention as the unified primitive. SGLang’s RadixAttention is the alternative paradigm — same paged storage, plus prefix-tree sharing.
- This is the single most impactful kernel-level change in LLM serving since FlashAttention. Required reading.
Why memory layout dominates throughput
Before PagedAttention, an inference server with ~80 GB GPU memory could serve ~5 concurrent Llama-2-7B sequences at long context, because each sequence reserved its full max-token KV slab — even if it only ever produced 200 tokens.
After PagedAttention: same hardware, 30–50 concurrent sequences. The math didn’t change. The memory layout did. A 6× throughput jump from a data-structure change is rare in systems.
If you’re going to operate any LLM in production, you need to read the vLLM paper and understand the page table.
Mental model — the OS analogy
The cache is one big pool of small blocks (e.g., 16 tokens each). Each sequence holds a list of block IDs — its page table — pointing to wherever its KV pages happen to live. Allocation is dynamic; freeing is per-block.
This is exactly virtual memory, applied to GPU tensors.
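The mental model fits in a few lines. This is a toy sketch of the pool + page-table idea, not vLLM's actual API; all names here are illustrative:

```python
# A shared pool of fixed-size KV blocks with a free list, plus a
# per-sequence page table of block IDs. Toy version of the mental model.
BLOCK_SIZE = 16

class BlockAllocator:
    """Hands out block IDs from a shared free list."""
    def __init__(self, n_blocks: int):
        self.free_blocks = list(range(n_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class Sequence:
    """Holds a page table: ordered block IDs for this sequence's KV."""
    def __init__(self):
        self.page_table: list[int] = []
        self.n_tokens = 0

    def append_token(self, alloc: BlockAllocator) -> None:
        if self.n_tokens % BLOCK_SIZE == 0:  # current block full (or none yet)
            self.page_table.append(alloc.allocate())
        self.n_tokens += 1

alloc = BlockAllocator(n_blocks=8)
seq = Sequence()
for _ in range(40):                  # 40 tokens need ceil(40/16) = 3 blocks
    seq.append_token(alloc)
print(len(seq.page_table), len(alloc.free_blocks))  # 3 5
```

A sequence only ever holds the blocks it has actually filled; everything else stays on the free list for other sequences.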
The fragmentation problem
Naive serving allocates a max-context KV slab per sequence:
```python
kv_cache = torch.empty(
    (batch_max, seq_max, n_layers, n_kv_heads, head_dim, 2),  # 2 for K and V
    device='cuda', dtype=torch.bfloat16,
)
```

If seq_max = 8192 and most sequences finish at length 256, you’ve reserved ~32× the memory you actually use. You can’t pack new sequences into that space because each slot is committed to one sequence.
For a 70B model with 8 KV heads, head_dim 128, 80 layers, BF16, max_seq 8K, batch 16:
wasted slab = 2 × 80 × 8 × 128 × 8192 × 2 × 16 ≈ 43 GB
— more than half the 80 GB of an H100, reserved before a single token is generated. And most of those slots are empty most of the time.
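The arithmetic is worth checking directly, with the same parameters as in the text:

```python
# Naive per-batch KV slab: 2 (K and V) x layers x kv_heads x head_dim
# x max_seq x 2 bytes (BF16) x batch size. Parameters from the text.
n_layers, n_kv_heads, head_dim = 80, 8, 128
max_seq, batch, bytes_bf16 = 8192, 16, 2

slab_bytes = 2 * n_layers * n_kv_heads * head_dim * max_seq * bytes_bf16 * batch
print(f"{slab_bytes / 1e9:.1f} GB")       # 42.9 GB reserved up front

# If most sequences finish around 256 tokens, almost none of it is used:
used_fraction = 256 / max_seq
print(f"waste: {1 - used_fraction:.1%}")  # 96.9% of the slab never used
```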
How PagedAttention organizes memory
```python
BLOCK_SIZE = 16  # each block holds 16 contiguous tokens' worth of (K, V)
                 # for all heads, all layers

# Pool of blocks — a single big tensor
kv_pool = torch.empty(
    (n_blocks, n_layers, 2, BLOCK_SIZE, n_kv_heads, head_dim),
    device='cuda', dtype=torch.bfloat16,
)
free_blocks = list(range(n_blocks))  # the "free list"

# Per sequence, a page table — a list of block IDs
class SequenceKV:
    page_table: list[int]   # ordered list of block IDs
    last_block_used: int    # how many slots in the last block are occupied

# Adding a new token:
def append_token(seq, k_new, v_new):
    if not seq.page_table or seq.last_block_used == BLOCK_SIZE:
        seq.page_table.append(allocate_new_block())  # pop from free_blocks
        seq.last_block_used = 0
    # write k_new, v_new into kv_pool at
    # (seq.page_table[-1], :, :, seq.last_block_used, :, :)
    ...
    seq.last_block_used += 1
```

When a sequence ends, its blocks return to the free list — no fragmentation, no compaction.
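The release path is just as simple. A runnable toy version (illustrative names, mirroring the sketch above):

```python
# Releasing a finished sequence is a free-list extend: no memcpy,
# no compaction, no fragmentation left behind.
class SequenceKV:
    def __init__(self, page_table, last_block_used):
        self.page_table = page_table          # ordered block IDs
        self.last_block_used = last_block_used

def free_sequence(seq, free_blocks):
    free_blocks.extend(seq.page_table)        # blocks return to the pool as-is
    seq.page_table = []
    seq.last_block_used = 0

free_blocks = list(range(4, 10))              # blocks 0-3 currently in use
seq = SequenceKV(page_table=[0, 1, 2, 3], last_block_used=7)

free_sequence(seq, free_blocks)
print(sorted(free_blocks))                    # all 10 blocks free again
```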
The custom attention kernel
A standard attention kernel assumes K/V are contiguous in memory. With paged storage, K[t] and K[t+1] might live in different blocks — the kernel has to consult the page table to find each one.
vLLM’s PagedAttention kernel does this: it walks the page table for each query, gathers the K/V slabs for each block, and runs the attention math. Implemented in CUDA + integrated with FlashAttention’s tiling. Performance is within 5% of contiguous attention on Hopper.
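The gather step can be sketched in NumPy. This is a single-head, single-query illustration of the idea — the real kernel fuses the page-table walk with FlashAttention-style tiling in CUDA:

```python
import numpy as np

BLOCK_SIZE, n_blocks, head_dim = 4, 8, 16
rng = np.random.default_rng(0)

# Paged K/V pools; one 10-token sequence scattered across three
# non-contiguous blocks (IDs 5, 2, 7 — deliberately out of order).
k_pool = rng.standard_normal((n_blocks, BLOCK_SIZE, head_dim))
v_pool = rng.standard_normal((n_blocks, BLOCK_SIZE, head_dim))
page_table, seq_len = [5, 2, 7], 10

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(head_dim)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ v

def paged_attention(q, page_table, seq_len):
    # Walk the page table, gather each block's K/V, then do the math.
    k = np.concatenate([k_pool[b] for b in page_table])[:seq_len]
    v = np.concatenate([v_pool[b] for b in page_table])[:seq_len]
    return attention(q, k, v)

q = rng.standard_normal(head_dim)
out_paged = paged_attention(q, page_table, seq_len)

# Same result as if the 10 tokens' K/V were stored contiguously:
k_contig = np.concatenate([k_pool[b] for b in page_table])[:seq_len]
v_contig = np.concatenate([v_pool[b] for b in page_table])[:seq_len]
print(np.allclose(out_paged, attention(q, k_contig, v_contig)))  # True
```

The indirection changes where K/V live, not what the attention computes — which is why the kernel can stay within a few percent of contiguous performance.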
Two extra tricks PagedAttention enables
Copy-on-write for branched sequences. Beam search and parallel sampling generate multiple continuations from the same prompt. With paged storage, all branches share the prompt blocks read-only and only allocate new blocks when they diverge. Memory savings scale with the branch factor.
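A toy copy-on-write sketch, using reference counts (illustrative, not vLLM's block manager):

```python
# Branches share prompt blocks via reference counts; a block is copied
# only when a writer doesn't own it exclusively.
ref_count = {}                       # block_id -> number of sequences using it

def fork(parent_table):
    for b in parent_table:
        ref_count[b] = ref_count.get(b, 1) + 1
    return list(parent_table)        # child shares all blocks: O(1) memory

def write_last_block(table, fresh_block_id):
    last = table[-1]
    if ref_count.get(last, 1) > 1:   # shared -> copy before writing
        ref_count[last] -= 1
        table[-1] = fresh_block_id   # (real code would also memcpy the block)
    return table

prompt = [0, 1, 2]
a = fork(prompt)
b = fork(prompt)
write_last_block(a, fresh_block_id=3)  # a diverges: block 2 copied to 3
print(a, b)  # [0, 1, 3] [0, 1, 2] — blocks 0 and 1 still shared
```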
Prefix caching across requests. If two requests share a system prompt, they share the same blocks. SGLang’s RadixAttention takes this further with a radix tree of shared prefixes. For a serving system answering many queries with the same long preamble, this saves enormous KV memory.
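A minimal content-addressed sketch of the idea. One subtlety: a block's KV depends on the entire preceding context, so the cache key must cover the whole prefix, not just the block's own tokens. Names are illustrative; vLLM's automatic prefix caching hashes block-aligned prefixes in a similar spirit:

```python
# Toy cross-request prefix cache: block-aligned prefixes map to block IDs,
# so identical system prompts reuse the same KV blocks.
BLOCK = 4
cache: dict[tuple, int] = {}   # full prefix (block-aligned) -> block ID
next_block = 0

def blocks_for(tokens):
    global next_block
    table = []
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        key = tuple(tokens[:end])          # key covers the whole prefix
        if key not in cache:
            cache[key] = next_block
            next_block += 1
        table.append(cache[key])
    return table

sys_prompt = list(range(8))    # an 8-token preamble shared by both requests
t1 = blocks_for(sys_prompt + [100, 101, 102, 103])
t2 = blocks_for(sys_prompt + [200, 201, 202, 203])
print(t1, t2)  # first two block IDs identical, third differs
```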
Real numbers — from the vLLM SOSP paper
Llama 7B on a single A100, OPT trace:
| System | Throughput (req/s) | Latency p99 (s) | Memory waste |
|---|---|---|---|
| HuggingFace generate | 4.5 | 28 | 73% |
| FasterTransformer | 7.2 | 18 | 50% |
| vLLM | 23.0 | 6 | 4% |
5× the throughput of HuggingFace, 3× of FasterTransformer. From the same model on the same GPU.
By 2025–2026, vLLM v1 + chunked prefill + speculative decoding ships another ~1.5–2× on top of this.
Run it in your browser — page-table simulator
You’ll see naive wastes ~80%+ on this distribution; paged wastes under 5%. Same workload, ~20× less memory.
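The same comparison can be run locally in a few lines. The length distribution here is an assumption (100–700 tokens against an 8K max context), chosen to roughly match the scenario described:

```python
import random

# Compare reserved vs used memory for naive (full slab per sequence)
# and paged (round each sequence up to whole 16-token blocks).
MAX_SEQ, BLOCK = 8192, 16
random.seed(0)
lengths = [random.randint(100, 700) for _ in range(100)]

used = sum(lengths)
naive_reserved = len(lengths) * MAX_SEQ
paged_reserved = sum(-(-l // BLOCK) * BLOCK for l in lengths)  # ceil to blocks

print(f"naive waste: {1 - used / naive_reserved:.0%}")
print(f"paged waste: {1 - used / paged_reserved:.0%}")
print(f"memory ratio: {naive_reserved / paged_reserved:.1f}x")
```

Paged waste is bounded by one partially filled block per sequence (at most 15 tokens here), which is why it stays in the low single digits regardless of the length distribution.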
Key takeaways
- PagedAttention = OS virtual memory for the KV cache. Fixed-size blocks, per-sequence page tables, dynamic allocation, ~zero fragmentation.
- The custom kernel handles non-contiguous K/V access. Within ~5% of FlashAttention contiguous performance.
- Two structural unlocks come for free: copy-on-write (cheap branching) and cross-request prefix sharing (huge for templated prompts).
- vLLM’s defaults are the paper’s design. If you operate vLLM, tune `block_size` (default 16) and `gpu_memory_utilization` (default 0.9). Most other knobs are second-order.
- SGLang’s RadixAttention generalizes prefix sharing. Same paged storage; smarter sharing topology. Use it when you have a prompt forest, not just a prefix.
Go deeper
- Paper: *Efficient Memory Management for LLM Serving with PagedAttention*. The vLLM paper. Required reading. Read it twice.
- Blog: vLLM v0.6 perf update. How chunked prefill + PagedAttention compose into the v1 engine.
- Paper: *Efficiently Programming Large Language Models using SGLang*. The RadixAttention paper. Generalizes prefix sharing.
- Repo: vllm-project/vllm. Reference implementation. Read `vllm/core/block_manager.py` and `vllm/attention/backends/flash_attn.py`.
- Repo: sgl-project/sglang. The RadixAttention serving stack. Often beats vLLM on workloads with shared prefixes.
- Video: Woosuk Kwon — vLLM @ SOSP 2023. Author talk. Cleanest 20-minute explanation of the design.
- Paper: *Mooncake: Trading More Storage for Less Computation* (2024). Disaggregated KV-cache architecture from Moonshot. Where serving systems are headed.
- Docs: vLLM documentation. Operational reference. Skim the "Engine Args" page once a quarter — it changes.