KV Cache
Recomputing the entire prefix for every generated token would be prohibitively expensive; the KV cache is what lets you avoid that. But the cache also dominates LLM serving memory — which is why PagedAttention, prefix caching, and KV quantization are central topics in modern serving systems.
4 lessons · ~62 min total
Module capstone — build it
PagedAttention from scratch — in 300 lines
Build the OS-style KV-cache manager that powers vLLM. Watch fragmentation disappear in front of you.
Advanced · One focused weekend (~8 h) · Free Colab T4
A pure-Python paged KV cache with a block table, block allocator, free-list, and copy-on-write for branching sequences. The artifact is the simulator + a fragmentation plot vs a naive contiguous cache + a one-page README.
Build it — step by step
01 Define a Sequence object and a naive contiguous cache 45 min
Each Sequence has: id, prompt_len, output_len, current_pos. The naive cache is a single big tensor `[max_concurrent_seqs, max_seq_len, n_layers, n_heads, d_head]`; each sequence is allocated its slot at peak length up front (a sketch follows below).
checkpoint You can run 100 sequences with varied output lengths and the naive cache reports its memory waste (allocated minus actually-used).
watch out Allocating per-sequence at the *prompt-only* length forgets that decode also needs space; you'll OOM at runtime, not at allocation.
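A minimal sketch of what step 01 could look like, assuming "peak length" means one full `max_seq_len` row per sequence; the class and method names (`NaiveCache`, `waste`) are illustrative, not prescribed:

```python
from dataclasses import dataclass

import torch


@dataclass
class Sequence:
    id: int
    prompt_len: int
    output_len: int
    current_pos: int = 0  # next token slot this sequence will write


class NaiveCache:
    """Naive contiguous cache: one max_seq_len slot per sequence, allocated up front."""

    def __init__(self, max_concurrent_seqs, max_seq_len, n_layers, n_heads, d_head):
        self.max_seq_len = max_seq_len
        # Shape from the step description; a real cache also needs K and V separately
        # (e.g. a leading size-2 dim), but one tensor is enough for waste accounting.
        self.kv = torch.zeros(max_concurrent_seqs, max_seq_len, n_layers, n_heads, d_head)
        self.row_of = {}                                   # seq.id -> row in self.kv
        self.free_rows = list(range(max_concurrent_seqs))

    def allocate(self, seq: Sequence):
        self.row_of[seq.id] = self.free_rows.pop()

    def free(self, seq: Sequence):
        self.free_rows.append(self.row_of.pop(seq.id))

    def waste(self, live_seqs):
        """Allocated token slots minus slots actually holding KV entries."""
        allocated = len(self.row_of) * self.max_seq_len
        used = sum(s.current_pos for s in live_seqs if s.id in self.row_of)
        return allocated - used
```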
02 Build the block allocator — fixed-size KV blocks of 16 tokens 90 min
Allocate one big pool of blocks. Each block holds 16 tokens worth of KV across all layers/heads. Maintain a free-list (a `deque`). Track which sequence owns which blocks via a per-sequence block_table list.
checkpoint Allocate 1000 blocks, hand them out to 100 sequences as needed, free as sequences finish, never run out.
watch out Don't pre-allocate worst-case per sequence — the whole point of paging is on-demand. Allocate a new block only when the sequence crosses a 16-token boundary.
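A sketch of step 02's allocator. The `deque` free-list and the per-sequence `block_table` are what the step asks for; the class names are illustrative:

```python
from collections import deque

BLOCK_SIZE = 16  # tokens of KV per block, across all layers/heads


class BlockAllocator:
    """Pool of fixed-size blocks handed out on demand via a free-list."""

    def __init__(self, num_blocks: int):
        self.free_list = deque(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_list:
            raise MemoryError("out of KV blocks")
        return self.free_list.popleft()

    def release(self, block_id: int):
        self.free_list.append(block_id)


class PagedSeq:
    """Tracks one sequence's block_table: logical block index -> physical block id."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self):
        # A new block is needed only when the sequence crosses a 16-token boundary.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free_all(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
```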
03 Logical-to-physical address translation 60 min
Given a sequence and a token position t, return the physical block id and the offset within the block: the block id is `block_table[t // 16]` and the offset is `t % 16`. This is the "page table walk" that the paged attention kernel does on the GPU (a sketch follows below).
checkpoint You can append 1000 tokens to a sequence and read any of them back without bugs.
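The translation itself is two integer operations. In this sketch the `kv_store` dict is just a stand-in for the real block pool, to show the append-and-read-back round trip from the checkpoint:

```python
BLOCK_SIZE = 16


def translate(block_table, t, block_size=BLOCK_SIZE):
    """Logical token position t -> (physical block id, offset within that block)."""
    return block_table[t // block_size], t % block_size


# Illustrative round trip: write a token's KV, then read it back through the block table.
kv_store = {}  # (physical_block_id, offset) -> whatever per-token KV payload you store


def write_token(block_table, t, value):
    kv_store[translate(block_table, t)] = value


def read_token(block_table, t):
    return kv_store[translate(block_table, t)]
```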
04 Copy-on-write for branching sequences 60 min
Implement `fork(seq) -> new_seq` that shares the prefix block_table with the parent. When either sequence writes to a shared block, copy it. Useful for beam search and parallel sampling.
checkpoint Fork a sequence at position 100, both branches generate 50 more tokens, the fully filled prefix blocks 0–5 stay shared, and blocks 6+ diverge (position 100 falls in block 6, so it is copied on the first write).
watch out Reference-count blocks, not sequences. A block can be shared by N>2 sequences; your refcount logic must increment per fork and decrement per release.
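A sketch of a refcounted copy-on-write manager; the class name and the `copy_payload` callback (which moves the actual KV data between blocks) are illustrative:

```python
from collections import deque


class CowBlockManager:
    """Block pool with per-block reference counts so forked sequences share prefix blocks."""

    def __init__(self, num_blocks: int):
        self.free_list = deque(range(num_blocks))
        self.ref_count = {}  # physical block id -> number of sequences referencing it

    def allocate(self) -> int:
        if not self.free_list:
            raise MemoryError("out of KV blocks")
        block = self.free_list.popleft()
        self.ref_count[block] = 1
        return block

    def release(self, block: int):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:        # last owner gone -> block returns to the pool
            del self.ref_count[block]
            self.free_list.append(block)

    def fork(self, parent_block_table: list[int]) -> list[int]:
        """Child gets its own block_table pointing at the parent's (now shared) blocks."""
        for block in parent_block_table:
            self.ref_count[block] += 1
        return list(parent_block_table)

    def write(self, block_table: list[int], logical_idx: int, copy_payload) -> int:
        """Copy-on-write: if the target block is shared, give this sequence a private copy."""
        block = block_table[logical_idx]
        if self.ref_count[block] > 1:
            private = self.allocate()
            copy_payload(src=block, dst=private)  # caller copies the KV payload over
            self.release(block)                   # drop this sequence's ref to the shared block
            block_table[logical_idx] = private
            block = private
        return block
```

With this sketch, forking at position 100 and then writing position 100 from either branch is exactly what triggers the copy of the partially filled block 6 in the checkpoint above.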
05 Simulate 1000 sequences and plot fragmentation 90 min
Generate a workload: Poisson arrival times, prompt lengths uniform in [200, 4000], output lengths uniform in [50, 500]. Run for N steps. At each step compute `used_blocks / total_blocks` and `logical_tokens / max_logical` for both the naive and the paged cache. Plot both curves (a workload-generation sketch follows below).
checkpoint Plot shows naive cache hits its waste ceiling (~50%) early; paged stays near 100% utilization.
watch out For the naive comparison to be fair, both versions must serve the same workload. Use a fixed random seed.
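A workload-generation sketch under the distributions the step names; the seed value and the inter-arrival mean are arbitrary choices:

```python
import random

random.seed(0)  # fixed seed so naive and paged replay the identical workload


def make_workload(n_seqs=1000, mean_interarrival=2.0):
    """Poisson arrivals (exponential inter-arrival gaps), uniform prompt/output lengths."""
    t, workload = 0.0, []
    for i in range(n_seqs):
        t += random.expovariate(1.0 / mean_interarrival)
        workload.append({
            "id": i,
            "arrival": t,
            "prompt_len": random.randint(200, 4000),
            "output_len": random.randint(50, 500),
        })
    return workload
```

Feed the same workload to both cache implementations, append the two utilization ratios to lists at every step, and the plot is a pair of `matplotlib` line plots over those lists.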
06 README + push 45 min
README explains the block-table data structure with one diagram, embeds the fragmentation plot, and links to the PagedAttention paper. Single-file implementation + benchmark.
checkpoint A reader runs `python simulate.py` and gets the plot.
You walk away with
A working paged KV-cache manager in pure Python — the same data structure vLLM ships
Fluency with block tables, free-lists, and copy-on-write — OS-style memory management
A fragmentation benchmark you can re-run when comparing serving stacks
A clean repo that maps directly to the PagedAttention paper figures
Tools you'll use
PyTorch
vLLM source (reference)
PagedAttention paper (Kwon et al., SOSP 2023)
matplotlib for the fragmentation plot