Disaggregated Serving
Prereqs: KV Cache Basics, PagedAttention, Prefix & RadixAttention. This lesson is about where the cache lives and how it gets there.
When a request hits Mooncake (Kimi’s serving stack) or any 2025-era frontier production system, the architecture isn’t “one GPU runs your prompt forward and streams tokens back.” It’s two pools. A prefill pool runs your prompt through every layer in one giant compute-heavy matmul. The resulting KV cache is then transferred over NVLink or RDMA to a decode pool that streams tokens to you. Two physically distinct GPU jobs, separated by a KV transport, glued by a router.
The reason is one fact about LLM inference: prefill is compute-bound and decode is HBM-bandwidth-bound, and they want opposite hardware configurations. Prefill loves big tensor parallelism and small batches (the prompt’s N tokens already give parallelism). Decode loves small tensor parallelism and giant batches (amortize the cost of re-reading the cache across many users). Force them onto the same GPU and you get a compromise both jobs hate: prefill stalls the decode batch, decode dilutes prefill’s MFU. Pull them apart and each pool runs at the right shape — net throughput goes up 2–5× under the same SLOs.
Disaggregation was published as DistServe (OSDI 2024), shipped at scale in Mooncake (Moonshot, 2024), and now lives in production form in vLLM-disagg, SGLang-disagg, and NVIDIA’s NIXL/Dynamo stack. By 2026 it’s standard for any frontier serving deployment doing 8K+ contexts at scale. The transport — typically a few-millisecond NVLink hop or 10s-of-millisecond RDMA — hides behind the first decode step via layer-wise pipelining.
TL;DR
- Prefill is compute-bound (long parallel matmul on the prompt). Decode is memory-bandwidth-bound (one token at a time, re-reading the whole KV cache). Running them on the same GPU forces a compromise both jobs hate.
- Disaggregated serving = two GPU pools. The prefill pool runs prompts to completion, then transfers KV to a decode pool that streams tokens to the user. Each pool is sized and tuned for its job.
- Pioneered as a research idea in DistServe (OSDI 2024) and shipped at scale in Mooncake (Moonshot AI, 2024) and vLLM-disagg / SGLang-disagg (2024–2025). By 2026 it’s standard in any frontier-model serving stack.
- The KV transfer between pools is the crux. NVLink intra-node, RDMA / NIXL inter-node. Round-trip cost is usually 5–30 ms — hidden behind the first decode step.
- Net win: 3–4× more decode throughput at the same SLO, or 2× tighter TTFT at the same throughput. Big wins for long contexts and SLOs that bound TTFT and TPOT (time-per-output-token) separately.
Why TTFT and TPOT are separate SLOs
Modern SLOs aren’t a single latency number — they’re a pair. TTFT (time-to-first-token) controls how chat feels. TPOT (time-per-output-token) controls how fast it streams. On a single co-located GPU, prefill steals tokens from decode (queueing) and decode pollutes the prefill batch (low utilization). You end up over-provisioning to satisfy both. Splitting the workload lets each pool run at the right batch size, the right tensor parallelism, and the right scheduler — so the same fleet serves dramatically more users at the same SLO.
This is also the architectural shift that made 128K+ context viable in production at frontier labs. Long-context prefill is brutal; co-locating it with decode caps how many concurrent decoders you can interleave. Disaggregation removes the cap.
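To make the pair concrete, here is a minimal back-of-envelope sketch, assuming a simple no-overlap timeline (queueing, then prefill, then uniform decode steps); the function and numbers are illustrative, not any framework's API:

```python
def latency_breakdown(queue_s, prefill_s, decode_step_s, n_output_tokens):
    """Back-of-envelope split of one request's latency into TTFT and TPOT."""
    ttft = queue_s + prefill_s                        # time until the first token appears
    tpot = decode_step_s                              # steady-state seconds per streamed token
    total = ttft + decode_step_s * (n_output_tokens - 1)
    return ttft, tpot, total

# Example: 50 ms queueing, 900 ms prefill, 25 ms per decode step, 256 output tokens
print(latency_breakdown(0.05, 0.9, 0.025, 256))       # (0.95, 0.025, ~7.3 s)
```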
Mental model
Prefill pools run few sequences at a time with high tensor parallelism — what compute-heavy matmuls love. Decode pools run lots of sequences at a time with lower tensor parallelism but maximum batch — what memory-bandwidth-bound decode loves. The KV cache moves in the middle.
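As a picture of that asymmetry, here are hypothetical pool shapes (illustrative values only, matching the worked example later in this lesson):

```python
# Hypothetical pool shapes for a 70B-class model; illustrative, not a real config file.
POOLS = {
    "prefill": {"tensor_parallel": 8, "max_batch": 2,   "goal": "maximize MFU on long prompts"},
    "decode":  {"tensor_parallel": 2, "max_batch": 128, "goal": "amortize HBM reads across users"},
}
```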
Why one GPU is bad at both jobs
| Phase | Bottleneck | Loves | Hates |
|---|---|---|---|
| Prefill | FLOPs | Big TP, low batch (tokens already provide parallelism via N) | Decoders interrupting batch shape |
| Decode | HBM bandwidth | Big batch (amortize KV reads) | Long prefill stalling the batch |
Run them together and the scheduler has to pick: serve a fresh long prompt → decode batch starves; pack many short prompts in a chunked-prefill schedule → prefill is sliced into less-efficient chunks. Chunked prefill (next module) is the co-located compromise. Disaggregation is the pull-them-apart answer.
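A rough roofline check shows why the table comes out this way. The figures below are assumed single-GPU numbers for an H100-class part and a 70B dense model, and they ignore tensor parallelism and KV-cache reads; treat them as a sketch, not a measurement:

```python
# Assumed H100-class figures: dense BF16 compute and HBM bandwidth (illustrative).
FLOPS_PEAK = 1.0e15       # ~1 PFLOP/s
HBM_BW = 3.3e12           # ~3.3 TB/s
PARAMS = 70e9             # 70B parameters
PARAM_BYTES = PARAMS * 2  # BF16 weights

def phase_time(tokens):
    """Time for one forward pass over `tokens` tokens, whichever limit binds."""
    compute_s = 2 * PARAMS * tokens / FLOPS_PEAK  # ~2 FLOPs per parameter per token
    memory_s = PARAM_BYTES / HBM_BW               # weights are read once per pass
    return compute_s, memory_s

print(phase_time(8192))  # prefill: compute time >> memory time -> compute-bound
print(phase_time(1))     # decode:  memory time >> compute time -> bandwidth-bound
```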
What “disaggregated” actually does
- The router receives a request. It picks a prefill instance (idle or shortest queue).
- The prefill instance runs the full prompt forward, materializing K and V across all L layers and all KV heads.
- The router picks a decode instance (lowest-latency batch slot available).
- The KV cache is transferred — usually layer by layer, often overlapped with the next layer’s prefill — to the decode instance’s HBM via NVLink (intra-node) or RDMA (inter-node). vLLM-disagg uses NIXL (NVIDIA’s transfer library); SGLang uses Mooncake’s transfer engine.
- The decode instance runs autoregressive generation. The user sees tokens streaming.
The transfer cost is real but bounded. For a 70B model with GQA and a 4K prompt, KV is ~1 GB; 25 GB/s NVLink → ~40 ms. The first decode step starts as soon as the first layer’s KV arrives — overlapping the rest of the transfer. Net visible TTFT is prefill_time + ~5 ms.
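As a sanity check on the ~1 GB figure, here is the arithmetic for an assumed Llama-70B-style shape (80 layers, 8 KV heads, head_dim 128, FP16 KV); the dimensions are typical for a 70B GQA model rather than any specific checkpoint:

```python
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
prompt_tokens = 4096

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V
kv_total_gb = kv_bytes_per_token * prompt_tokens / 1e9               # ~1.3 GB

link_gb_per_s = 25                                                   # effective bandwidth assumed above
transfer_ms = kv_total_gb / link_gb_per_s * 1e3                      # ~54 ms, same ballpark as ~40 ms

print(f"{kv_bytes_per_token / 1024:.0f} KiB/token, {kv_total_gb:.2f} GB, ~{transfer_ms:.0f} ms")
```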
How the math changes
Take a 70B model on H100, 8K prompt, 256 output tokens, 1024 concurrent users at the same throughput.
Co-located (single pool, TP=8, chunked prefill):
- Practical decode batch: ~32 (any larger and TPOT exceeds SLO due to long-prefill chunks).
- Prefill MFU: ~25% (chunked schedule + decode interleaving).
- Sustained throughput: limited by decode batch.
Disaggregated (prefill TP=8 batch=2, decode TP=2 batch=128):
- Prefill MFU: ~40% (no decode contamination; can run prompts back-to-back).
- Decode batch: ~128 (no prefill stalls).
- Sustained throughput: 2.8–3.2× higher under same TTFT/TPOT SLO. Mooncake reports up to 5× on long-context workloads.
Numbers are illustrative — your actual ratio depends on prompt-to-output ratio, KV size per request, and interconnect.
The KV transfer is the only hard part
Three knobs:
- Layer-wise pipelining. Don’t wait for all L layers’ KV to arrive — start decoding as soon as layer 0 lands. Each subsequent layer’s KV must arrive before that layer’s first decode step.
- Compression. Some stacks compress KV during transfer (e.g., FP16 KV quantized to FP8 for the wire) and decompress in decode. Saves bandwidth at small accuracy cost.
- Topology awareness. Schedule prefill→decode pairs that share an NVLink switch when possible. Cross-node RDMA is 4–10× slower than intra-node NVLink.
```python
# Skeleton of the transfer engine, ignoring layer pipelining.
def serve(req):
    prefill_inst = router.pick_prefill()
    kv = prefill_inst.run_prefill(req.prompt)  # (L, kv_heads, T, d_head)
    decode_inst = router.pick_decode()
    handle = transport.send_kv(
        src=prefill_inst,
        dst=decode_inst,
        kv=kv,
        layer_pipeline=True,  # start decoding as layer 0 arrives
    )
    return decode_inst.stream_decode(req, kv_handle=handle)
```

Real implementations (Mooncake’s Transfer Engine, NVIDIA NIXL, SGLang’s PD-disagg pipeline) wrap this with retry, flow control, and layer-level synchronization primitives.
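As a hedged sketch of what that layer-level synchronization can look like on the decode side: the transport marks each layer's KV as it lands, and only the first decode step ever has to wait. The class and helpers below are hypothetical, not the API of any of the stacks named above.

```python
import threading

class KVArrival:
    """Per-layer arrival flags for one request's transferred KV cache."""
    def __init__(self, num_layers: int):
        self._arrived = [threading.Event() for _ in range(num_layers)]

    def mark(self, layer_idx: int) -> None:   # called from the transport's receive thread
        self._arrived[layer_idx].set()

    def wait(self, layer_idx: int) -> None:   # called from the decode loop
        self._arrived[layer_idx].wait()

def first_decode_step(model_layers, arrival: KVArrival, hidden):
    # Later layers usually land while earlier ones are still computing,
    # so these waits are almost always no-ops after layer 0.
    for i, layer in enumerate(model_layers):
        arrival.wait(i)                       # block only if layer i's KV hasn't landed yet
        hidden = layer(hidden)                # attention in this layer reads the transferred KV
    return hidden
```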
When not to disaggregate
- Short prompts (under ~256 tokens) — prefill is so cheap that co-location is fine. Transfer overhead dominates.
- Tiny fleets (fewer than 8 GPUs) — fixed-cost overhead of two pools eats the win.
- Single-tenant batch jobs — no SLO pressure means the chunked-prefill compromise is fine.
For chat, agents, evals, retrieval-heavy workloads, or anything pushing 8K+ contexts, disaggregation pays.
Run it in your browser
A back-of-envelope simulator. Vary prompt length, output length, KV transfer bandwidth — see when disaggregation wins.
The numbers are stylized but the shape is right: as prompt length grows, the prefill phase gets long enough that co-located decode batches stall, and disaggregation’s win compounds.
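If you want the same shape offline, a stylized version of the estimate fits in a few lines. Every rate below is an assumption chosen to mirror this lesson's example numbers, not a benchmark:

```python
def back_of_envelope(prompt_tokens, output_tokens, kv_gb, num_layers=80,
                     link_gb_per_s=25, prefill_tok_per_s=50_000, decode_step_ms=25):
    """Stylized estimate; every rate is an assumption, not a measurement."""
    prefill_s = prompt_tokens / prefill_tok_per_s
    decode_s = output_tokens * decode_step_ms / 1e3

    # Co-located: each interleaved prefill stalls the shared decode batch for ~prefill_s.
    colocated_stall_fraction = prefill_s / (prefill_s + decode_s)

    # Disaggregated: KV moves layer by layer, overlapped with prefill of later layers,
    # so roughly only the last layer's slice of the transfer is visible before decoding.
    visible_transfer_s = (kv_gb / link_gb_per_s) / num_layers

    return colocated_stall_fraction, visible_transfer_s

print(back_of_envelope(512, 256, kv_gb=0.17))     # short prompt: negligible stall, disagg adds little
print(back_of_envelope(64_000, 256, kv_gb=21.0))  # long prompt: stall grows, visible transfer ~10 ms
```

The co-located stall fraction grows with prompt length, which is what eventually trips the TPOT SLO; the visible transfer stays roughly flat because it scales with one layer's KV slice, not the whole cache.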
Key takeaways
- Prefill and decode are different jobs. Compute-bound vs bandwidth-bound, big TP vs big batch. Forcing them onto the same GPU compromises both.
- Disaggregation = two pools + a KV transport. Each pool tunes for its job; the transport (NVLink/RDMA) hides behind the first decode step via layer pipelining.
- The throughput multiplier is real. Published results: 2–5× under same SLO. Bigger wins as prompts get longer.
- vLLM-disagg, SGLang-disagg, and Mooncake are the three production references in 2025–2026. All open-source, all built on a Transfer Engine abstraction.
- Disaggregation, prefix caching, paged attention, and chunked prefill are orthogonal optimizations. A modern stack uses all four. Each addresses a different inefficiency in the same KV-bound pipeline.
Go deeper
- Paper: DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving. The paper that named the technique. Read sections 3 and 4 for the goodput formulation that drives every later system.
- Paper: Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Production system serving Kimi at scale. Detailed numbers on long-context (100K+ token) disaggregation.
- Blog: vLLM — Disaggregated Prefill / Decode. Authoritative writeup with benchmarks and config knobs. Pair with the docs.
- Blog: SGLang — Large-Scale Disaggregated Serving. Real-fleet numbers including Mooncake-transport integration.
- Docs: NVIDIA NIXL / Dynamo. NIXL is the KV-transfer primitive most disaggregated stacks now build on. Dynamo wraps it for serving.
- Repo: kvcache-ai/Mooncake. The transfer engine in production-ready form. Read `mooncake-transfer-engine` for the wire protocol.
- Repo: sgl-project/sglang. See `python/sglang/srt/disaggregation/` for the prefill→decode pipeline.
- Repo: vllm-project/vllm. See `vllm/v1/distributed/` and the disagg-related entries in `vllm/v1/core/`.