Speculative Decoding
When you start vLLM with --speculative-model meta-llama/Llama-3.2-1B-Instruct --num-speculative-tokens 5, the engine quietly attaches a tiny draft model alongside your big target model and starts a different decode loop. Instead of “big model produces one token, append, repeat,” it becomes “small model guesses K tokens cheaply, big model checks all K in one shot, accept what it agrees with.” On chat and code workloads this typically doubles or triples decode throughput, and — this is the part that surprises people — it produces samples from exactly the same distribution as running the target model alone. Same model, same quality, more tokens per second.
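For reference, the equivalent offline-API call looks roughly like this. It is a sketch assuming the vLLM 0.7-era keyword arguments; the speculative-decoding interface has changed across releases, so check the docs for your version:

```python
from vllm import LLM, SamplingParams

# Same setup as the server flags above, via the offline API.
# tensor_parallel_size depends on your hardware; a 70B target needs several GPUs.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,
    tensor_parallel_size=4,
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.8, max_tokens=256),
)
print(outputs[0].outputs[0].text)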
The reason it works is one fact: LLM decode is memory-bandwidth-bound, not compute-bound. Each generated token reads the entire model from HBM once. The compute for that one forward is barely used — at batch size 1, the streaming multiprocessors mostly sit idle waiting for weights. Speculative decoding spends that idle compute on something useful: verifying several candidate tokens in parallel, in a single forward pass that costs roughly the same as one regular forward (the FFN dominates; attention over K extra tokens is small). Same memory traffic, more output. As of April 2026 it is standard in vLLM, SGLang, TGI, and TensorRT-LLM, and the choice is no longer “should I turn it on?” but “which variant — vanilla, Medusa, EAGLE-3, or PLD — fits my workload?”
TL;DR
- A small draft model generates candidate tokens. The big target model verifies all in a single forward pass. Accepted tokens become output; rejected ones trigger a fallback sample.
- The math: a single K-token target forward is much cheaper than K sequential forwards. As long as the draft is right most of the time, you get 2–3× decode speedup with zero distribution change.
- EAGLE-3 (2024–2025) is the current frontier — auto-regressive draft heads attached to the target, ~80% acceptance rate, ~3× speedup on common workloads.
- Medusa uses parallel multi-token prediction heads on the target itself. Simpler than draft-model methods; ~2× speedup.
- Lookahead / PLD (Prompt Lookup Decoding) needs no draft model — speculates from n-grams in the prompt. Free at inference time; works best on summarization / repetitive output.
Mental model
Draft = cheap. Verify = one big-model forward over a sequence of K tokens — costs roughly the same as one forward over 1 token because attention over K tokens is small relative to the FFN work that dominates.
The math (rejection sampling)
Let $p$ be the target distribution at step $t$ and $q$ be the draft’s distribution. The draft samples $\tilde{x} \sim q$. We accept it with probability $\min\!\left(1, \frac{p(\tilde{x})}{q(\tilde{x})}\right)$. If rejected, we resample from the residual distribution $\max(0, p - q)$, normalized.
Crucial property: the resulting samples are exactly drawn from $p$. The output distribution is identical to running the target alone. No quality change. This is the property that makes speculative decoding a free lunch rather than a quality tradeoff.
In practice many implementations use greedy (argmax) verification — accept iff the target’s argmax matches the draft’s token. That is exact under greedy decoding but slightly biased when sampling; it is simpler and faster.
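Here is a minimal NumPy sketch of one verification round under those rules. Function and argument names are illustrative; real engines run this batched on logits inside fused kernels:

```python
import numpy as np

def verify_round(draft_tokens, q_probs, p_probs, p_bonus, rng=None):
    """One speculative-decoding verification round (rejection-sampling rule).

    draft_tokens: the K token ids proposed by the draft model
    q_probs[i]:   draft distribution over the vocab at draft position i, shape (K, V)
    p_probs[i]:   target distribution at the same position, shape (K, V)
    p_bonus:      target distribution one past the last draft token, shape (V,)
    Returns the accepted prefix plus exactly one sampled token, so every round
    emits at least one token and the output stays distributed exactly as p.
    """
    rng = rng or np.random.default_rng()
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):   # accept with prob min(1, p/q)
            out.append(int(tok))
            continue
        residual = np.maximum(p - q, 0.0)              # rejected: resample from (p - q)+
        out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
        return out                                     # drop everything after the rejection
    out.append(int(rng.choice(len(p_bonus), p=p_bonus)))   # all accepted: free bonus token
    return out
```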
The cost model
Per “speculation round” of $K$ draft tokens (Leviathan et al., 2023, §3):

- $K$ cheap draft forwards, one per candidate token: cost $K \cdot c_{\text{draft}}$
- 1 target forward over all $K$ candidates at once: cost $\approx c_{\text{target}}$ (parallel; the big model’s FFN dominates)
- Expected accepted tokens per round (geometric prefix plus the bonus token from rejection sampling): $\mathbb{E}[\text{tokens}] = \dfrac{1 - \alpha^{K+1}}{1 - \alpha}$, where $\alpha$ is the per-position acceptance probability.

Speedup vs. naive decoding ≈ $\dfrac{1 - \alpha^{K+1}}{(1 - \alpha)(Kc + 1)}$, where $c = c_{\text{draft}} / c_{\text{target}}$.

For a 70B target with a 1B draft ($c \approx 1/70$), $K = 5$, $\alpha = 0.7$:

- $\mathbb{E}[\text{tokens}] = \dfrac{1 - 0.7^{6}}{0.3} \approx 2.9$ tokens per round
- speedup ≈ $2.9 / (5/70 + 1) \approx 2.7\times$ in theory; measured numbers (table below) come in lower once scheduling and kernel overheads are counted.
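A few lines of Python reproduce that arithmetic (same formula; the plugged-in numbers are the 70B/1B example above):

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per speculation round: geometric prefix + bonus token."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Speedup over naive decoding; cost_ratio = c_draft / c_target."""
    return expected_tokens(alpha, k) / (k * cost_ratio + 1)

print(expected_tokens(0.7, 5))     # ~2.94 tokens per round
print(speedup(0.7, 5, 1 / 70))     # ~2.75x theoretical speedup
```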
The four families
1. Vanilla speculative decoding (Leviathan et al., Chen et al., 2023)
Pair a small draft model (e.g., 1B) with a big target (e.g., 70B). Both must share the same tokenizer. ~70% acceptance rate; 2× speedup.
2. Medusa (Cai et al., 2024)
Add several small “Medusa heads” on top of the target’s last hidden state. Head $k$ predicts the token at offset $k$ from the current position; all heads run in parallel. No separate draft model needed.
```
Last hidden state h_t → Medusa head 1 → t_{t+1}
                      → Medusa head 2 → t_{t+2}
                      → Medusa head 3 → t_{t+3}
                      ...
```

Verification: run the target on the predicted candidate tree and accept the longest matching path. Simpler than draft models; needs a ~1B-token fine-tune for the heads. ~2× speedup, lower than full draft-model methods.
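A stripped-down PyTorch sketch of the head structure (illustrative only: real Medusa heads use a residual SiLU block per head and tree-structured verification, both omitted here):

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Simplified Medusa: one small head per future offset, all fed the same hidden state."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        # One vocab projection per offset; the published heads add a residual SiLU block first.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size, bias=False) for _ in range(num_heads)
        )

    def forward(self, h_t: torch.Tensor) -> list[torch.Tensor]:
        # h_t: (batch, hidden_size), the target's last hidden state at the current position.
        # Head k returns logits for the token at offset k+1; all heads run in parallel.
        return [head(h_t) for head in self.heads]

# Usage sketch: candidates = [logits.argmax(-1) for logits in medusa(h_t)],
# then one target forward verifies the candidate tree and keeps the longest matching path.
```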
3. EAGLE-3 (Li et al., 2024 → 2025)
Auto-regressive draft head attached to the target. Each draft step takes the target’s previous hidden state + the previous predicted token, predicts the next. Sequential within a draft round, but very cheap because it shares the target’s hidden representations.
EAGLE-3 (the 2025 revision, which trains the draft head on its own multi-step rollouts) hits ~80% acceptance and ~3× speedup on chat / instruction workloads. Currently the frontier as of April 2026.
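Schematically, one draft step looks like the sketch below. This is a simplification, not the published architecture: the real EAGLE draft is a full decoder layer attending over the cached sequence with the target’s frozen LM head, and EAGLE-3 additionally fuses hidden states from several target layers and drafts a tree rather than a chain.

```python
import torch
import torch.nn as nn

class EagleStyleDraft(nn.Module):
    """Simplified EAGLE-style draft step: predict the next token cheaply from
    (target hidden state, previous token) instead of running the big model again."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.fuse = nn.Sequential(                 # merge [hidden ; token embedding]
            nn.Linear(2 * hidden_size, hidden_size),
            nn.SiLU(),
            nn.Linear(hidden_size, hidden_size),
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def step(self, target_hidden: torch.Tensor, prev_token: torch.Tensor):
        # target_hidden: (batch, hidden)  hidden state the target produced last round
        # prev_token:    (batch,)         last verified or drafted token id
        h = self.fuse(torch.cat([target_hidden, self.embed(prev_token)], dim=-1))
        logits = self.lm_head(h)
        return logits.argmax(-1), h      # drafted token + state fed into the next draft step
```

Each draft step is a couple of small matrix multiplies instead of a 70B forward, which is why several sequential draft steps per round stay cheap.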
4. Prompt Lookup Decoding (PLD) / lookahead decoding
No draft model at all. Speculate from n-gram matches in the prompt or recent output. If the model just generated “the FastAPI app” and the prompt contains “the FastAPI app routes requests…”, speculate “routes requests…” from the prompt.
Free at inference time, works best on summarization / RAG / code-completion where output strongly correlates with input. ~1.5–2× speedup on the right workloads, ~1× on creative tasks.
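The whole trick fits in a few lines; the sketch below mirrors the published n-gram lookup idea (names and the toy token ids are illustrative, not any specific library’s API):

```python
def prompt_lookup_draft(tokens: list[int], ngram: int = 3, max_draft: int = 5) -> list[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence.

    `tokens` is the full context so far (prompt + generated output). If the last
    `ngram` tokens appeared before, the tokens that followed them become the draft.
    """
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Search right-to-left, skipping the trailing n-gram itself, so the most recent match wins.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            return tokens[start + ngram : start + ngram + max_draft]
    return []   # no match: fall back to ordinary one-token-at-a-time decoding

# The last 3 tokens [5, 6, 7] also occur at the start, so the tokens after that
# earlier occurrence are proposed and verified in one target forward.
print(prompt_lookup_draft([5, 6, 7, 8, 9, 5, 6, 7]))   # -> [8, 9, 5, 6, 7]
```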
Real numbers — vLLM 0.7 + Llama-3.1-70B + Llama-3.2-1B draft
| Method | tok/s | Speedup | Acceptance |
|---|---|---|---|
| Baseline (no spec) | 38 | 1.0× | — |
| Vanilla spec, K=5 | 78 | 2.05× | 71% |
| Medusa-2 | 72 | 1.89× | — |
| EAGLE-3 | 108 | 2.84× | 78% |
| PLD (RAG workload) | 60 | 1.58× | n-gram dep. |
EAGLE-3 is the winner on chat workloads through 2025–2026 by a clear margin.
The math, visualized
Sweep K in the speedup formula above and the gain flattens out around K ≈ 5–8 for common acceptance rates: past that, additional draft tokens are unlikely to be accepted and the draft cost catches up.
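A quick sweep shows the shape (same formula as the cost model above, with the 70B-target / 1B-draft, 70%-acceptance example):

```python
def speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Theoretical speedup per round: expected accepted tokens / relative round cost."""
    return (1 - alpha ** (k + 1)) / ((1 - alpha) * (k * cost_ratio + 1))

for k in range(1, 13):
    print(f"K={k:2d}  speedup={speedup(0.7, k, 1 / 70):.2f}x")
# The curve rises quickly, flattens out around K ≈ 5–8, then slowly declines:
# extra draft tokens are rarely accepted but still cost draft forwards.
```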
Key takeaways
- Speculative decoding is a free lunch. Same model, same quality, 2–3× decode throughput. Standard in 2026.
- EAGLE-3 is the current frontier for chat/instruction workloads — ~3× speedup, ~80% acceptance.
- Medusa is simpler but lower-ceiling — good for a quick win on a single model.
- PLD is workload-specific — RAG / summarization / code wins for free, creative tasks gain ~nothing.
- The math is simple, the kernels are not. Verify-K-tokens-in-one-forward needs custom attention kernels; this is why the technique was theoretical for years before vLLM/SGLang/TRT-LLM made it production-ready.
Go deeper
- PaperFast Inference from Transformers via Speculative DecodingOne of the two original papers (the other is Chen et al., 2023). Foundational.
- PaperMedusa: Simple LLM Inference Acceleration FrameworkThe Medusa paper. Multi-head parallel speculation.
- PaperEAGLE-3: Scaling up Inference Acceleration of Large Language ModelsFrontier through 2026. Auto-regressive draft heads with deep training.
- PaperPrompt Lookup DecodingPLD. No draft model needed. Short paper.
- DocsvLLM — speculative decoding docsHow to enable in vLLM.
- BlogFireworks AI — Multi-Token PredictionProduction lessons from running speculative decoding at scale.
- VideoYam Peleg — Speculative Decoding ExplainedBest 25-min visual walkthrough.
- RepoSafeAILab/EAGLEEAGLE-3 reference implementation.