
Speculative Decoding

When you start vLLM with --speculative-model meta-llama/Llama-3.2-1B-Instruct --num-speculative-tokens 5, the engine quietly attaches a tiny draft model alongside your big target model and starts a different decode loop. Instead of “big model produces one token, append, repeat,” it becomes “small model guesses K tokens cheaply, big model checks all K in one shot, accept what it agrees with.” On chat and code workloads this typically doubles or triples decode throughput, and — this is the part that surprises people — it produces samples from exactly the same distribution as running the target model alone. Same model, same quality, more tokens per second.

The reason it works is one fact: LLM decode is memory-bandwidth-bound, not compute-bound. Each generated token reads the entire model from HBM once. The compute for that one forward is barely used — at batch size 1, the streaming multiprocessors mostly sit idle waiting for weights. Speculative decoding spends that idle compute on something useful: verifying several candidate tokens in parallel, in a single forward pass that costs roughly the same as one regular forward (the FFN dominates; attention over K tokens is small). Same memory traffic, more output. As of April 2026 it is standard in vLLM, SGLang, TGI, and TensorRT-LLM, and the choice is no longer “should I turn it on?” but “which variant — vanilla, Medusa, EAGLE-3, or PLD — fits my workload?”
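To see how lopsided the balance is, here is a rough back-of-envelope for one decode step of a 70B model in fp16 at batch size 1. The bandwidth and FLOP figures are illustrative assumptions for a modern datacenter GPU, not measurements, and the single-device framing ignores tensor-parallel sharding:

```python
# Back-of-envelope: why decode at batch size 1 is bandwidth-bound.
# Hardware numbers below are assumptions (roughly an H100-class GPU), not measurements.
params = 70e9                  # 70B parameters
bytes_per_param = 2            # fp16
hbm_bandwidth = 3.3e12         # bytes/s of HBM bandwidth (assumed)
peak_flops = 1.0e15            # fp16 FLOP/s (assumed)

weight_stream_time = params * bytes_per_param / hbm_bandwidth   # read every weight once
compute_time = 2 * params / peak_flops                          # ~2 FLOPs per weight per token

print(f"streaming weights: {weight_stream_time * 1e3:.1f} ms per token")
print(f"matmul compute:    {compute_time * 1e3:.2f} ms per token")
# Streaming dominates by a couple of orders of magnitude; that gap is the idle
# compute that verifying several candidate tokens per weight read gets to use.
```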

TL;DR

  • A small draft model generates $K$ candidate tokens. The big target model verifies all $K$ in a single forward pass. Accepted tokens become output; rejected ones trigger a fallback sample.
  • The math: a single $K$-token target forward is much cheaper than $K$ sequential forwards. As long as the draft is right most of the time, you get 2–3× decode speedup with zero distribution change.
  • EAGLE-3 (2024–2025) is the current frontier — auto-regressive draft heads attached to the target, ~80% acceptance rate, ~3× speedup on common workloads.
  • Medusa uses parallel multi-token prediction heads on the target itself. Simpler than draft-model methods; ~2× speedup.
  • Lookahead / PLD (Prompt Lookup Decoding) needs no draft model — speculates from n-grams in the prompt. Free at inference time; works best on summarization / repetitive output.

Mental model

Draft = cheap. Verify = one big-model forward over a sequence of $K$ tokens — costs roughly the same as one forward over 1 token because attention over $K$ is small relative to the FFN work that dominates.

The math (rejection sampling)

Let $p$ be the target distribution at step $i$ and $q$ be the draft’s distribution. The draft samples $\tilde t_i \sim q$. We accept it with probability $\min(1, p(\tilde t_i) / q(\tilde t_i))$. If rejected, we resample from the residual distribution $\max(0, p - q)$, normalized.

Crucial property: the resulting samples are exactly drawn from $p$. The output distribution is identical to running the target alone. No quality change. This is the property that makes speculative decoding a free lunch rather than a quality tradeoff.

In practice many implementations use greedy (argmax) verification — accept iff the target’s argmax matches the draft’s token. This is exact if you were decoding greedily anyway, but it shifts the output distribution when the target would otherwise sample; it is simpler and faster.
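A minimal sketch of the accept/resample rule for a single position, assuming `p` and `q` are vocabulary-sized probability vectors from the target and the draft (illustrative only, not any particular engine's implementation):

```python
import numpy as np

def verify_one(p: np.ndarray, q: np.ndarray, draft_token: int,
               rng: np.random.Generator) -> tuple[int, bool]:
    """Accept draft_token with probability min(1, p/q); on rejection,
    sample from the normalized residual max(0, p - q)."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()            # renormalize (nonzero whenever p != q)
    return int(rng.choice(len(p), p=residual)), False
```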

The cost model

Per “speculation round” of $K$ draft tokens (Leviathan et al., 2022, §3):

  • $K$ sequential draft forwards producing $K$ tokens: cost $\approx K \cdot c_{\text{draft}}$

  • 1 target forward over $K$ tokens: cost $\approx c_{\text{target}}$ (parallel, big-model FFN dominates)

  • Expected accepted tokens per round (geometric prefix + bonus token from rejection sampling):

    $\bar L = \dfrac{1 - \alpha^{K+1}}{1 - \alpha}$

    where $\alpha$ is the per-position acceptance probability.

Speedup vs. naive decoding $\approx \bar L \cdot c_{\text{target}} / (c_{\text{target}} + K \cdot c_{\text{draft}})$.

For a 70B target with a 1B draft, $\alpha \approx 0.7$, $K = 5$:

  • $\bar L = (1 - 0.7^6) / (1 - 0.7) \approx 3.0$ tokens
  • speedup $\approx 3.0 \cdot c_{70\text{B}} / (c_{70\text{B}} + 5 \cdot c_{1\text{B}}) \approx 2.5\times$

The four families

1. Vanilla speculative decoding (Leviathan et al., Chen et al., 2023)

Pair a small draft model (e.g., 1B) with a big target (e.g., 70B). Both must share the same tokenizer. ~70% acceptance rate; 2× speedup.

2. Medusa (Cai et al., 2024)

Add several small “Medusa heads” on top of the target’s last hidden state. Each head predicts a token at offset $+1, +2, \ldots, +K$ from the current position, in parallel. No separate draft model needed.

Last hidden state h_t → Medusa head 1 → t_{t+1} → Medusa head 2 → t_{t+2} → Medusa head 3 → t_{t+3} ...

Verification: run target on the predicted tree, accept the longest matching path. Simpler than draft models; needs ~1B-token fine-tune for the heads. ~2× speedup, lower than full draft-model methods.
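A deliberately simplified sketch of the head structure (real Medusa heads are small residual MLPs and candidates are verified with tree attention; plain linear heads here, just to show the shape of the idea):

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """K extra heads reading the target's last hidden state; head k proposes
    the token k positions ahead. Simplified: plain linear heads, no tree."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(num_heads))

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: [batch, hidden] -> greedy candidate tokens [batch, K]
        return torch.stack([head(h_t).argmax(-1) for head in self.heads], dim=1)
```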

3. EAGLE-3 (Li et al., 2024 → 2025)

Auto-regressive draft head attached to the target. Each draft step takes the target’s previous hidden state plus the previously predicted token and predicts the next. Sequential within a draft round, but very cheap because it shares the target’s hidden representations.

EAGLE-3 (the 2025 update, with multi-step training) hits ~80% acceptance and 3× speedup on chat / instruction workloads. Currently the frontier as of April 2026.
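A highly simplified sketch of that draft loop; `draft_head` and `embed` are hypothetical stand-ins for the small trained head and the target's token embedding, not the actual EAGLE API:

```python
import torch

@torch.no_grad()
def eagle_style_draft(draft_head, embed, hidden, last_token, k=5):
    """Autoregressively propose k tokens. draft_head(h, e) -> (next_h, logits)
    is assumed to reuse the target's hidden representation; greedy for brevity."""
    proposed = []
    tok = last_token
    for _ in range(k):
        hidden, logits = draft_head(hidden, embed(tok))
        tok = logits.argmax(dim=-1)
        proposed.append(tok)
    return proposed
```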

4. Prompt Lookup Decoding (PLD) / lookahead decoding

No draft model at all. Speculate from n-gram matches in the prompt or recent output. If the model just generated “the FastAPI app” and the prompt contains “the FastAPI app routes requests…”, speculate “routes requests…” from the prompt.

Free at inference time, works best on summarization / RAG / code-completion where output strongly correlates with input. ~1.5–2× speedup on the right workloads, ~1× on creative tasks.
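A minimal sketch of the lookup itself, over a token-ID list holding the prompt plus generated output (real implementations try several n-gram lengths and cap how many tokens they propose):

```python
def prompt_lookup(context: list[int], ngram: int = 3, k: int = 5) -> list[int]:
    """If the last `ngram` tokens appeared earlier in the context, speculate
    the k tokens that followed that earlier occurrence; else propose nothing."""
    if len(context) <= ngram:
        return []
    tail = context[-ngram:]
    # Scan for the most recent earlier occurrence of the tail n-gram.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == tail:
            continuation = context[start + ngram:start + ngram + k]
            if continuation:
                return continuation
    return []
```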

Real numbers — vLLM 0.7 + Llama-3.1-70B + Llama-3.2-1B draft

| Method             | tok/s | Speedup | Acceptance  |
|--------------------|-------|---------|-------------|
| Baseline (no spec) | 38    | 1.0×    | —           |
| Vanilla spec, K=4  | 78    | 2.05×   | 71%         |
| Medusa-2           | 72    | 1.89×   | —           |
| EAGLE-3            | 108   | 2.84×   | 78%         |
| PLD (RAG workload) | 60    | 1.58×   | n-gram dep. |

EAGLE-3 is the winner on chat workloads through 2025–2026 by a clear margin.

Run it in your browser — the math, visualized

Compute expected speedup for spec decoding under various draft acceptance rates.
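A minimal standalone sketch of that calculation, using the cost model above; the draft/target cost ratio is an assumed knob (0.1 here, a rough stand-in for a small draft against a large target):

```python
# Expected speedup from the cost model:
#   speedup = L_bar * c_target / (c_target + K * c_draft),  L_bar = (1 - a^(K+1)) / (1 - a)
# cost_ratio = c_draft / c_target is an assumed knob, not a measurement.

def expected_accepted(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, cost_ratio: float = 0.1) -> float:
    return expected_accepted(alpha, k) / (1 + k * cost_ratio)

for alpha in (0.6, 0.7, 0.8):
    row = "  ".join(f"K={k}: {speedup(alpha, k):.2f}x" for k in range(1, 9))
    best_k = max(range(1, 13), key=lambda k: speedup(alpha, k))
    print(f"alpha={alpha:.1f}  {row}  (peak near K={best_k})")
```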

You’ll see speedup peaks around $K = 4$–$6$ for common acceptance rates. Past that, additional draft tokens are unlikely to be accepted and the draft cost catches up.

Quick check

You're serving a 70B model and the workload is heavy chat with creative open-ended responses. EAGLE-3 gives a 2.5× decode speedup, PLD gives 1.05×. Why is PLD so much worse on this workload?

Key takeaways

  1. Speculative decoding is a free lunch. Same model, same quality, 2–3× decode throughput. Standard in 2026.
  2. EAGLE-3 is the current frontier for chat/instruction workloads — ~3× speedup, ~80% acceptance.
  3. Medusa is simpler but lower-ceiling — good for a quick win on a single model.
  4. PLD is workload-specific — RAG / summarization / code wins for free, creative tasks gain ~nothing.
  5. The math is simple, the kernels are not. Verify-K-tokens-in-one-forward needs custom attention kernels; this is why the technique was theoretical for years before vLLM/SGLang/TRT-LLM made it production-ready.

Go deeper
