
Speculative Decoding

When you start vLLM with --speculative-model meta-llama/Llama-3.2-1B-Instruct --num-speculative-tokens 5, the engine quietly attaches a tiny draft model alongside your big target model and starts a different decode loop. Instead of “big model produces one token, append, repeat,” it becomes “small model guesses K tokens cheaply, big model checks all K in one shot, accept what it agrees with.” On chat and code workloads this typically doubles or triples decode throughput, and — this is the part that surprises people — it produces samples from exactly the same distribution as running the target model alone. Same model, same quality, more tokens per second.

The reason it works is one fact: LLM decode is memory-bandwidth-bound, not compute-bound. Each generated token reads the entire model from HBM once. The compute for that one forward is barely used — at batch size 1, the streaming multiprocessors mostly sit idle waiting for weights. Speculative decoding spends that idle compute on something useful: verifying several candidate tokens in parallel, in a single forward pass that costs roughly the same as one regular forward (the FFN dominates; attention over K tokens is small). Same memory traffic, more output. As of April 2026 it is standard in vLLM, SGLang, TGI, and TensorRT-LLM, and the choice is no longer “should I turn it on?” but “which variant — vanilla, Medusa, EAGLE-3, or PLD — fits my workload?”
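To see how lopsided the balance is, here is a rough back-of-envelope for one decode step of a 70B model in fp16 at batch size 1. The bandwidth and FLOP figures are illustrative assumptions for a modern datacenter GPU, not measurements, and the single-device framing ignores tensor-parallel sharding:

```python
# Back-of-envelope: why decode at batch size 1 is bandwidth-bound.
# Hardware numbers below are assumptions (roughly an H100-class GPU), not measurements.
params = 70e9                  # 70B parameters
bytes_per_param = 2            # fp16
hbm_bandwidth = 3.3e12         # bytes/s of HBM bandwidth (assumed)
peak_flops = 1.0e15            # fp16 FLOP/s (assumed)

weight_stream_time = params * bytes_per_param / hbm_bandwidth   # read every weight once
compute_time = 2 * params / peak_flops                          # ~2 FLOPs per weight per token

print(f"streaming weights: {weight_stream_time * 1e3:.1f} ms per token")
print(f"matmul compute:    {compute_time * 1e3:.2f} ms per token")
# Streaming dominates by a couple of orders of magnitude; that gap is the idle
# compute that verifying several candidate tokens per weight read gets to use.
```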

TL;DR

  • A small draft model generates $K$ candidate tokens. The big target model verifies all $K$ in a single forward pass. Accepted tokens become output; rejected ones trigger a fallback sample.
  • The math: a single $K$-token target forward is much cheaper than $K$ sequential forwards. As long as the draft is right most of the time, you get 2–3× decode speedup with zero distribution change.
  • EAGLE-3 (2024–2025) is the current frontier — auto-regressive draft heads attached to the target, ~80% acceptance rate, ~3× speedup on common workloads.
  • Medusa uses parallel multi-token prediction heads on the target itself. Simpler than draft-model methods; ~2× speedup.
  • Lookahead / PLD (Prompt Lookup Decoding) needs no draft model — speculates from n-grams in the prompt. Free at inference time; works best on summarization / repetitive output.

Mental model

Draft = cheap. Verify = one big-model forward over a sequence of $K$ tokens — costs roughly the same as one forward over 1 token because attention over $K$ is small relative to the FFN work that dominates.

The math (rejection sampling)

Let $p$ be the target distribution at step $i$ and $q$ be the draft’s distribution. The draft samples $\tilde t_i \sim q$. We accept it with probability $\min(1, p(\tilde t_i) / q(\tilde t_i))$. If rejected, we resample from the residual distribution $\max(0, p - q)$, normalized.

Crucial property: the resulting samples are exactly drawn from $p$. The output distribution is identical to running the target alone. No quality change. This is the property that makes speculative decoding a free lunch rather than a quality tradeoff.

In practice many implementations use greedy (argmax) verification — accept iff the target’s argmax matches the draft’s token. This is exact if you were decoding greedily anyway, but it shifts the output distribution when the target would otherwise sample; it is simpler and faster.
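A minimal sketch of the accept/resample rule for a single position, assuming `p` and `q` are vocabulary-sized probability vectors from the target and the draft (illustrative only, not any particular engine's implementation):

```python
import numpy as np

def verify_one(p: np.ndarray, q: np.ndarray, draft_token: int,
               rng: np.random.Generator) -> tuple[int, bool]:
    """Accept draft_token with probability min(1, p/q); on rejection,
    sample from the normalized residual max(0, p - q)."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()            # renormalize (nonzero whenever p != q)
    return int(rng.choice(len(p), p=residual)), False
```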

The cost model

Per “speculation round” of $K$ draft tokens (Leviathan et al., 2022, §3):

  • $K$ sequential draft forwards producing $K$ tokens: cost $\approx K \cdot c_{\text{draft}}$

  • 1 target forward over $K$ tokens: cost $\approx c_{\text{target}}$ (parallel, big-model FFN dominates)

  • Expected accepted tokens per round (geometric prefix + bonus token from rejection sampling):

    $\bar L = \dfrac{1 - \alpha^{K+1}}{1 - \alpha}$

    where $\alpha$ is the per-position acceptance probability.

Speedup vs. naive decoding $\approx \bar L \cdot c_{\text{target}} / (c_{\text{target}} + K \cdot c_{\text{draft}})$.

For a 70B target with a 1B draft, $\alpha \approx 0.7$, $K = 5$:

  • $\bar L = (1 - 0.7^6) / (1 - 0.7) \approx 3.0$ tokens
  • speedup $\approx 3.0 \cdot c_{70\text{B}} / (c_{70\text{B}} + 5 \cdot c_{1\text{B}}) \approx 2.5\times$

The four families

1. Vanilla speculative decoding (Leviathan et al., Chen et al., 2023)

Pair a small draft model (e.g., 1B) with a big target (e.g., 70B). Both must share the same tokenizer. ~70% acceptance rate; 2× speedup.

2. Medusa (Cai et al., 2024)

Add several small “Medusa heads” on top of the target’s last hidden state. Each head predicts a token at offset $+1, +2, \ldots, +K$ from the current position, in parallel. No separate draft model needed.

Last hidden state h_t → Medusa head 1 → t_{t+1} → Medusa head 2 → t_{t+2} → Medusa head 3 → t_{t+3} ...

Verification: run target on the predicted tree, accept the longest matching path. Simpler than draft models; needs ~1B-token fine-tune for the heads. ~2× speedup, lower than full draft-model methods.
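A deliberately simplified sketch of the head structure (real Medusa heads are small residual MLPs and candidates are verified with tree attention; plain linear heads here, just to show the shape of the idea):

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """K extra heads reading the target's last hidden state; head k proposes
    the token k positions ahead. Simplified: plain linear heads, no tree."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(num_heads))

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: [batch, hidden] -> greedy candidate tokens [batch, K]
        return torch.stack([head(h_t).argmax(-1) for head in self.heads], dim=1)
```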

3. EAGLE-3 (Li et al., 2024 → 2025)

Auto-regressive draft head attached to the target. Each draft step takes the target’s previous hidden state plus the previously predicted token and predicts the next. Sequential within a draft round, but very cheap because it shares the target’s hidden representations.

EAGLE-3 (the 2025 update, with multi-step training) hits ~80% acceptance and 3× speedup on chat / instruction workloads. Currently the frontier as of April 2026.
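A highly simplified sketch of that draft loop; `draft_head` and `embed` are hypothetical stand-ins for the small trained head and the target's token embedding, not the actual EAGLE API:

```python
import torch

@torch.no_grad()
def eagle_style_draft(draft_head, embed, hidden, last_token, k=5):
    """Autoregressively propose k tokens. draft_head(h, e) -> (next_h, logits)
    is assumed to reuse the target's hidden representation; greedy for brevity."""
    proposed = []
    tok = last_token
    for _ in range(k):
        hidden, logits = draft_head(hidden, embed(tok))
        tok = logits.argmax(dim=-1)
        proposed.append(tok)
    return proposed
```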

4. Prompt Lookup Decoding (PLD) / lookahead decoding

No draft model at all. Speculate from n-gram matches in the prompt or recent output. If the model just generated “the FastAPI app” and the prompt contains “the FastAPI app routes requests…”, speculate “routes requests…” from the prompt.

Free at inference time, works best on summarization / RAG / code-completion where output strongly correlates with input. ~1.5–2× speedup on the right workloads, ~1× on creative tasks.
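A minimal sketch of the lookup itself, over a token-ID list holding the prompt plus generated output (real implementations try several n-gram lengths and cap how many tokens they propose):

```python
def prompt_lookup(context: list[int], ngram: int = 3, k: int = 5) -> list[int]:
    """If the last `ngram` tokens appeared earlier in the context, speculate
    the k tokens that followed that earlier occurrence; else propose nothing."""
    if len(context) <= ngram:
        return []
    tail = context[-ngram:]
    # Scan for the most recent earlier occurrence of the tail n-gram.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == tail:
            continuation = context[start + ngram:start + ngram + k]
            if continuation:
                return continuation
    return []
```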

Real numbers — vLLM 0.7 + Llama-3.1-70B + Llama-3.2-1B draft

| Method             | tok/s | Speedup | Acceptance  |
|--------------------|-------|---------|-------------|
| Baseline (no spec) | 38    | 1.0×    | —           |
| Vanilla spec, K=4  | 78    | 2.05×   | 71%         |
| Medusa-2           | 72    | 1.89×   | —           |
| EAGLE-3            | 108   | 2.84×   | 78%         |
| PLD (RAG workload) | 60    | 1.58×   | n-gram dep. |

EAGLE-3 is the winner on chat workloads through 2025–2026 by a clear margin.

Run it in your browser — the math, visualized

Compute expected speedup for spec decoding under various draft acceptance rates.
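A minimal standalone sketch of that calculation, using the cost model above; the draft/target cost ratio is an assumed knob (0.1 here, a rough stand-in for a small draft against a large target):

```python
# Expected speedup from the cost model:
#   speedup = L_bar * c_target / (c_target + K * c_draft),  L_bar = (1 - a^(K+1)) / (1 - a)
# cost_ratio = c_draft / c_target is an assumed knob, not a measurement.

def expected_accepted(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, cost_ratio: float = 0.1) -> float:
    return expected_accepted(alpha, k) / (1 + k * cost_ratio)

for alpha in (0.6, 0.7, 0.8):
    row = "  ".join(f"K={k}: {speedup(alpha, k):.2f}x" for k in range(1, 9))
    best_k = max(range(1, 13), key=lambda k: speedup(alpha, k))
    print(f"alpha={alpha:.1f}  {row}  (peak near K={best_k})")
```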

You’ll see speedup peaks around $K = 4$–$6$ for common acceptance rates. Past that, additional draft tokens are unlikely to be accepted and the draft cost catches up.

Quick check

You're serving a 70B model and the workload is heavy chat with creative open-ended responses. EAGLE-3 gives a 2.5× decode speedup, PLD gives 1.05×. Why is PLD so much worse on this workload?

Key takeaways

  1. Speculative decoding is a free lunch. Same model, same quality, 2–3× decode throughput. Standard in 2026.
  2. EAGLE-3 is the current frontier for chat/instruction workloads — ~3× speedup, ~80% acceptance.
  3. Medusa is simpler but lower-ceiling — good for a quick win on a single model.
  4. PLD is workload-specific — RAG / summarization / code wins for free, creative tasks gain ~nothing.
  5. The math is simple, the kernels are not. Verify-K-tokens-in-one-forward needs custom attention kernels; this is why the technique was theoretical for years before vLLM/SGLang/TRT-LLM made it production-ready.

Go deeper
