Speculative Decoding
In a cloud serving stack, decode latency is mostly somebody else’s problem — you pick a tier of GPUs, scale horizontally, and call it a day. The token-generation loop runs faster than your network round-trip can stream the output, and “the model” is usually waiting on you, not the other way around.
On a phone, the bottleneck is structural. A 7B model at Q4 reads ~4 GB of weights from memory for every token. Phone RAM bandwidth is ~50 GB/s. So your peak decode rate is ~12 tok/s, regardless of how clever your kernels are. You’re not compute-bound — you’re memory-bandwidth-bound. Throwing more compute at the problem buys nothing.
Speculative decoding is the technique that breaks this floor without changing the model. The idea: a tiny draft model cheaply guesses the next 4 tokens; the target model verifies all 4 in one forward pass, accepting any prefix that matches its own predictions. Because that one forward pass moves the same 4 GB of weights as decoding 1 token, you get up to 4 tokens for the price of ~1.1. The output distribution is exactly what the target alone would produce — Leviathan et al. (Google, 2022) proved it. It is, somewhat absurdly, free performance.
TL;DR
- Speculative decoding makes an LLM generate tokens 2–4× faster without changing the output. The trick: a small draft model proposes K tokens; the target model verifies all K in one parallel forward pass; the longest accepted prefix is the new context. Tokens are bit-identical to running the target alone.
- The math: instead of one forward per token at the slow model, you do one forward per K candidates — and the target model is bandwidth-bound on a phone, so verifying K tokens costs almost the same as decoding 1.
- Three families: classic draft-target (Llama-3.2-1B → Llama-3.2-7B), self-speculative (Medusa, EAGLE — the model has additional heads that draft from itself), and lookahead (no draft model; an n-gram cache predicts).
- On a phone the typical speedup is 1.8–2.5× for matched-family pairs (1B drafting 7B), with acceptance rate 60–80% on chat workloads. Higher on code (where output is predictable), lower on creative writing.
- llama.cpp ships speculative decoding as a first-class API (llama_speculative_*); it’s a few lines of code on top of an existing setup. Outputs are bit-identical to non-speculative.
Why this matters
A 7B model on a phone runs at ~5 tok/s — readable but slow. Speculative decoding bumps that to ~10 tok/s without changing the model, the quality, or the memory budget meaningfully. It’s the only way to get real-time 7B chat on phones in 2026 that doesn’t require a smaller model.
The 2026 reality: every serious local-LLM runtime (llama.cpp, MLX, Ollama, vLLM-on-Apple-Silicon) has speculative decoding. The technique is mature, the bit-identical guarantee makes it production-safe, and the speedup is consistent. Not knowing how this works in 2026 is like not knowing how KV-caching works.
Mental model
Per-step bookkeeping:
- The draft model proposes K=4–8 tokens autoregressively (cheap; small model is fast).
- The target model runs one forward pass over all K. This is the critical insight: a single forward pass over K tokens costs nearly the same as a single forward over 1 token, because it’s memory-bandwidth-bound, not FLOP-bound.
- For each draft token, compare the target’s prediction at that position against the draft’s pick (argmax at temperature 0; probabilistic rejection sampling otherwise). The first divergence is where you stop accepting.
- The target’s distribution at the divergence position is used to sample the replacement token — preserving the exact distribution as if you’d run the target alone.
The math (Leviathan et al., Google 2022) proves the output distribution is identical to running the target alone. Not “approximately the same” — exactly the same, and at temperature 0 the token stream is bit-identical.
Why one target forward over K tokens is “free”
A transformer forward pass moves all the weights from RAM to GPU/CPU caches once per pass. The compute (FLOPs) scales with the sequence length and is small relative to the memory cost on a phone.
Roughly:
| Phase | Memory traffic | Compute |
|---|---|---|
| 1 token decode at 7B Q4 | ~4 GB read | ~7 G FLOPs |
| 4 token verify at 7B Q4 | ~4 GB read | ~28 G FLOPs |
The 4-token verify reads the same 4 GB of weights — just runs them on a 4× longer sequence. On phone hardware with low FLOPs/byte ratios, this is barely slower than decoding 1 token. You get 4 tokens of inference for ~1.1× the cost of 1 token.
If even 60% of those 4 tokens get accepted (typical for chat), that’s 2.4 tokens output per target call — ~2× speedup.
Acceptance rate is the whole game
The expected speedup is:

Speedup ≈ (1 − α^(K+1)) / ((1 − α) · (cK + 1))

where α is the per-position acceptance probability and c is the draft-to-target cost ratio.
Practical numbers:
- Llama-3.2-1B drafting Llama-3.2-7B: on chat, α ≈ 0.75.
- K=4: speedup ≈ 2.0×
- K=8: speedup ≈ 2.4× (diminishing returns; long drafts get rejected more)
- Code (predictable next tokens): higher α → 2.5–3.0×.
- Creative writing: lower α → 1.4–1.6×.
Picking a draft model
Three rules:
- Same family. Llama draft for Llama target; Phi draft for Phi target; Qwen draft for Qwen. Different families have different tokenizers, different prior distributions — acceptance rate craters.
- 5–10% of target params. Llama-3.2-1B for Llama-3-7B (14%), Llama-3.2-3B for Llama-3-70B (4%). Smaller draft = faster but lower acceptance; bigger draft = higher acceptance but draft cost eats the savings.
- Same finetune. If your target is Llama-3-7B-Instruct, your draft should be Llama-3.2-1B-Instruct, not the base model. The instruction tuning shifts the output distribution; matching it bumps α by 0.1–0.15.
Self-speculative: Medusa and EAGLE
What if you don’t want to ship two models?
Medusa (Cai et al., 2024) adds extra heads to the target model that predict tokens N+2, N+3, N+4. Each head is small (~0.5% of model params). Train the heads briefly on the target’s own outputs. Now the target itself drafts.
EAGLE-2 (Li et al., 2024) goes further: a small auto-regressive head that uses the target’s hidden states to draft, achieving 80%+ acceptance on chat. EAGLE-3 (Li et al., 2025) refines the recipe with multi-step training and dynamic draft trees, pushing acceptance rates higher and achieving 3–4× speedup with no separate draft model.
Both add ~3–10% to model size and require fine-tuning of the draft heads (a few hours on a single GPU). With strict rejection-sampling verification the output matches vanilla decoding; Medusa’s faster “typical acceptance” mode trades that guarantee for extra speed.
For phone deployment, EAGLE-2 / EAGLE-3 are the right pick when memory is tight (one model, not two). Use a separate draft model when memory is comfortable and you can pick a high-quality off-the-shelf small model.
Lookahead — no draft model at all
Lookahead decoding (Fu et al., 2024) uses an n-gram cache populated from the target model’s own past output. No draft model, no extra training. Works best when the output has internal repetition — citations, code with repeated identifier patterns, structured text. On free-form chat the speedup is modest (1.2–1.5×); on code it can hit 2× without any extra weights.
Use lookahead when you genuinely cannot afford to ship a second model or train heads. It’s the minimum-effort variant.
llama.cpp speculative API
The host-side glue is C — that’s what runs on the phone. The verifier kernel is the same matmul as ordinary decode (CUDA/Metal/Vulkan, depending on backend), just over K rows instead of 1:
```c
// Load both target and draft. (API names follow llama.cpp's speculative
// example; check the headers of your version, as this API has moved.)
llama_model* target_model = llama_model_load_from_file("Llama-3-7B-Instruct.Q4_K_M.gguf", ...);
llama_model* draft_model  = llama_model_load_from_file("Llama-3.2-1B-Instruct.Q4_K_M.gguf", ...);
llama_context* target_ctx = llama_new_context_with_model(target_model, ...);
llama_context* draft_ctx  = llama_new_context_with_model(draft_model, ...);

// Run speculative decoding (the API does the draft/verify loop internally).
struct llama_speculative_params spec_p = llama_speculative_default_params();
spec_p.n_draft  = 4;    // K (number of draft tokens)
spec_p.p_accept = 0.4f; // sampling acceptance threshold

llama_speculative* spec = llama_speculative_init(target_ctx, draft_ctx, spec_p);

int n_generated = 0;
while (n_generated < n_predict) {
    llama_token tok = llama_speculative_next(spec, /* sampling params */);
    if (tok == EOS) break;
    print_token(tok);
    n_generated++;
}
```

The bit-identicalness is verified — llama.cpp has unit tests asserting that speculative output matches non-speculative output character-for-character at temperature 0.
The verifier kernel itself, when targeting NVIDIA, is a CUDA C++ matmul over K rows (the draft tokens) computed in parallel:
```cuda
// Sketch: target-model verifier kernel (CUDA C++).
// blockIdx.x picks one of the K draft positions; blockIdx.y and the
// threads tile the vocabulary. Each thread computes one logit as a
// dot product of a hidden state against one row of the output matrix.
__global__ void verify_logits(const half* __restrict__ hidden_states, // [K, D]
                              const half* __restrict__ output_W,      // [V, D]
                              half* __restrict__ logits,              // [K, V]
                              int K, int D, int V) {
    int k = blockIdx.x;                             // which of K draft positions
    int v = blockIdx.y * blockDim.x + threadIdx.x;  // which vocab entry
    if (v >= V) return;
    float acc = 0.0f;
    for (int d = 0; d < D; d++) {
        acc += __half2float(hidden_states[k * D + d]) * __half2float(output_W[v * D + d]);
    }
    logits[k * V + v] = __float2half(acc);
}
```

That’s the verification step in one tile. The point is the same as the table above: on bandwidth-bound hardware, dispatching across K rows costs almost what dispatching across 1 row does — the weights only get read once.
When NOT to use speculative decoding
- Temperature over ~1.0: the target’s high-temp distribution is broad enough that draft predictions almost never match. Acceptance rate plummets.
- Beam search: speculative is incompatible with beam search (it’s a sampling-side optimization).
- You’re memory-pinned: shipping a second 1B model costs ~600 MB of phone RAM. If you’re already squeezing the 7B in, dropping to 3B is the better fix.
- Output is short (single-sentence answers): the per-step drafting overhead may exceed the savings over so few tokens.
For chat-style multi-turn 256+ token responses at temperature 0–1, speculative is essentially always a win.
Run it in your browser
A useful exercise: simulate the speculative-decoding speedup math. Tweak the acceptance rate and K to see when speculation wins and when it doesn’t.
The math makes K selection concrete: pick K from your workload’s acceptance rate rather than by blind tuning.
Key takeaways
- Speculative decoding makes a 7B feel 2× faster on a phone with bit-identical output. Free performance.
- The math works because target verification of K tokens costs ~1.1× a single token on phone hardware (memory-bandwidth-bound).
- Acceptance rate (α) is the whole game — same family + same fine-tune + 5–10% draft size gets you α ≈ 0.75, i.e. ~2× speedup.
- K=4–6 is the chat sweet spot; K=6–8 for code; K=4 or skip for creative writing.
- Self-speculative variants (Medusa, EAGLE-2/EAGLE-3) ship one model with extra heads instead of two models — better when memory is tight.
- Lookahead is the no-draft, no-training fallback; modest speedup, zero overhead.
- llama.cpp ships this natively (llama_speculative_*); it’s a few lines on top of an existing chat app.
Go deeper
- Paper: Fast Inference from Transformers via Speculative Decoding — the original paper. The bit-identical proof is the load-bearing math.
- Paper: Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — self-speculative; extra heads on the target model. The “no separate draft model” path.
- Paper: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees — the dynamic-draft-tree refinement; widely deployed.
- Paper: EAGLE-3: Scaling up Inference Acceleration of Large Language Models — current frontier; multi-step training, 80%+ acceptance, ~3× speedup on production workloads.
- Paper: Lookahead Decoding: A Parallel Decoding Algorithm — no draft model; reads the target model’s own n-gram cache.
- Blog: Speculative Decoding for 2× Faster Whisper Inference — the same trick applied to Whisper; speech recognition is just another transformer.
- Docs: llama.cpp speculative example — the reference implementation in C. ~300 lines, the canonical pattern.
- Video: Speculative Decoding: Practical Implementation — walks the algorithm on a real Llama setup; shows the acceptance-rate behavior in production.