Speculative Decoding Internals
The concept-level speculative decoding lesson explained the trick: a small draft proposes K tokens, the big model verifies them in a single forward pass, accepted tokens become output, and rejection sampling preserves the target distribution exactly. That’s a clean math story. The kernel and serving story is messier — a “single forward pass over K tokens” is straightforward to write down and far less straightforward to make fast inside a continuous-batching engine where most other requests are doing decode-1, the batch composition changes every step, and the kernel has to handle a tree of candidate paths when methods like Medusa or EAGLE-3 propose multiple options per position.
This lesson is the layer where speculative decoding goes from theoretical to production: the verifier kernel’s batching, tree attention for branching candidates, the EAGLE-3 speculator architecture, the acceptance-rate engineering that turns 60% into 80%, and the often-overlooked regime where spec decoding actively hurts — high-batch throughput workloads where the verify cost per accepted token approaches the baseline. After this you should be able to read a spec-decoding PR in vLLM or SGLang, predict its acceptance rate from the draft architecture, and know whether a given workload will gain or lose under spec decoding before benchmarking.
TL;DR
- The verifier kernel is not “K sequential decodes in parallel.” It is a single forward pass over K query tokens that all attend to the prefix KV cache plus each other (with a triangular causal mask among the K). Cost ≈ one decode forward, output is K logit distributions.
- Tree attention generalizes the verifier to multi-candidate per position. Medusa proposes a top-k tree of options; the kernel attends with a tree mask (sparse, non-contiguous) and accepts the longest matching path. The kernel is harder to write and is the main reason Medusa was slower to land in production than vanilla spec decoding.
- EAGLE-3 is the 2024–2025 frontier: a tiny auto-regressive draft head that takes the target’s hidden state plus the previous predicted token. ~80% acceptance vs ~70% for vanilla; the gain over EAGLE-2 comes from multi-step training rather than from a new inference-time architecture.
- Acceptance rate α has 4 levers: draft quality (the biggest), temperature (low T → high α — but lower entropy hurts diverse outputs), K (more drafts per round → diminishing returns past K=5–6), workload distribution shift (production logs vs benchmark suite).
- Spec decoding can hurt at high batch. When the engine is already at 70%+ HBM bandwidth without spec decoding, the verifier’s K-token forward is no longer “almost free”: each request occupies K query positions per step but emits only the accepted ones, so the rejected positions are pure overhead once the GPU has no idle compute to absorb them. At batch 64+, throughput often regresses. The decision is workload-dependent; benchmark before enabling.
The concept, in plain English
A naive “verify K tokens in parallel” reading suggests running K separate forward passes simultaneously. That’s not what happens. The verifier kernel takes one query tensor of shape [K, hidden] (the K candidate tokens), one KV cache of the prefix (everything before the candidates), and runs a single forward pass that internally lets each query position attend to the prefix and to earlier candidates. The output is K logit vectors. The K queries share the prefix KV reads — the headline savings — and the math of attention over K queries is small relative to the FFN work that dominates at small K.
This is why spec decoding’s “free verify” claim is true at small batch and gets less true as batch grows. At batch 1, the FFN reads the whole model from HBM exactly once per step whether you have 1 or 8 query positions; the marginal cost of the extra 7 queries is just FLOPs the SMs can do during the same HBM read. At batch 64, the FFN is already running at high HBM utilization for 64 query positions; expanding that to 64 × K query positions is no longer free. The K-fold expansion is the actual cost the math forgets.
Mental model — the verifier kernel
Three things to take away about the verifier kernel:
- One forward pass, K queries. Not K passes. The K queries share the prefix KV reads.
- Causal among the K queries. Position 2’s query attends to prefix + queries 0 and 1, not to query 3. This requires the attention mask to be constructed correctly inside the kernel.
- Output is K logit vectors. Not “the answer” — rejection sampling decides how many of the K to accept based on the logit distributions.
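The acceptance step that consumes those distributions can be sketched in a few lines. This is a minimal illustration of standard speculative sampling, not any engine's code: the names `verify_chain` and `_sample` are hypothetical, distributions are plain Python lists over a toy vocabulary, and it assumes the verifier also returns the distribution for the position after the last draft (K+1 distributions in total), which is what allows a bonus token when every draft is accepted.

```python
import random

def verify_chain(draft_tokens, draft_probs, target_probs, rng=random):
    """Speculative-sampling acceptance for one chain of K drafted tokens.

    draft_tokens: K token ids proposed by the draft.
    draft_probs:  K distributions q_i (lists over the vocab) the draft sampled from.
    target_probs: K+1 distributions p_i from the verifier; the extra one is
                  used for the bonus token when every draft is accepted.
    Returns the tokens to emit this round (between 1 and K+1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop the round.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(_sample(residual, rng))
        return out
    # All K accepted: sample one bonus token from the last target distribution.
    out.append(_sample(target_probs[len(draft_tokens)], rng))
    return out

def _sample(weights, rng):
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Resampling from the residual on rejection is what keeps the emitted tokens distributed exactly as the target would have sampled them.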
For tree-attention methods (Medusa, EAGLE-3 with tree expansion), the K candidates form a tree not a chain: position 1 has multiple candidate tokens, each leading to a different position-2 candidate. The mask becomes sparse and non-contiguous. Implementing tree attention in a single kernel was a 2024 milestone; before that, multi-candidate spec decoding was emulated with multiple forward passes and was rarely a win.
The verifier in continuous batching
A serving engine running spec decoding at scale isn’t just doing the K-token verify in isolation. It’s doing it in the middle of a continuous batch that also contains:
- Other requests doing decode-1 (no spec decoding)
- Other requests doing K-token spec verify with their own draft proposals
- Newly-admitted requests doing prefill (any number of tokens)
The kernel contract has to handle all of these in one forward pass. The query tensor is now:
    total_queries = sum over all requests of (1 if decode, K if spec verify, prefill_len if prefill)
Per-request metadata has to tell the attention kernel:
- Where each request’s query sub-block starts (CSR-style index)
- The KV cache pointers (block table)
- The local attention mask within the request’s queries (tree or chain or causal)
vLLM’s V1 engine and SGLang both support this natively. The metadata struct is roughly:
    class ForwardBatch:
        query_lens: List[int]              # per request: 1, K, or prefill_len
        query_start_locs: List[int]        # CSR prefix sum
        seq_lens: List[int]                # cache length per request (incl. spec K)
        block_tables: List[List[int]]      # KV cache pointers per request
        spec_metadata: Optional[SpecInfo]  # tree masks, draft proposals
        is_spec: List[bool]                # per request: spec verify or normal decode
The attention backend reads this and dispatches to the right kernel: chain attention for vanilla spec verify, tree attention for Medusa/EAGLE-3 with branching. The fact that one batch can mix all of these is what makes spec decoding production-viable; before this metadata existed, you’d run spec-decoding requests in a separate batch and lose the latency benefit.
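To make the CSR-style indexing concrete, here is a small sketch of how query_lens and query_start_locs could be derived for a mixed batch. The function name `build_query_layout` and the tuple encoding of requests are assumptions for illustration only; neither engine represents requests this way.

```python
from itertools import accumulate

def build_query_layout(requests, k):
    """requests: list of ("decode",), ("spec",) or ("prefill", n) tuples.

    Returns (query_lens, query_start_locs), where query_start_locs is the
    prefix sum of query_lens with a leading 0, CSR style.
    """
    query_lens = []
    for req in requests:
        kind = req[0]
        if kind == "decode":
            query_lens.append(1)
        elif kind == "spec":
            query_lens.append(k)
        elif kind == "prefill":
            query_lens.append(req[1])
        else:
            raise ValueError(f"unknown request kind: {kind}")
    query_start_locs = [0, *accumulate(query_lens)]
    return query_lens, query_start_locs

lens, starts = build_query_layout(
    [("decode",), ("spec",), ("prefill", 7), ("spec",)], k=5
)
print(lens)    # [1, 5, 7, 5]
print(starts)  # [0, 1, 6, 13, 18]
```

Request i owns rows starts[i] through starts[i + 1] - 1 of the flattened query tensor, and the last entry equals total_queries.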
Tree attention — the multi-candidate generalization
Vanilla spec decoding proposes K consecutive candidates: a chain. Medusa proposes K parallel candidates per position: a tree. EAGLE-3 with tree expansion proposes a tree with branching factor B at each of K positions, giving up to B^K candidate paths.
A 4-level tree with branching factor 2:
                    [position 0: t_a]
                     /             \
        [position 1: t_b]     [position 1: t_c]
           /         \           /        \
    [pos 2: t_d] [pos 2: t_e]  ...        ...
With a single root as drawn, four levels give up to 2^3 = 8 root-to-leaf paths (2^4 = 16 if position 0 also branches two ways). All of them are verifiable in one forward pass, but only the longest accepted path is emitted. The win: you get to pick the best path of many, raising effective acceptance rate.
The kernel cost: the attention mask is now a tree mask. Position-2 candidate t_d attends to its parent path (root → t_a → t_b) but NOT to sibling subtrees (t_b’s other children, t_c’s subtree). This is non-contiguous attention — every query has a different attention scope.
Mask for the tree above, first five candidates shown (1 = attend, 0 = mask out):
              prefix  t_a  t_b  t_c  t_d  t_e
    prefix:   causal   0    0    0    0    0
    t_a:      all      1    0    0    0    0
    t_b:      all      1    1    0    0    0
    t_c:      all      1    0    1    0    0
    t_d:      all      1    1    0    1    0
    t_e:      all      1    1    0    0    1
    ...
Implementations:
- vLLM: tree-attention path in vllm/v1/attention/backends/triton.py and FlashInfer integration. Tree size is bounded; common config is K=5 with branching factor 2 (≤32 paths).
- SGLang: similar tree-attention support; integrates with EAGLE-3 reference implementation.
- TRT-LLM: tree attention via Medusa plugin; their kernel uses warp-specialized tree mask handling.
- FlashInfer: the Hopper-tuned attention library that vLLM/SGLang use for fast tree attention. Reading its source is the best path for kernel-level work here.
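Once the tree has been verified, picking the path to emit is a walk from the root. The sketch below covers only the greedy (T=0) case, where a candidate is accepted when it equals the target's argmax; the sampled case would apply rejection sampling at each node instead. The function name and the parent-pointer encoding are assumptions for illustration, and sibling candidates are assumed to carry distinct tokens.

```python
def longest_accepted_path(parents, tokens, target_next, root_target):
    """Greedy (T=0) acceptance over a candidate tree.

    parents[i]:     index of node i's parent, or -1 for a first-level candidate.
    tokens[i]:      token id proposed at node i.
    target_next[i]: the target's argmax token after consuming node i,
                    read from that node's row of the verifier logits.
    root_target:    the target's argmax token after the prefix alone.
    Returns the indices of the accepted nodes, root-most first.
    """
    children = {}
    for i, p in enumerate(parents):
        children.setdefault(p, []).append(i)

    path, node, expected = [], -1, root_target
    while True:
        # At most one child can carry the token the target would have chosen.
        match = next(
            (c for c in children.get(node, []) if tokens[c] == expected), None
        )
        if match is None:
            return path
        path.append(match)
        node, expected = match, target_next[match]

# Tree: node 0 is the root candidate, nodes 1 and 2 are its children,
# nodes 3 and 4 are children of node 1.
parents = [-1, 0, 0, 1, 1]
tokens = [10, 11, 12, 13, 14]
target_next = [12, 13, 99, 0, 0]
print(longest_accepted_path(parents, tokens, target_next, root_target=10))  # [0, 2]
```

In the example the target agrees with the root (token 10), then prefers token 12, so the walk follows node 2 and stops there because node 2 has no children.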
EAGLE-3 — the architecture that won 2024–2025
EAGLE-3 (Li et al., March 2025) is the current frontier for chat workloads. The architecture in three sentences:
- The draft is a single transformer layer added on top of the target model’s last hidden state.
- Each draft step takes the target’s hidden state at position t and the previously-drafted token, predicts position t+1’s token. Auto-regressive within a draft round.
- Multi-step training — the draft is trained on multi-token prediction with the target’s hidden states as supervision, not just next-token cross-entropy on text.
The key fact: EAGLE-3’s draft sees the target’s hidden state. That’s much higher-fidelity input than a separate small model that has to re-encode the prefix from scratch. The result is dramatically higher acceptance — ~80% on chat workloads vs ~70% for vanilla spec with a small standalone draft.
Implementation details that matter:
- The draft adds ~5% of target FLOPs per step. Cheap.
- Training requires target hidden states as labels, generated offline (one pass through training data).
- The draft can be combined with tree expansion — produce top-2 at each step, get a 2^K tree, and pick the best path. Adds another ~5–10% acceptance rate.
- EAGLE-3’s main win over EAGLE-2 is multi-step training — training the draft to predict positions t+1, t+2, t+3 jointly rather than just t+1. This corrects the auto-regressive distribution drift over a draft round.
For contribution: EAGLE-3 reference code is at SafeAILab/EAGLE on GitHub. Adding EAGLE-3 support for a new model architecture is a tractable PR target — copy the reference draft head, fine-tune on your target, integrate with vLLM/SGLang’s spec-decoding hooks.
Acceptance-rate engineering — the four levers
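Per-position acceptance probability α is the single number that controls expected speedup. Assuming each drafted token is accepted independently with probability α, the expected number of tokens emitted per round is a geometric sum (it includes the one token the target always contributes):
    expected_accepted = (1 - α^(K+1)) / (1 - α)
What moves α in production: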
Lever 1: Draft model quality (biggest)
A smarter draft is more often right. EAGLE-3 (~80%) beats vanilla 1B-draft (~70%) almost entirely because EAGLE-3’s draft sees the target’s hidden state. Within a family:
- 1B vanilla draft for 70B target: α ≈ 0.65–0.75
- 8B vanilla draft for 70B target: α ≈ 0.75–0.85, but draft cost rises to ~12% of target — speedup actually drops
- EAGLE-3 (target’s own hidden state): α ≈ 0.78–0.85, draft cost ~5% — best of both
- Medusa heads: α per-head ≈ 0.6–0.7, but parallel so no draft serial cost — wins on simple workloads
The rule: pick the draft that maximizes α / draft_cost, not raw α. EAGLE-3 wins because its draft cost is small.
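A rough way to compare drafts is to divide expected tokens per round by the cost of the round. The sketch below uses the geometric expected-length formula from this section and two loudly stated assumptions: draft_cost is read as the cost of one draft step relative to one target forward, and the verify pass is taken to cost exactly one target forward. The α and cost values are midpoints of the ranges quoted above, not measurements.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per round: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def speedup(alpha, k, draft_cost):
    """Tokens per round divided by round cost, in units of one target forward.

    Round cost = k draft steps at draft_cost each, plus one verify forward.
    """
    return expected_tokens(alpha, k) / (1.0 + k * draft_cost)

candidates = {
    "8B vanilla draft": (0.80, 0.12),  # (alpha, draft cost per step)
    "EAGLE-3 head": (0.80, 0.05),
}
for name, (alpha, cost) in candidates.items():
    print(f"{name}: {speedup(alpha, 5, cost):.2f}x")
# 8B vanilla draft: 2.31x
# EAGLE-3 head: 2.95x
```

At equal α the cheaper draft wins outright, which is the point of optimizing α against draft cost rather than α alone.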
Lever 2: Temperature
At T=0 (greedy), α is highest because both target and draft converge to the same argmax tokens. At T=1.0 (sampling), α drops 5–15% because the target samples different tokens than the draft proposes. At T=2.0 (high-creativity), α drops further.
This matters for production: chat with temperature 0.7 has lower α than the benchmark numbers (which often use T=0). Real chat speedups on EAGLE-3 are 2.3–2.7×, not the 3× the paper reports.
Lever 3: K (draft length per round)
Longer drafts mean more candidates verified per forward, but the further you draft the lower α^K becomes. Sweet spot is K=4–6 for α ≈ 0.7. Beyond K=6, the K-th token has acceptance probability α^K ≈ 12% — barely worth the verifier’s extra column.
    K=2,  α=0.7: expected accepted = 2.19, speedup ≈ 1.4×
    K=4,  α=0.7: expected accepted = 2.77, speedup ≈ 2.3×
    K=6,  α=0.7: expected accepted = 3.06, speedup ≈ 2.6×
    K=8,  α=0.7: expected accepted = 3.20, speedup ≈ 2.7×
    K=10, α=0.7: expected accepted = 3.27, speedup ≈ 2.65× (verifier cost catches up)
vLLM defaults to K=5; SGLang defaults to K=4. Both expose the knob; the right K is workload-dependent.
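The diminishing returns are easy to see by evaluating the geometric formula from this section term by term. This is a sketch under the same independence assumption as the formula; it says nothing about draft or verifier cost, so it shows expected tokens per round only, not speedup.

```python
def expected_tokens(alpha, k):
    # Sum of alpha^0 .. alpha^k: one guaranteed token plus the probability
    # that each successive drafted token is reached and accepted.
    return sum(alpha ** i for i in range(k + 1))

alpha = 0.7
for k in (2, 4, 6, 8, 10):
    reach = alpha ** k  # probability that the k-th drafted token is accepted
    print(f"K={k:2d}  expected tokens/round={expected_tokens(alpha, k):.2f}  "
          f"k-th draft accepted with p={reach:.3f}")
```

The sum can never exceed 1 / (1 - α), about 3.33 for α = 0.7, so each additional draft position buys less than the one before it.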
Lever 4: Workload distribution shift
A draft trained on filtered web text gets ~80% acceptance on a benchmark that uses similar text. Real production traffic — code in 12 languages, tool-calling JSON, multi-turn chat in Spanish, structured-output prompts — sees α drop 5–15 points. The benchmarks lie. Always re-measure on production logs after enabling.
When spec decoding hurts — the high-batch regime
The “free verify” property holds when the target is HBM-bound at the batch you’re serving. At batch 1, the model reads its weights once per step to produce a single token; the verifier’s K queries are along for the ride. At batch 64, that same weight read is already amortized over 64 tokens and the GPU has little idle compute left; the K-fold expansion of the query tensor is no longer free.
Specifically, spec decoding adds three costs at high batch:
- K× larger attention queries: the FFN dominates at small batch but attention scales with batch × query_length. At batch 64 and K=5, the verifier processes 320 query positions per step, 5× the 64 of a non-spec decode batch.
- Reduced effective batch: each request occupies K query positions per step, but only the accepted ones become output. If you served 64 independent requests at 100% throughput pre-spec, you serve 64 × expected_accepted / K requests post-spec. If expected_accepted is 3.0 and K is 5, that’s 38 effective requests.
- Variance: some requests accept all K, some accept 0. The batch advances at the slowest rate. Scheduling has to handle the variance.
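The effective-batch arithmetic in the list above can be checked directly. The helper name is hypothetical; the numbers are the ones used in the text.

```python
def verify_slot_efficiency(batch, k, expected_accepted):
    """Fraction of verifier query slots that turn into emitted tokens."""
    slots = batch * k                    # query positions processed per step
    emitted = batch * expected_accepted  # tokens that survive verification
    return slots, emitted, emitted / slots

slots, emitted, eff = verify_slot_efficiency(batch=64, k=5, expected_accepted=3.0)
print(slots, emitted, eff)   # 320 192.0 0.6
print(round(64 * eff, 1))    # 38.4 "effective requests"
```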
The result: at high batch (~64+), spec decoding often regresses throughput by 10–30%. The break-even batch depends on draft cost and α; rough rules of thumb as of 2026:
| Workload (throughput vs. no spec decoding) | Batch ≤ 8 | Batch 16 | Batch 32 | Batch 64+ |
|---|---|---|---|---|
| EAGLE-3 chat | 2.5× | 1.8× | 1.2× | 0.8–0.9× |
| Vanilla 1B draft | 2.0× | 1.4× | 1.0× | 0.7–0.85× |
| Medusa | 1.8× | 1.3× | 0.95× | 0.75× |
| PLD on RAG | 1.6× | 1.4× | 1.1× | 0.95× |
The implication: enable spec decoding for low-batch latency-sensitive serving; disable it for high-throughput batched inference. Production deployments often run two pools — a low-batch pool with spec decoding for chat, a high-batch pool without for batch jobs.
Production integration — vLLM and SGLang specifics
vLLM
    from vllm import LLM

    # Vanilla spec decoding with a separate draft model.
    # Recent vLLM releases take these options in a speculative_config dict;
    # older releases used top-level keyword arguments, so check your version.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        speculative_config={
            "model": "meta-llama/Llama-3.2-1B-Instruct",
            "num_speculative_tokens": 5,
        },
    )

    # EAGLE-3 (with EAGLE checkpoint)
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        speculative_config={
            "method": "eagle3",
            "model": "yuhuili/EAGLE-LLaMA3-70B",
            "num_speculative_tokens": 5,
        },
    )
vLLM’s spec-decoding code lives in:
- vllm/spec_decode/ (V0 engine)
- vllm/v1/spec_decode/ (V1 engine, the live target)
Key files: eagle.py, medusa.py, mlp_speculator.py, ngram.py (PLD). Each implements a SpeculativeProposer that produces candidate tokens, plus a SpeculativeVerifier that runs the verify-K-tokens forward and applies rejection sampling. Adding a new spec-decoding method = implement these two interfaces.
SGLang
    import sglang as sgl

    runtime = sgl.Runtime(
        model_path="meta-llama/Llama-3.1-70B-Instruct",
        speculative_algorithm="EAGLE3",
        speculative_draft_model_path="yuhuili/EAGLE-LLaMA3-70B",
        speculative_num_steps=5,
        speculative_eagle_topk=8,         # tree branching factor at each step
        speculative_num_draft_tokens=64,  # total tree size
    )
SGLang’s tree-attention path is exposed more directly than vLLM’s. The speculative_eagle_topk and speculative_num_draft_tokens knobs control tree shape; tuning these against your workload’s α distribution is a measurable ~5–10% perf lever.
Code lives in python/sglang/srt/speculative/. The eagle_worker.py and eagle_utils.py files are the main contribution surface.
The contribution surface
For Year-1 OSS work, spec decoding has three high-leverage targets:
- New spec methods: a new draft architecture (e.g., the next post-EAGLE-3 paper), a new tree expansion strategy, a new way to combine spec decoding with quantization. Each is a 200–500 LOC PR.
- Per-architecture EAGLE support: the reference EAGLE training and integration are Llama-shaped. Porting to Qwen, DeepSeek, Mistral, Gemma is a tractable PR — copy the existing pattern, train the draft (3–5 hours on 1 H100), test acceptance.
- Tree-attention kernel improvements: FlashInfer’s tree mask is general; specialized variants for fixed tree shapes can be 20–40% faster. Triton implementations of branching-factor-2 trees are an active area.
The PRs that get cited by maintainers are the ones with concrete acceptance-rate measurements on a real workload. Always ship the production-traffic-shaped benchmark, not just the paper-replicated number.
Predicted speedup, including the high-batch regime
A rough speedup model puts EAGLE-3 at ~2.7× at batch 1, dropping to ~1.5× at batch 16 and crossing 1.0× somewhere around batch 32–64, the regime where spec decoding stops helping. PLD has a smaller peak but holds longer because there’s no draft cost. The model is rough but captures the qualitative shape that production benchmarks tend to reproduce.
Key takeaways
- The verifier is one forward pass over K queries, not K passes. Cost is dominated by the FFN; attention over K is cheap until batch saturates HBM.
- Tree attention generalizes the verifier to multi-candidate paths. Sparse, non-contiguous masks; FlashInfer’s tree-mask kernel is the production reference.
- EAGLE-3 wins by giving the draft access to the target’s hidden state. ~80% acceptance, ~5% draft cost. The architectural insight is the win.
- Acceptance rate has four levers: draft quality (biggest), temperature, K, workload distribution. Production α is always 5–15 points lower than benchmark α.
- Spec decoding hurts at high batch. K-fold query expansion stops being free once the GPU is HBM-saturated. Run two pools — spec-on for latency, spec-off for throughput.
Go deeper
- Paper: EAGLE-3: Scaling up Inference Acceleration of Large Language Models. The frontier paper through 2026. Read sections 3 (architecture) and 4 (multi-step training) carefully.
- Repo: SafeAILab/EAGLE (reference implementation). The training and inference code. Read draft.py and eagle_utils.py before any EAGLE PR.
- Paper: SpecInfer: Accelerating Generative LLM Serving with Tree-based Speculative Inference. The paper that introduced production-grade tree attention for spec decoding. The kernel reasoning is durable.
- Repo: FlashInfer, the Hopper-tuned attention library. The kernel layer vLLM and SGLang both use. Read tree_attention/ for the production kernel implementation.
- Docs: vLLM Speculative Decoding Configuration. The user-facing knobs and their defaults.
- Repo: vLLM V1 Spec-Decoding Source. The implementation. eagle.py, medusa.py, mlp_speculator.py, ngram.py: one file per method.
- Blog: Fireworks AI, Multi-Token Prediction in Production. Production lessons from running spec decoding at scale, including the high-batch regression problem.
- Paper: Medusa: Simple LLM Inference Acceleration Framework. The Medusa paper. Read for the parallel-heads pattern and tree attention motivation.
Why this matters
The concept-level lesson teaches the math. The internals lesson is what separates “I read the paper” from “I shipped a perf-cited spec-decoding PR.” Two questions every spec-decoding PR review will probe: (1) does your benchmark cover both low and high batch regimes? and (2) what’s your α distribution on production-like workloads, not just the paper benchmark? Engineers who can answer both are rare; that’s the gap this lesson closes.
For Year-1 OSS work, spec decoding is one of the highest-leverage contribution surfaces: new drafts (EAGLE successors are coming), per-architecture EAGLE ports (Qwen / DeepSeek / Mistral are tractable), and tree-attention kernel improvements (FlashInfer is the active layer). Each PR is sub-500-LOC if the design is right.
Mental model
Verifier kernel — exact contract
The verifier takes:
    queries:        [K, num_heads, head_dim]
    prefix_k_cache: [num_blocks, num_kv_heads, head_dim, block_size]
    prefix_v_cache: [num_blocks, num_kv_heads, head_dim, block_size]
    block_table:    [num_blocks_per_seq]
    seq_len:        n (prefix length)
    attn_mask:      [K, K] triangular causal, OR sparse tree mask
And produces:
    out:    [K, num_heads, head_dim]
              ↓ projection + sampling
    logits: [K, vocab]
The K queries attend to the prefix (via block_table) and to each other (via attn_mask). For chain attention (vanilla spec), the K-K mask is lower-triangular causal. For tree attention, it’s the sparse tree-structure mask.
Continuous batching contract
    class ForwardBatch:
        query_lens: List[int]              # 1, K, or prefill_len per request
        query_start_locs: List[int]        # CSR prefix sum
        seq_lens: List[int]                # cache length incl. spec K
        block_tables: List[List[int]]      # KV cache pointers
        spec_metadata: Optional[SpecInfo]  # tree masks, draft proposals
        is_spec: List[bool]                # spec verify or normal decode
The attention backend reads this, dispatches to the right kernel per request, runs one fused forward. Decode-1, prefill, and spec-verify all coexist in the same batch.
Tree attention — mask structure
For an EAGLE-3 tree of depth K with branching factor B:
- Total candidate tokens: 1 + B + B^2 + ... + B^(K-1) = (B^K - 1) / (B - 1) for B > 1, and K for a chain (B = 1)
- Each candidate’s attention scope: prefix + its parent path
- Mask is sparse: at most K * tree_size non-zero entries vs tree_size^2 for dense
Production configurations (rough):
| Config | K | B | Tree size | Speedup ceiling |
|---|---|---|---|---|
| Vanilla chain | 5 | 1 | 5 | 2.5× |
| EAGLE-3 chain | 5 | 1 | 5 | 2.7× |
| EAGLE-3 small tree | 5 | 2 | 31 | 2.9× |
| EAGLE-3 big tree | 5 | 4 | 341 | 3.1× (diminishing) |
| Medusa (parallel) | — | — | K heads × top-k | 1.8–2.0× |
Trees beyond ~64 candidates get expensive; the kernel’s mask handling cost catches up.
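The tree sizes in the table and the sparsity claim can both be reproduced from parent pointers. The sketch below builds a full tree with K levels (one root, then B children per node) and derives the candidate-to-candidate mask by walking each node's ancestor chain. Function names are hypothetical, the prefix columns are left out because every candidate attends to the whole prefix, and real kernels use packed or bitmask layouts rather than nested lists.

```python
def full_tree_parents(levels, branching):
    """Parent index per node: one root, `branching` children per node, `levels` levels."""
    parents, frontier = [-1], [0]
    for _ in range(levels - 1):
        next_frontier = []
        for node in frontier:
            for _ in range(branching):
                parents.append(node)
                next_frontier.append(len(parents) - 1)
        frontier = next_frontier
    return parents

def tree_mask(parents):
    """mask[i][j] is 1 when candidate i may attend to candidate j (itself or an ancestor)."""
    n = len(parents)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = 1
            j = parents[j]
    return mask

for levels, b in [(5, 1), (5, 2), (5, 4)]:
    parents = full_tree_parents(levels, b)
    nnz = sum(map(sum, tree_mask(parents)))
    print(f"B={b}: tree size={len(parents)}, non-zeros={nnz}, dense={len(parents) ** 2}")
```

A chain is the B = 1 special case, where the mask reduces to the lower-triangular causal mask. Each row has at most K ones, which is where the K * tree_size bound comes from.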
EAGLE-3 — implementation specifics
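    import torch.nn as nn

    # Draft architecture (simplified). nn.MultiheadAttention and the two-layer
    # MLP are stand-ins for the model's own attention and feed-forward blocks;
    # layer sizes are passed in by the caller.
    class EagleDraftLayer(nn.Module):
        def __init__(self, hidden_dim, num_heads, intermediate_dim, vocab_size, lm_head):
            super().__init__()
            self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            self.ffn = nn.Sequential(
                nn.Linear(hidden_dim, intermediate_dim),
                nn.SiLU(),
                nn.Linear(intermediate_dim, hidden_dim),
            )
            self.token_embed = nn.Embedding(vocab_size, hidden_dim)
            self.lm_head = lm_head  # shares the target's lm_head weights

        def forward(self, target_hidden, prev_token, attn_mask=None):
            # Take target's last hidden state + previous predicted token,
            # predict next token's logits.
            # target_hidden: [batch, seq, hidden_dim]; prev_token: [batch, seq]
            x = target_hidden + self.token_embed(prev_token)
            attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
            x = attn_out + x
            x = self.ffn(x) + x
            return self.lm_head(x)
Training: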
- Generate target’s hidden states for a corpus (one pass through training data).
- Train EagleDraftLayer on multi-step prediction (predict t+1, t+2, t+3 jointly).
- Loss: cross-entropy on each predicted position, weighted to emphasize correctness over depth.
Compute: training takes ~3–5 H100-hours per billion tokens of corpus. The output is a small (~1B for a 70B target) draft that lives alongside the target at inference.
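The multi-step objective described in the training steps can be written down as a weighted sum of per-step cross-entropies. This is a sketch in plain Python to show the shape of the loss, not the EAGLE-3 training code: the function names are hypothetical, and the geometric decay weighting is an illustrative assumption rather than the weighting the paper uses.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def multi_step_loss(step_logits, step_labels, decay=0.8):
    """Weighted mean of per-step cross-entropies for one training position.

    step_logits[d] are the draft's logits for position t+1+d, produced by
    feeding the draft its own earlier predictions; step_labels[d] is the
    ground-truth token at that position. Earlier steps get larger weights.
    """
    weights = [decay ** d for d in range(len(step_logits))]
    losses = [cross_entropy(l, y) for l, y in zip(step_logits, step_labels)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Three draft steps over a 4-token toy vocabulary.
logits = [[2.0, 0.1, 0.1, 0.1], [0.1, 1.5, 0.1, 0.1], [0.1, 0.1, 0.1, 0.5]]
labels = [0, 1, 3]
print(f"loss = {multi_step_loss(logits, labels):.4f}")
```

Training on the draft's own predictions at steps 2 and 3 is what exposes it to the inputs it will actually see during a draft round.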
Acceptance-rate engineering — full table
| Lever | Effect on α | Implementation knob |
|---|---|---|
| Vanilla 1B draft → EAGLE-3 | +5–10% | Switch method |
| Multi-step training (EAGLE-3) | +3–5% over EAGLE-2 | Training-time |
| Temperature 0.0 → 0.7 | -5–10% | Sampling param |
| Temperature 0.7 → 1.5 | -10–20% | Sampling param |
| K = 4 → 6 | Marginal change | num_speculative_tokens |
| Tree branching B = 1 → 2 | +5–10% | speculative_eagle_topk |
| Tree branching B = 2 → 4 | +2–5% | (diminishing) |
| Train on production logs vs web | +5–15% | Training-time |
| Domain mismatch (English → Spanish) | -10–20% | Fundamental |
The biggest production move: train EAGLE on logs from your actual workload. Most papers train on FineWeb / RedPajama; production traffic often differs.
High-batch regime — quantitative breakdown
Below the HBM saturation point, the verifier cost is approximately the cost of one decode forward (the FFN dominates, attention over K is small). Above saturation, attention scales with batch × query_len and dominates.
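A back-of-the-envelope roofline makes the saturation argument concrete. The sketch below is an assumption-heavy toy, not a benchmark: it models only dense weight reads and matmul FLOPs, pretends the weights sit behind a single device with H100-class peak figures (3.35 TB/s, 989 TFLOP/s), ignores KV-cache traffic, attention FLOPs, draft cost, and multi-GPU communication, and fixes expected_accepted at 3.0 for K = 5. Function names are hypothetical.

```python
def step_time_ms(tokens_per_step, params=70e9, bytes_per_param=2,
                 hbm_bytes_per_s=3.35e12, peak_flops=989e12):
    """Roofline estimate for one forward pass over the dense weights.

    The step takes whichever is longer: streaming the weights once from HBM,
    or doing 2 FLOPs per parameter for every query position in the step.
    """
    memory_s = params * bytes_per_param / hbm_bytes_per_s
    compute_s = 2 * params * tokens_per_step / peak_flops
    return 1e3 * max(memory_s, compute_s)

def spec_speedup(batch, k=5, expected_accepted=3.0):
    """Per-token time without spec decoding divided by per-token time with it."""
    baseline = step_time_ms(batch) / batch
    spec = step_time_ms(batch * k) / (batch * expected_accepted)
    return baseline / spec

for batch in (1, 8, 32, 64, 128, 512):
    print(f"batch={batch:3d}  modelled speedup={spec_speedup(batch):.2f}x")
```

Under these assumptions the modelled speedup equals expected_accepted while the verify step is memory-bound and decays toward expected_accepted / K once both runs are compute-bound. The batch at which it crosses 1.0× depends on the terms left out here, so the crossover this toy prints should not be read as a forecast.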
Empirical break-even for EAGLE-3 on 70B fp16, H100:
| Batch | Spec speedup | Throughput vs no-spec |
|---|---|---|
| 1 | 2.85× | 2.85× |
| 4 | 2.50× | 2.50× |
| 8 | 2.15× | 2.15× |
| 16 | 1.65× | 1.65× |
| 24 | 1.30× | 1.30× |
| 32 | 1.10× | 1.10× |
| 48 | 0.95× | 0.95× |
| 64 | 0.82× | 0.82× |
| 96 | 0.72× | 0.72× |
| 128 | 0.65× | 0.65× |
Break-even around batch 40–48. Production deployments serving high-throughput batched inference disable spec decoding above this threshold.
vLLM and SGLang code paths
vLLM (V1 engine)
    vllm/v1/spec_decode/
    ├── eagle.py           # EAGLE-3 proposer + verifier integration
    ├── medusa.py          # Medusa heads
    ├── mlp_speculator.py  # IBM-style MLP speculator
    ├── ngram.py           # PLD / prompt lookup
    └── interfaces.py      # SpeculativeProposer, SpeculativeVerifier ABCs
To add a new method: implement SpeculativeProposer.propose(batch) -> List[List[int]] and verifier integration in the worker.
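As an illustration of what a proposer with that propose signature might look like, here is a minimal prompt-lookup (n-gram) drafter. It is a standalone sketch: the class `NgramProposer` is hypothetical, it does not subclass or call anything from vLLM, and a batch is represented simply as a list of token-id lists.

```python
from typing import List

class NgramProposer:
    """Prompt-lookup drafting: reuse what followed the most recent earlier
    occurrence of the current n-gram. Hypothetical class, not vLLM's code."""

    def __init__(self, n: int = 3, num_speculative_tokens: int = 5):
        self.n = n
        self.k = num_speculative_tokens

    def propose_one(self, context: List[int]) -> List[int]:
        if len(context) <= self.n:
            return []
        suffix = context[-self.n:]
        # Scan right to left for an earlier occurrence of the suffix.
        for start in range(len(context) - self.n - 1, -1, -1):
            if context[start:start + self.n] == suffix:
                # Propose up to k tokens that followed that occurrence.
                return context[start + self.n:start + self.n + self.k]
        return []

    def propose(self, batch: List[List[int]]) -> List[List[int]]:
        return [self.propose_one(ctx) for ctx in batch]

proposer = NgramProposer(n=3, num_speculative_tokens=5)
print(proposer.propose([[1, 2, 3, 4, 5, 1, 2, 3], [7, 8, 9]]))
# [[4, 5, 1, 2, 3], []]
```

An empty proposal means the request falls back to a normal decode step, which is why this method carries no draft-model cost.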
SGLang
    python/sglang/srt/speculative/
    ├── eagle_worker.py    # EAGLE-3 worker (most active)
    ├── eagle_utils.py     # tree expansion, mask construction
    ├── spec_info.py       # per-request spec metadata
    └── topk_selection.py  # top-k sampling for tree expansion
SGLang’s tree-attention path is the production reference for EAGLE-3 with tree expansion. The eagle_utils.py mask construction is well-commented.
Key takeaways
- Verifier = one forward over K queries. Cost ≈ one decode at small batch; K× attention at high batch.
- Tree attention generalizes to multi-candidate paths. FlashInfer is the production kernel reference.
- EAGLE-3 wins via target hidden state + multi-step training. ~80% α at ~5% draft cost.
- α has four levers: draft quality, temperature, K, workload distribution. Production α is 5–15 points below benchmark α.
- Spec decoding helps at low batch and hurts at high batch; in the EAGLE-3 table above the break-even sits around batch 40–48. Two pools: spec-on for latency, spec-off for throughput.
Go deeper
- Paper: EAGLE-3: Scaling up Inference Acceleration
- Repo: SafeAILab/EAGLE Reference Implementation
- Paper: SpecInfer: Tree-based Speculative Inference
- Repo: FlashInfer, Hopper-tuned Attention Library
- Docs: vLLM Speculative Decoding Configuration
- Repo: vLLM V1 Spec-Decoding Source
- Blog: Fireworks AI, Multi-Token Prediction in Production
- Paper: Medusa: Simple LLM Inference Acceleration