Structured Output
Prereqs: Sampling. Constrained decoding is another mask applied between the model’s logits and the sampled token.
When you call client.beta.chat.completions.parse(response_format=Movie) against a vLLM v1 server, you’ve handed the engine a Pydantic model and asked it to guarantee the output is a parseable Movie. Behind the scenes, vLLM hands the schema to its grammar backend, specifically an engine like XGrammar, which compiles the schema into a finite-state machine. At every decode step, between the model producing logits and the sampler picking a token, the FSM emits a mask: every token that would violate the grammar gets logit -∞. The model literally cannot emit malformed JSON. Output isn’t probably parseable; it’s parseable by construction.
This matters because “respond in JSON” prompts still produce malformed output 5–15% of the time on current frontier models, and a 5% malformed rate at 1M requests/day is 50,000 production errors. Every agent, every tool-use stack, every JSON-out API endpoint either runs on constrained decoding or runs on prayer. The historical answer, “ask nicely, retry on parse failure,” was always a bandage; the 2024 wave of fast grammar engines (XGrammar, Outlines v0.1, vLLM’s grammar backend) made constrained decoding fast enough that there is no longer a tradeoff: per-step overhead is sub-millisecond on a 128K-token vocab. In 2026, the question is no longer “should I use constrained decoding?” but “which engine, and at what point in my pipeline?”
TL;DR
- “Respond in JSON” prompts still produce malformed output 5–15% of the time on current frontier models. That’s a production failure, not a glitch.
- Constrained decoding compiles your schema/grammar into a finite-state machine, then at every decode step masks logits for tokens that would violate the grammar. Output is guaranteed parseable.
- XGrammar (2024) is the current speed champion: a context-free grammar engine with a token-mask cache that makes the overhead under 5% of decode time. Default in vLLM v1, SGLang, TensorRT-LLM.
- Outlines (2023) pioneered the FSM-mask approach. Slower than XGrammar in some cases; still widely used for its Python-level ergonomics.
- llama.cpp’s GBNF is the same idea for the local-LLM world. Same speed regime, same correctness guarantee.
- Don’t use it on reasoning traces. Constrained decoding hurts free-form thought; apply it only at the final-answer extraction step.
Mental model
The FSM is the memory of the constraint; the per-step token-mask is the enforcement. Sampling itself is unchanged — temperature, top-p, min-p still apply, but only over the valid tokens.
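A minimal numpy sketch of that mental model; masked_sample and valid_ids are illustrative names, not any engine’s API:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_sample(logits, valid_ids, temperature=0.8):
    """Ordinary temperature sampling, restricted to grammar-valid tokens."""
    masked = np.full_like(logits, -np.inf)
    masked[valid_ids] = logits[valid_ids]       # invalid tokens get logit -inf
    z = masked / temperature                    # temperature applies as usual
    p = np.exp(z - z[valid_ids].max())          # stable softmax; exp(-inf) == 0
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))

logits = rng.standard_normal(6)                 # stand-in for a 6-token vocab
print(masked_sample(logits, np.array([2, 5])))  # always prints 2 or 5
```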
From JSON schema to FSM
A JSON schema like {"name": str, "age": int} becomes a regular language: \{"name": "[^"]*", "age": [0-9]+\} (simplified). A regular language has a deterministic finite automaton — every state knows which characters lead to which next states.
The work the engine does once at compile time:
- Parse the grammar/schema into a context-free grammar (CFG).
- Convert to a pushdown / finite automaton (depending on grammar class).
- For every (state, token) pair in the model’s vocabulary, precompute whether that token is valid. This produces a 2D mask:
mask[state, token_id] ∈ {0, -∞}.
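A sketch of that one-time compile step, assuming a character-level FSM exposed as a transition(state, byte) function that returns the next state or None (all names here are illustrative):

```python
import numpy as np

def build_mask_table(num_states, transition, vocab):
    """One-time compile: mask[state, token_id] = 0 if the token's entire
    byte sequence is a valid walk from `state`, else -inf."""
    mask = np.full((num_states, len(vocab)), -np.inf, dtype=np.float32)
    for state in range(num_states):
        for token_id, token_bytes in enumerate(vocab):
            s = state
            for b in token_bytes:            # walk the FSM byte by byte
                s = transition(s, b)
                if s is None:                # dead end: token is invalid here
                    break
            else:                            # no break: every byte had a transition
                mask[state, token_id] = 0.0
    return mask
```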
The work the engine does at every decode step:
```python
mask = grammar_state_to_mask[fsm_state]         # cached; O(V) memory
logits += mask                                  # vectorized; fused in production kernels
token = sample(logits, temperature, top_p, min_p)
fsm_state = grammar_advance(fsm_state, token)   # O(1)
```

Per-step cost: one vector add + one state transition. This is why it’s nearly free when the cache is warm.
Tokens, not characters
The trickiest part. The grammar is defined over characters (or bytes), but the model emits tokens, and each token is an arbitrary string of bytes. A single token may cross multiple grammar states (the token `": ` in the Llama tokenizer simultaneously closes a string, emits a colon, and begins the following whitespace).
So the mask isn’t “is byte b valid?”; it’s “is the entire byte-sequence of token t a valid extension from this FSM state?”. Computing that lookup table is O(V × states × max_token_length) at compile time. For Llama-3 (128K tokens) and a moderate JSON schema, this is a few hundred milliseconds, one-time, fully cached after.
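As a concrete toy, a one-state FSM for the grammar [0-9]+ makes the per-token (not per-byte) check visible; this code is illustrative, not any engine’s API:

```python
def transition(state, byte):
    """Character-level FSM for the grammar [0-9]+ (one accepting loop state)."""
    return 1 if 0x30 <= byte <= 0x39 else None   # bytes '0'..'9'

def token_valid(state, token):
    """Is the ENTIRE byte sequence of this token a valid walk from `state`?"""
    for b in token:
        state = transition(state, b)
        if state is None:
            return False
    return True

assert token_valid(0, b"123")        # multi-byte token of digits: allowed
assert not token_valid(0, b"12a")    # one bad byte poisons the whole token
```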
XGrammar’s headline contribution: a persistent grammar-state cache that handles ambiguous splits efficiently and shares work across requests. The cache and the precomputed token-validity table are why XGrammar’s per-step overhead is sub-millisecond on a 128K-token vocab.
Three engines, one idea
| Engine | Grammar class | Compile speed | Per-step overhead | Notes |
|---|---|---|---|---|
| Outlines | Regex / JSON schema → FSM | slow (seconds) on large schemas | 1–3 ms | Easy Python API; great for prototyping. |
| XGrammar | CFG (BNF) | fast (~100 ms) | ~0.1 ms | Default in vLLM v1. Handles full GBNF + JSON schema. |
| llama.cpp GBNF | CFG (GBNF) | fast | ~0.5 ms | Local-LLM world; same speed regime. |
| LM Format Enforcer | regex | slow | 1–5 ms | Earlier alternative; mostly subsumed. |
In production today (April 2026), XGrammar is the answer for vLLM/SGLang stacks; GBNF is the answer for llama.cpp. Outlines lives on for ergonomic reasons (Pydantic models compile cleanly to it) but is increasingly a frontend whose backend is XGrammar.
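On a vLLM stack the schema can also be passed directly, without client-side Pydantic sugar, via the extra_body fields described in vLLM’s structured-outputs docs; knob names have shifted across versions, so treat this sketch as indicative and check your server’s docs:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Emit one movie as JSON."}],
    extra_body={
        "guided_json": {                      # raw JSON schema, enforced server-side
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "year": {"type": "integer"},
            },
            "required": ["title", "year"],
        }
    },
)
print(resp.choices[0].message.content)        # parseable by construction
```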
The Pydantic / instructor flow
The pattern most teams use:
```python
from pydantic import BaseModel
from openai import OpenAI

class Movie(BaseModel):
    title: str
    year: int
    rating: float

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

resp = client.beta.chat.completions.parse(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Pick a movie I should watch tonight."}],
    response_format=Movie,  # Pydantic model auto-converted to JSON schema
)
movie: Movie = resp.choices[0].message.parsed
```

Under the hood, the server (vLLM v1, with XGrammar) compiles the JSON schema corresponding to Movie, applies the per-step mask, and returns a guaranteed-parseable response. The Pydantic instance is hydrated with no try/except.
For agentic tool use, the schema is the OpenAI function-call schema; the engine constrains the model to emit a valid function call. Same machinery, different schema.
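As a sketch, the same Movie fields expressed as an OpenAI-style function schema; the pick_movie tool name is invented for illustration, and client is the one constructed above:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "pick_movie",                 # hypothetical tool
        "description": "Recommend one movie to watch.",
        "parameters": {                       # an ordinary JSON schema
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "year": {"type": "integer"},
                "rating": {"type": "number"},
            },
            "required": ["title", "year", "rating"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Pick a movie I should watch tonight."}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "pick_movie"}},
)
args = resp.choices[0].message.tool_calls[0].function.arguments
# `args` is grammar-valid JSON for `parameters` by construction.
```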
When NOT to use it
Three cases where constrained decoding hurts:
- Reasoning traces (R1, o-series). The model needs free-form thought before the answer. Constrain only the final answer extraction, never the trace.
- Open-ended creative generation. Schemas are constraints; constraints are anti-creativity.
- When the schema doesn’t match the model’s prior. If you constrain Llama to emit a 50-field JSON it has never seen during training, the FSM keeps it on rails but quality plummets — you’ve told the model what to write but not how. Few-shot examples help.
The general rule: constrain at API boundaries, not in the middle of thought.
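A sketch of that boundary rule as a two-pass flow; the model name, prompts, and Answer schema are placeholders, and client is the OpenAI client from the Pydantic example above:

```python
from pydantic import BaseModel

class Answer(BaseModel):
    value: float                        # whatever shape your task needs

question = "What is 37 * 41?"

# Pass 1: unconstrained. No grammar, no mask; the trace is free-form.
trace = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

# Pass 2: constrained. The grammar only ever touches the extraction step.
answer = client.beta.chat.completions.parse(
    model="deepseek-r1",
    messages=[
        {"role": "user", "content": question},
        {"role": "assistant", "content": trace},
        {"role": "user", "content": "Extract the final answer as JSON."},
    ],
    response_format=Answer,
).choices[0].message.parsed
```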
Run it in your browser
A toy FSM constrainer for a tiny vocab. Watch the mask shape as the FSM walks through valid JSON.
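A minimal, self-contained version of that loop, with a hand-written five-state FSM and random logits standing in for the model:

```python
import numpy as np

# Tiny vocab and a hand-written FSM for the toy grammar: {"a": <digits>}
vocab = ['{', '"a"', ':', '1', '2', '}']
# mask_for_state[s][t] = True if token t is grammar-valid in state s
mask_for_state = np.array([
    # {      "a"    :      1      2      }
    [True,  False, False, False, False, False],   # 0: expect '{'
    [False, True,  False, False, False, False],   # 1: expect the key
    [False, False, True,  False, False, False],   # 2: expect ':'
    [False, False, False, True,  True,  False],   # 3: expect first digit
    [False, False, False, True,  True,  True ],   # 4: more digits, or close
])
next_state = {(0, 0): 1, (1, 1): 2, (2, 2): 3, (3, 3): 4, (3, 4): 4,
              (4, 3): 4, (4, 4): 4, (4, 5): 5}   # state 5 = accept

rng = np.random.default_rng(0)
state, out = 0, []
while state != 5:
    logits = rng.standard_normal(len(vocab))      # stand-in for the LLM
    logits[~mask_for_state[state]] = -np.inf      # the grammar mask
    token = int(np.argmax(logits))                # greedy, for determinism
    out.append(vocab[token])
    state = next_state[(state, token)]

print("".join(out))   # always parses, e.g. {"a":21}
```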
In production this same loop runs but with mask_for_state precomputed for every (state, full_vocab_token) pair, and the model is the LLM rather than rng.standard_normal. The shape is identical.
Key takeaways
- Constrained decoding turns parse failures into impossibilities. Output is always grammar-valid by construction.
- The mechanism is a per-step token mask driven by an FSM state. Sampling is unchanged — temperature, top-p, min-p still apply over the valid set.
- XGrammar is the speed champion in 2026. Outlines pioneered the technique; XGrammar, llama.cpp’s GBNF, and LM Format Enforcer all converge on the same idea, with XGrammar fastest on large vocabularies.
- Tokenizer-aware masking is the hard part. The FSM operates on chars, the model emits tokens — that mismatch is the work the precomputed table absorbs.
- Don’t constrain reasoning, only final answers. Free thought first, constraint at the boundary. The two-pass setup (R1-style trace, then constrained extraction) is the production pattern.
Go deeper
- Paper: “XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models.” The XGrammar paper. Section 4 has the grammar-state cache that delivers the speedup.
- Paper: “Efficient Guided Generation for Large Language Models.” The Outlines paper. The original FSM-mask formulation.
- Blog: dottxt, “Coalescence in Constrained Decoding.” Best explanation of the token-vs-character mismatch and how engines solve it.
- Docs: vLLM, “Structured Outputs.” Production docs. Backends, knobs, the response_format API.
- Docs: llama.cpp, “GBNF Grammars.” The local-LLM equivalent. Same idea, GBNF syntax instead of JSON schema.
- Repo: mlc-ai/xgrammar. Reference implementation. `xgrammar/grammar.py` and `xgrammar/grammar_matcher.py` are the FSM and the mask.
- Repo: dottxt-ai/outlines. The pioneering library. Pydantic-first ergonomics; backends include XGrammar and llama.cpp.