Structured Output
Prereq: Structured Output (LLM Architecture) covers what the model does. This lesson covers what you do as the application engineer.
When you write response_format=Movie on an OpenAI client call — handing the SDK a Pydantic class and getting back a parsed Movie instance — three separate pieces of machinery line up to make that work. The class becomes a JSON Schema. The schema goes to a server that compiles it into a finite-state machine and uses it to mask logits during generation, so the model literally cannot emit malformed JSON. The bytes come back as text, get parsed into a dict, then run through Pydantic's runtime validation to catch anything semantically off — year=1850 when you said Field(ge=1900).
This is the applied, engineer-side complement to the LLM-architecture lesson on structured output: that one covers what the model is doing under the hood; this one covers the production stack you sit on top of. Every agent, tool-use, RAG-with-citations, or data-extraction product needs structured output. Pre-2024 the standard solution was "ask nicely + try/except + retry." That works about 90% of the time and breaks at scale. Modern stacks make 100% structural correctness the default: schema in, validated object out, no retries. Knowing the layered architecture is what lets you build agents that don't fall over.
TL;DR
- Reliable structured output requires three layers: schema definition (Pydantic / TypeBox / Zod) → constrained decoding on the server (XGrammar, Outlines) → validation + retry on the client (Instructor, structured-outputs APIs).
- Pydantic is the universal frontend. Define your schema as a Python class; libraries (Instructor, OpenAI's response_format, Anthropic's tool use, vLLM) consume it directly.
- The server doing constrained decoding is what matters most. Without it, you're trusting the model to produce valid JSON ~95% of the time and retrying the rest. With it, structurally valid output is guaranteed.
- For 2026 production: vLLM v1 / SGLang / TensorRT-LLM ship XGrammar; the OpenAI and Anthropic APIs enforce schemas natively. Default to response_format=PydanticClass; don't roll your own JSON parsing.
- Keep schemas simple, flat, and well-named. Deeply nested schemas with many enums confuse small models; LLM-friendly schemas align with how training data was shaped.
Mental model
Three checkpoints — schema, decoding, validation. Get one right and you’re at 95%; all three and you’re at 100%.
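The first and third checkpoints are plain Pydantic and run entirely client-side; only the middle one (constrained decoding) needs the server. A minimal sketch of those two, with no API involved:

```python
from pydantic import BaseModel, Field, ValidationError

class Movie(BaseModel):
    title: str
    year: int = Field(ge=1900)

# Checkpoint 1: the class compiles to the JSON Schema the server enforces.
schema = Movie.model_json_schema()
assert schema["properties"]["year"]["minimum"] == 1900

# Checkpoint 3: runtime validation catches values that are structurally
# valid JSON but semantically out of range.
try:
    Movie.model_validate({"title": "Metropolis", "year": 1850})
except ValidationError as err:
    print(err.error_count(), "validation error")  # → 1 validation error
```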
Pydantic + an LLM API
from pydantic import BaseModel
from openai import OpenAI

class MovieRecommendation(BaseModel):
    title: str
    year: int
    rating: float
    reasoning: str

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Pick a movie for someone who liked Inception."}],
    response_format=MovieRecommendation,
)
movie: MovieRecommendation = response.choices[0].message.parsed
print(movie.title, movie.rating)

response_format=MovieRecommendation does the whole job: the SDK extracts the JSON Schema and sends it to the API, the API enforces it via constrained decoding, and the result is validated and parsed into a Pydantic instance. No try/except, no retry.
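What the SDK actually ships to the API is the class's JSON Schema, which you can inspect yourself. (Per the OpenAI docs, strict mode additionally requires every property to be listed in required and additionalProperties: false; the SDK applies that transformation for you — check the current "Schema" requirements section.)

```python
import json
from pydantic import BaseModel

class MovieRecommendation(BaseModel):
    title: str
    year: int
    rating: float
    reasoning: str

# The schema the SDK extracts before calling the API.
schema = MovieRecommendation.model_json_schema()
print(sorted(schema["required"]))  # → ['rating', 'reasoning', 'title', 'year']
print(json.dumps(schema["properties"]["year"]))
```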
Anthropic’s tool-use shape
import anthropic
from pydantic import BaseModel

class MovieRecommendation(BaseModel):
    title: str
    year: int
    rating: float

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    tools=[{
        "name": "recommend_movie",
        "description": "Recommend a movie",
        "input_schema": MovieRecommendation.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "recommend_movie"},
    messages=[{"role": "user", "content": "Pick a movie for someone who liked Inception."}],
)
# Result is in response.content[0].input as a dict
movie = MovieRecommendation.model_validate(response.content[0].input)

Different surface, same machinery. The "tool" is really just a way to say "force the model to produce output matching this schema."
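The last line is ordinary Pydantic: model_validate takes the dict Claude produced and returns a typed, checked instance. Stubbing out the API response shows the mechanics:

```python
from pydantic import BaseModel

class MovieRecommendation(BaseModel):
    title: str
    year: int
    rating: float

# Shape of response.content[0].input when tool_choice forces the tool.
tool_input = {"title": "Interstellar", "year": 2014, "rating": 8.7}
movie = MovieRecommendation.model_validate(tool_input)
print(movie.year)  # → 2014

# Coercion is part of validation: numeric strings are accepted for int/float.
movie2 = MovieRecommendation.model_validate(
    {"title": "Tenet", "year": "2020", "rating": "7.3"}
)
print(movie2.year)  # → 2020
```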
Instructor — the universal wrapper
Instructor wraps every major LLM API behind one Pydantic-first interface:
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class Movie(BaseModel):
    title: str
    year: int

movie = client.chat.completions.create(
    model="gpt-4o",
    response_model=Movie,
    messages=[...],
)
# movie is a Movie instance — already parsed and validated

The same code works against Anthropic, Cohere, Google, Mistral, and locally served vLLM/SGLang/llama.cpp. Instructor handles the API-specific shape; you write Pydantic.
vLLM-side enforcement
When self-hosting:
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from outlines.serve.vllm import JSONLogitsProcessor

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")

class Movie(BaseModel):
    title: str
    year: int

processor = JSONLogitsProcessor(Movie, llm.get_tokenizer())
params = SamplingParams(logits_processors=[processor], temperature=0.7)

prompts = ["Pick a movie for someone who liked Inception."]
out = llm.generate(prompts, params)
# out[0].outputs[0].text is guaranteed JSON matching Movie

vLLM v1 ships XGrammar as the default constrained-decoding backend; the same response_format shape from the OpenAI client works against a vLLM server running the v1 engine.
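Constrained decoding is morally a per-step token mask: the grammar's current state zeroes out every token that could break the output. A deliberately tiny character-level sketch (hypothetical and vastly simplified — real backends like XGrammar and Outlines compile the full JSON Schema into an FSM over the tokenizer's vocabulary; here only the shape {"year": <1-4 digits>} is allowed):

```python
import json

PREFIX = '{"year": '

def allowed_chars(generated: str) -> set[str]:
    """The grammar's mask: which characters may come next?"""
    if len(generated) < len(PREFIX):
        return {PREFIX[len(generated)]}   # forced literal prefix
    if generated.endswith("}"):
        return set()                      # accepting state: generation done
    tail = generated[len(PREFIX):]
    if not tail:
        return set("123456789")           # first digit: no leading zero
    if len(tail) >= 4:
        return {"}"}                      # toy cap: force the close brace
    return set("0123456789}")

def decode(intended: str) -> str:
    """Greedy decode: take the model's 'preferred' char unless masked out."""
    out = ""
    for want in intended:
        mask = allowed_chars(out)
        if not mask:
            break
        out += want if want in mask else sorted(mask)[0]
    return out

print(decode('{"year": 1994}'))           # compliant model passes through
print(decode("Sure! The year is 1994."))  # rambling model is forced into shape
```

Both outputs parse as valid JSON; the rambling model simply gets worse numbers, which is exactly why constrained decoding guarantees structure but not semantics.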
Schema design — what works and what doesn’t
Works:
class Order(BaseModel):
    customer_name: str
    items: list[str]
    total_usd: float
    shipping_address: str

Flat, descriptive names, simple types. Models produce this at 99%+ reliability.
Hurts:
class Address(BaseModel):
    street: str
    city: str
    state: str
    zip: str

class CustomerProfile(BaseModel):
    name: str
    addresses: dict[str, Address]            # nested with arbitrary keys
    preferences: list[dict[str, list[str]]]  # nested, unstructured

Deeply nested schemas with arbitrary keys and dict[str, ...] are doable but reduce reliability. Small models (≤7B) really struggle. Flatten when you can.
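One flattening pattern (a sketch): replace arbitrary-key dicts with explicit lists of labeled objects, so every key the model must produce appears by name in the schema.

```python
from pydantic import BaseModel

class Address(BaseModel):
    label: str       # "home", "work", ... — was the arbitrary dict key
    street: str
    city: str
    state: str
    zip_code: str

class Preference(BaseModel):
    category: str
    values: list[str]

class CustomerProfile(BaseModel):
    name: str
    addresses: list[Address]        # every field name now lives in the schema
    preferences: list[Preference]

profile = CustomerProfile.model_validate({
    "name": "Ada",
    "addresses": [{"label": "home", "street": "1 Main St",
                   "city": "Springfield", "state": "IL", "zip_code": "62701"}],
    "preferences": [{"category": "genres", "values": ["sci-fi"]}],
})
print(profile.addresses[0].label)  # → home
```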
Helpful: rich descriptions:
from pydantic import BaseModel, Field

class Movie(BaseModel):
    title: str = Field(description="The movie title, in English")
    year: int = Field(description="Release year between 1900 and 2026")
    rating: float = Field(ge=0, le=10, description="Critic rating from 0 to 10")

Descriptions go into the JSON Schema and from there into the model's context. Small models especially benefit.
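You can confirm that both the descriptions and the ge/le bounds land in the schema the server sees:

```python
from pydantic import BaseModel, Field

class Movie(BaseModel):
    title: str = Field(description="The movie title, in English")
    rating: float = Field(ge=0, le=10, description="Critic rating from 0 to 10")

props = Movie.model_json_schema()["properties"]
print(props["title"]["description"])  # → The movie title, in English
print(props["rating"]["minimum"], props["rating"]["maximum"])
```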
Retry on validation failure (the rare case)
Even with constrained decoding, semantic validation can fail — the JSON is structurally valid but year is 1850 (out of your Field(ge=1900) range). For these:
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

client = instructor.from_openai(OpenAI())

class Movie(BaseModel):
    year: int = Field(ge=1900, le=2026)

movie = client.chat.completions.create(
    model="gpt-4o",
    response_model=Movie,
    max_retries=3,  # instructor handles validation-failure retries
    messages=[...],
)

max_retries=3 automatically re-prompts the model with the validation error on failure. Typically the second attempt succeeds. This integrates cleanly with Pydantic validation rules.
Streaming structured output
A long Pydantic object can be streamed:
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class StoryReview(BaseModel):
    title: str
    summary: str
    rating: int
    pros: list[str]
    cons: list[str]

partial_review = client.chat.completions.create_partial(
    model="gpt-4o",
    response_model=StoryReview,
    messages=[...],
)
for delta in partial_review:
    print(delta)  # incrementally filled StoryReview as it generates

Each yield is a partial Pydantic instance with as many fields as have been generated so far. Useful for UIs that want to show fields as they fill in.
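What the stream yields can be simulated locally: a model whose fields all have defaults validates cleanly against each successively larger fragment of the object. This is a toy stand-in for instructor's partial machinery, not its implementation:

```python
from pydantic import BaseModel

class StoryReview(BaseModel):
    title: str = ""
    summary: str = ""
    rating: int = 0

# Successive snapshots of the object as the model streams it out.
snapshots = [
    {"title": "Dune"},
    {"title": "Dune", "summary": "Epic."},
    {"title": "Dune", "summary": "Epic.", "rating": 9},
]
for snap in snapshots:
    partial = StoryReview.model_validate(snap)
    print(partial)  # a usable, progressively filled instance at every step
```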
When NOT to use structured output
For the model’s reasoning trace (R1-style, o-series), don’t constrain — that hurts quality. Constrain only the final answer extraction. The two-pass pattern:
# Pass 1: free-form reasoning
reasoning = client.chat.completions.create(model="o3", messages=[...])
reasoning_text = reasoning.choices[0].message.content

# Pass 2: structured extraction from the reasoning
final = client.beta.chat.completions.parse(
    model="gpt-4o",
    response_format=Answer,
    messages=[..., {"role": "assistant", "content": reasoning_text}],
)

This gives you the best of both: free reasoning, structured deliverable. It's the default pattern for any reasoning-heavy product.
A schema-shaped JSON validator in miniature
The pipeline shape — schema → JSON → validate → retry on semantic-only failures — is the entire structured-output stack in miniature.
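A minimal end-to-end version of that pipeline with stubbed generations, separating structural failures (bad JSON, which constrained decoding eliminates) from semantic ones (valid JSON, out-of-range value, which only validation catches). All names here are illustrative:

```python
import json
from pydantic import BaseModel, Field, ValidationError

class Movie(BaseModel):
    title: str
    year: int = Field(ge=1900)

def run_pipeline(raw_outputs: list[str]) -> Movie:
    for raw in raw_outputs:                     # each element = one (re)try
        try:
            data = json.loads(raw)              # structural check
        except json.JSONDecodeError:
            continue                            # impossible under constrained decoding
        try:
            return Movie.model_validate(data)   # semantic check
        except ValidationError:
            continue                            # the rare case worth a retry
    raise RuntimeError("all attempts failed")

movie = run_pipeline([
    '{"title": "Metropolis", "year": 1850',    # truncated: structural failure
    '{"title": "Metropolis", "year": 1850}',   # valid JSON: semantic failure
    '{"title": "Metropolis", "year": 1927}',   # passes both checks
])
print(movie.year)  # → 1927
```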
Key takeaways
- Three layers: Pydantic schema → constrained decoding → Pydantic validation.
- response_format=PydanticClass is the canonical OpenAI shape; tools=[...] is Anthropic's; Instructor unifies them.
- Schema design matters: flat, descriptive, with Field(description=...). Avoid deep nesting and arbitrary dict keys.
- Don't constrain reasoning traces; constrain only the final-answer extraction step.
- Streaming partial Pydantic objects is supported by Instructor — great for progressive UIs.
Go deeper
- Docs: OpenAI — Structured Outputs. Authoritative on response_format. The "Schema" requirements section is essential.
- Docs: Anthropic — Tool Use. How Claude does structured output via the tools shape.
- Docs: Instructor Documentation. The Pydantic-first wrapper around every LLM API. The "Production-Grade" section covers retries and streaming.
- Docs: Pydantic v2 Documentation. The schema layer. The "Models" and "Fields" pages cover everything you need.
- Docs: vLLM — Structured Outputs. Server-side flags for XGrammar; production knobs.
- Repo: jxnl/instructor. The library. Read examples/ for production patterns.
- Repo: dottxt-ai/outlines. The original FSM-based constrained-decoding library; still widely used.