Skip to content

Structured Output

Prereq: Structured Output (LLM Architecture) covers what the model does. This lesson covers what you do as the application engineer.

When you write response_format=Movie on an OpenAI client call — handing the SDK a Pydantic class and getting back a parsed Movie instance — three separate pieces of machinery line up to make that work. The class becomes a JSON Schema. The schema goes to a server that compiles it into a finite-state machine and uses it to mask logits during generation, so the model literally cannot emit malformed JSON. The bytes come back as text, get parsed into a dict, then run through Pydantic’s runtime validation to catch anything semantically off — year=1850 when you said Field(ge=1900).
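To make the first hop concrete, this is roughly the JSON Schema Pydantic derives from such a class (a minimal Movie with one constrained field; the printed output is abbreviated and approximate):

from pydantic import BaseModel, Field

class Movie(BaseModel):
    title: str
    year: int = Field(ge=1900)

print(Movie.model_json_schema())
# roughly: {'properties': {'title': {'title': 'Title', 'type': 'string'},
#                          'year': {'minimum': 1900, 'title': 'Year', 'type': 'integer'}},
#           'required': ['title', 'year'], 'title': 'Movie', 'type': 'object'}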

This is the applied, engineer-side complement to the LLM-architecture lesson on structured output: that lesson covers what the model does under the hood; this one covers the production stack you sit on top of. Every agent, tool-use, RAG-with-citations, or data-extraction product needs structured output. Pre-2024, the standard solution was “ask nicely + try/except + retry.” That gets you ~90% reliability and breaks down at scale. Modern stacks make 100% structural correctness the default — schema in, validated object out, no retries. Knowing the layered architecture is what lets you build agents that don’t fall over.

TL;DR

  • Reliable structured output requires three layers: schema definition (Pydantic / TypeBox / Zod) → constrained decoding on the server (XGrammar, Outlines) → validation + retry on the client (Instructor, structured-outputs APIs).
  • Pydantic is the universal frontend. Define your schema as a Python class; libraries (instructor, openai’s response_format, anthropic’s tool-use, vllm) consume it directly.
  • The server doing constrained decoding is what matters most. Without it, you’re trusting the model to produce valid JSON ~95% of the time and retrying the rest. With it, structural validity is guaranteed.
  • For 2026 production: vLLM v1 / SGLang / TensorRT-LLM ship XGrammar; OpenAI / Anthropic APIs enforce schemas natively. Default to response_format=PydanticClass; don’t roll your own JSON parsing.
  • Schemas should be simple, flat, well-named. Deeply-nested schemas with many enums confuse small models; LLM-friendly schemas align with how training data was shaped.

Mental model

Three checkpoints — schema, decoding, validation. Get one right and you’re at 95%; all three and you’re at 100%.
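In code terms, the three checkpoints line up like this (illustrative pseudocode — server_generate stands in for whatever constrained-decoding backend you call):

schema = Movie.model_json_schema()      # checkpoint 1: schema definition
raw = server_generate(prompt, schema)   # checkpoint 2: constrained decoding (hypothetical server call)
movie = Movie.model_validate_json(raw)  # checkpoint 3: runtime validation on the client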

Concrete walkthrough

Pydantic + an LLM API

from pydantic import BaseModel
from openai import OpenAI

class MovieRecommendation(BaseModel):
    title: str
    year: int
    rating: float
    reasoning: str

client = OpenAI()
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Pick a movie for someone who liked Inception."}],
    response_format=MovieRecommendation,
)
movie: MovieRecommendation = response.choices[0].message.parsed
print(movie.title, movie.rating)

response_format=MovieRecommendation does three things: the SDK extracts the JSON Schema and sends it to the API, the API enforces the schema via constrained decoding, and the result comes back validated and parsed into a Pydantic instance. No try/except, no retry.

Anthropic’s tool-use shape

import anthropic
from pydantic import BaseModel

class MovieRecommendation(BaseModel):
    title: str
    year: int
    rating: float

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    tools=[{
        "name": "recommend_movie",
        "description": "Recommend a movie",
        "input_schema": MovieRecommendation.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "recommend_movie"},
    messages=[{"role": "user", "content": "Pick a movie for someone who liked Inception."}],
)

# Result is in response.content[0].input as a dict
movie = MovieRecommendation.model_validate(response.content[0].input)

Different surface, same machinery. The “tool” is really just a way to say “force the model to produce output matching this schema.”

Instructor — the universal wrapper

Instructor wraps every major LLM API behind one Pydantic-first interface:

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class Movie(BaseModel):
    title: str
    year: int

movie = client.chat.completions.create(
    model="gpt-4o",
    response_model=Movie,
    messages=[...],
)
# movie is a Movie instance — already parsed and validated

Same code works against Anthropic, Cohere, Google, Mistral, locally-served vLLM/SGLang/llama.cpp. Instructor handles the API-specific shape; you write Pydantic.
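For example, switching to Anthropic changes only the constructor — the call shape stays Pydantic-first (a sketch based on current Instructor releases; the model name is illustrative):

import anthropic
import instructor
from pydantic import BaseModel

class Movie(BaseModel):
    title: str
    year: int

client = instructor.from_anthropic(anthropic.Anthropic())
movie = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    response_model=Movie,
    messages=[...],
)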

vLLM-side enforcement

When self-hosting:

from pydantic import BaseModel
from vllm import LLM, SamplingParams
from outlines.serve.vllm import JSONLogitsProcessor

class Movie(BaseModel):
    title: str
    year: int

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
processor = JSONLogitsProcessor(Movie, llm.get_tokenizer())
params = SamplingParams(logits_processors=[processor], temperature=0.7)

prompts = ["Pick a movie for someone who liked Inception. Reply in JSON."]
out = llm.generate(prompts, params)
# out[0].outputs[0].text is guaranteed JSON matching Movie

vLLM v1 ships XGrammar as the default constrained-decoding backend; the same response_format shape from the OpenAI client works against vLLM, given a v1+ engine config flag.
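Client-side, that can look like pointing the stock OpenAI SDK at your own endpoint (a sketch — the URL, API key, and exact structured-output support depend on your vLLM version and config):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Pick a movie for someone who liked Inception."}],
    response_format=MovieRecommendation,
)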

Schema design — what works and what doesn’t

Works:

class Order(BaseModel):
    customer_name: str
    items: list[str]
    total_usd: float
    shipping_address: str

Flat, descriptive names, simple types. Models produce this at 99%+ reliability.

Hurts:

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip: str

class CustomerProfile(BaseModel):
    name: str
    addresses: dict[str, Address]            # nested with arbitrary keys
    preferences: list[dict[str, list[str]]]  # nested and unstructured

Deeply-nested schemas with arbitrary keys and dict[str, ...] are doable but reduce reliability. Small models (≤7B) really struggle. Flatten when you can.
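One way to flatten the profile above: promote the arbitrary dict keys into an explicit field and simplify the preferences (a sketch, not the only option):

class FlatAddress(BaseModel):
    label: str   # "home", "work", ... — replaces the arbitrary dict key
    street: str
    city: str
    state: str
    zip: str

class CustomerProfile(BaseModel):
    name: str
    addresses: list[FlatAddress]   # explicit list instead of dict[str, Address]
    preferences: list[str]         # flat strings instead of nested dicts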

Helpful: rich descriptions:

from pydantic import BaseModel, Field

class Movie(BaseModel):
    title: str = Field(description="The movie title, in English")
    year: int = Field(description="Release year between 1900 and 2026")
    rating: float = Field(ge=0, le=10, description="Critic rating from 0 to 10")

Descriptions get into the JSON Schema and into the model’s context. Small models especially benefit.
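You can verify that the descriptions land in the schema the API sees (output approximate):

print(Movie.model_json_schema()["properties"]["rating"])
# roughly: {'description': 'Critic rating from 0 to 10', 'maximum': 10,
#           'minimum': 0, 'title': 'Rating', 'type': 'number'}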

Retry on validation failure (the rare case)

Even with constrained decoding, semantic validation can fail — the JSON is structurally valid but year is 1850, outside your Field(ge=1900) range. For those cases:

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

client = instructor.from_openai(OpenAI())

class Movie(BaseModel):
    year: int = Field(ge=1900, le=2026)

movie = client.chat.completions.create(
    model="gpt-4o",
    response_model=Movie,
    max_retries=3,  # instructor handles validation-failure retries
    messages=[...],
)

max_retries=3 automatically re-prompts the model with the validation error on failure; typically the second attempt succeeds. It integrates cleanly with Pydantic validation rules.
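The same retry loop exercises custom validators too, so business rules can live in the schema — the error message becomes part of the retry prompt (a sketch using Pydantic v2’s field_validator):

from pydantic import BaseModel, Field, field_validator

class Movie(BaseModel):
    title: str
    year: int = Field(ge=1900, le=2026)

    @field_validator("title")
    @classmethod
    def title_not_empty(cls, v: str) -> str:
        if not v.strip():
            # this message is what gets fed back to the model on retry
            raise ValueError("title must be a non-empty string")
        return v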

Streaming structured output

A long Pydantic object can be streamed:

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class StoryReview(BaseModel):
    title: str
    summary: str
    rating: int
    pros: list[str]
    cons: list[str]

partial_review = client.chat.completions.create_partial(
    model="gpt-4o",
    response_model=StoryReview,
    messages=[...],
)
for delta in partial_review:
    print(delta)  # incrementally filled StoryReview as it generates

Each yield is a partial Pydantic instance with as many fields as have been generated so far. Useful for UIs that want to show fields as they fill in.
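Fields not yet generated come back as None, so a UI can render incrementally (a sketch — render_field is a hypothetical UI hook):

for delta in partial_review:
    for name, value in delta:         # Pydantic models iterate as (name, value) pairs
        if value is not None:
            render_field(name, value)  # hypothetical: update the UI for this field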

When NOT to use structured output

For the model’s reasoning trace (R1-style, o-series), don’t constrain — that hurts quality. Constrain only the final answer extraction. The two-pass pattern:

# Pass 1: free-form reasoning
reasoning = client.chat.completions.create(model="o3", messages=[...])
reasoning_text = reasoning.choices[0].message.content

# Pass 2: structured extraction from the reasoning
final = client.beta.chat.completions.parse(
    model="gpt-4o",
    response_format=Answer,
    messages=[..., {"role": "assistant", "content": reasoning_text}],
)

This gives you the best of both: free reasoning, structured deliverable. Default for any reasoning-heavy product.

Worked example — a schema-shaped JSON validator

The example below hand-rolls a validator with the same shape as Pydantic — no library dependency.
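A minimal sketch, assuming a flat type-map schema — SCHEMA and validate are illustrative names, not a real library:

import json

SCHEMA = {"title": str, "year": int, "rating": float}

def validate(raw: str, schema: dict) -> dict:
    data = json.loads(raw)  # structural check: is it JSON at all?
    for key, typ in schema.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], typ):
            raise TypeError(f"{key}: expected {typ.__name__}, got {type(data[key]).__name__}")
    # semantic check, the analogue of Field(ge=1900) — this is what triggers a retry
    if not (1900 <= data["year"] <= 2026):
        raise ValueError("year out of range")
    return data

movie = validate('{"title": "Tenet", "year": 2020, "rating": 7.5}', SCHEMA)
print(movie["title"])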

The pipeline shape — schema → JSON → validate → retry on semantic-only failures — is the entire structured-output stack in miniature.

Quick check

Fill in the blank
The Python library that wraps every major LLM API behind a Pydantic-first interface (response_model=YourClass):
Single word; named after the role of guiding the LLM.
Quick check
A team's chat product needs to output recommendations as structured JSON. They prompt 'respond in JSON' and parse with json.loads. ~5% of responses fail. Best fix:

Key takeaways

  1. Three layers: Pydantic schema → constrained decoding → Pydantic validation.
  2. response_format=PydanticClass is the canonical OpenAI shape; tools=[...] is Anthropic’s; Instructor unifies them.
  3. Schema design matters: flat, descriptive, with Field(description=...). Avoid deep nesting and arbitrary dict keys.
  4. Don’t constrain reasoning traces; constrain only the final-answer extraction step.
  5. Streaming partial Pydantic objects is supported by Instructor — great for progressive UIs.

Go deeper
