Multimodal (Vision)
A scanned PDF invoice lands in your bucket. Twenty years ago you’d reach for Tesseract and a regex; five years ago you’d train a layout-aware OCR model on a thousand labelled examples; in 2026 you write client.messages.create(content=[{"type":"document",...}, {"type":"text","text":"extract line items as JSON"}]) and a frontier model hands you ~99% of what a human would extract. The model doing this is not a vision system bolted onto a language system — it’s the same transformer decoder you’ve already learned, fed token sequences from a vision encoder instead of a tokenizer.
The architectural punchline is that vision is not a special case. A ViT chops the image into ~256–576 patches, projects each into the LLM’s token-embedding space, and from there the decoder can’t tell which tokens were pixels and which were words. Every text-side technique you already know — RAG, tool use, structured output via Pydantic — composes with vision unchanged. That’s why “make it multimodal” went from a research project to a content-block field on the API in three years.
This lesson is the production layer of that fact: which open weights actually work in 2026, what document understanding costs in tokens and dollars, and where VLMs still fail in ways your eval has to catch.
TL;DR
- A vision-language model (VLM) processes images alongside text. Architecture: image encoder (ViT or CNN) → projection layer → transformer decoder. Works exactly like text-only LLMs at the decoder stage.
- CLIP (OpenAI, 2021) was the breakthrough — contrastive image-text training producing aligned embeddings. Still the foundation of most vision retrieval; SigLIP (Google, 2023) is the modern improvement.
- Frontier VLMs in 2026: Claude Sonnet/Opus 4 with vision, GPT-4o, Gemini 2.x, Llama 3.2-Vision, Qwen2-VL, InternVL2 — all natively multimodal, not text-then-vision adapters.
- Document understanding (PDFs, charts, screenshots) is the killer app: extract structured data from unstructured visual content. Used heavily in agents that need to “see” software UIs.
- For 2026 production: prefer hosted VLMs for general use; self-host Qwen2-VL or InternVL2 when on-device or compliance matters.
Why this matters
Roughly half of useful enterprise data is locked in PDFs, screenshots, charts, and PowerPoints. Pre-2024, extracting it required OCR + heuristics — fragile and lossy. Modern VLMs read all of it natively. Multimodal is no longer a special case in production AI; it’s the default. Knowing how VLMs work, what they cost, and where they fail is what lets you build “extract everything from this PDF” features that actually work.
Mental model
Image becomes token sequence; text becomes token sequence; LLM decoder doesn’t care which is which.
Concrete walkthrough
What CLIP / SigLIP do
CLIP trains an image encoder and a text encoder jointly to produce vectors that are close when the image and text are paired. Output: two embedding spaces aligned. Used for:
- Zero-shot image classification: encode the image, encode candidate labels, pick max cosine similarity.
- Image search: encode all images offline; encode the query text; cosine ranking.
- Image clustering / dedup: same as text embeddings.
CLIP at ViT-L/14: 768-dim embeddings, ~75% top-1 ImageNet zero-shot. SigLIP (Sigmoid Loss for Language-Image Pre-training) hits ~80% with better training stability. Both are open and small (~400M params).
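A minimal zero-shot classification sketch using the Hugging Face transformers CLIP classes; the label set and image path are illustrative:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# openai/clip-vit-large-patch14 is the ViT-L/14 checkpoint discussed above.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

labels = ["an invoice", "a bar chart", "a screenshot of a web app", "a photo of a cat"]
image = Image.open("page.png")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```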
CLIP/SigLIP is not a generative model — they don’t produce text from images. For that, you need a VLM.
How modern VLMs work (Llama 3.2-Vision, Qwen2-VL, etc.)
The pattern, in every 2024+ frontier VLM:
- Image encoder: ViT-L/14 or similar. Input: image. Output: ~256–576 patch embeddings (depending on resolution).
- Projection layer: 1–2 linear layers (sometimes a small MLP) mapping image embeddings into the text token-embedding space.
- Decoder LLM: standard transformer decoder. Image tokens are interleaved with text tokens in the input sequence. Decoder generates text autoregressively.
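A shape-only sketch of those three stages, using toy dimensions and random weights rather than any specific model's configuration:

```python
import torch
import torch.nn as nn

# Toy dimensions, chosen only to make the shapes visible.
n_patches, d_vision, d_model, vocab_size = 256, 1024, 4096, 32000

# 1. Image encoder output: one embedding per patch (stand-in for a ViT).
patch_embeddings = torch.randn(1, n_patches, d_vision)

# 2. Projection layer: map vision embeddings into the text token-embedding space.
projector = nn.Sequential(nn.Linear(d_vision, d_model), nn.GELU(), nn.Linear(d_model, d_model))
image_tokens = projector(patch_embeddings)                  # (1, 256, 4096)

# 3. Text tokens embedded as usual, then concatenated with the image tokens.
text_ids = torch.randint(0, vocab_size, (1, 12))
text_tokens = nn.Embedding(vocab_size, d_model)(text_ids)   # (1, 12, 4096)

decoder_input = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 268, 4096)
# From here the decoder sees one sequence of embeddings; nothing marks which came from pixels.
print(decoder_input.shape)
```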
Training: large image-text pair datasets (LAION-5B + filtered web crawls). The vision encoder is sometimes pretrained CLIP/SigLIP-style first, then jointly fine-tuned.
The architectural cleanliness: image processing is a pre-processing step; the decoder doesn’t know it’s “doing vision.” This is why all the LLM-side techniques (RAG, tool use, structured output) compose seamlessly with vision.
Frontier VLMs (April 2026)
| Model | Open? | Resolution | Strengths |
|---|---|---|---|
| Claude Sonnet/Opus 4 vision | Closed (API) | up to 8K | Strong on charts, complex documents |
| GPT-4o / GPT-5 | Closed | dynamic | High-quality general-purpose; OCR-strong |
| Gemini 2.x | Closed | up to 2M context with images | Long-document multimodal |
| Llama 3.2-Vision (11B / 90B) | Open | 1024 | Best open general-purpose VLM through 2024 |
| Qwen2-VL (2B / 7B / 72B) | Open | up to 2K | Strong on Chinese; great OCR |
| InternVL2 (2B–76B) | Open | up to 4K | Best open document-understanding |
| Pixtral (12B) | Open (Mistral) | 1024 | Strong general-purpose; recent |
For most products: Claude Sonnet 4 vision (API) or Qwen2-VL-7B / Llama-3.2-Vision-11B (self-hosted) is the right starting point.
Document understanding — the production use case
```python
from anthropic import Anthropic
import base64

client = Anthropic()

# Base64-encode the PDF so it can be sent as a document content block.
with open("invoice.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
            {"type": "text", "text": "Extract: invoice number, date, total, line items as structured JSON."},
        ],
    }],
)
```

Claude’s vision processes the PDF page-by-page, OCRs text, recognizes table structure, and returns the requested JSON. For typical invoices, ~99% accuracy on structured fields.
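To get the structured-output step the intro mentions, you can validate the reply with Pydantic. A minimal follow-up to the call above; the schema fields are illustrative, and the JSON-substring extraction is a simple heuristic:

```python
from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: float
    amount: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float
    line_items: list[LineItem]

# The reply is plain text; pull out the JSON object (models sometimes wrap it in a code fence).
raw = response.content[0].text
payload = raw[raw.index("{"): raw.rindex("}") + 1]
invoice = Invoice.model_validate_json(payload)
print(invoice.total, len(invoice.line_items))
```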
Same flow with Qwen2-VL self-hosted: convert PDF to images → feed into the model with the same instruction.
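A minimal self-hosted sketch using the Hugging Face Qwen2-VL integration plus pdf2image for page rendering; the model ID, DPI, and single-page assumption are illustrative:

```python
from pdf2image import convert_from_path
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Render the first PDF page at high DPI so dense text stays legible.
page = convert_from_path("invoice.pdf", dpi=300)[0]
page.save("invoice_page_1.png")

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "invoice_page_1.png"},
        {"type": "text", "text": "Extract: invoice number, date, total, line items as structured JSON."},
    ],
}]

# Same chat-template flow as text-only models, with the image resolved by qwen_vl_utils.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```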
Charts and graphs
Modern VLMs read charts well. The trick: ask precisely.
Bad: "What does the chart show?"
Better: "What was the Q3 2024 revenue in millions? Read the y-axis exactly."Frontier VLMs (Claude, Gemini, GPT-4o) hit ~85% accuracy on benchmark chart-QA tasks (ChartQA). Open VLMs (Qwen2-VL, Llama-3.2-Vision) are at ~75%. For chart-reading: prefer hosted; if self-hosting, use 70B+ class models.
When VLMs fail
- Tiny text: dense PDFs rendered at low DPI get misread. Convert the PDF to high-DPI images first.
- Hand-drawn diagrams / sketches: variable; test on representative samples.
- Math / equations: better than 2023 but still sometimes hallucinates symbols. For LaTeX extraction, Mathpix or specialized models still beat general VLMs.
- Adversarial images / spoofing: generally weak. Don’t deploy as a security gatekeeper.
Cost / latency
A hosted VLM call with one ~1MP image:
- ~1500 tokens for the image (varies by provider)
- ~500–2000 tokens output
- ~$0.005–0.02 per call at 2026 rates
Latency: ~2–5 s for the response. Don’t put VLM calls in synchronous request paths; use background processing.
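A back-of-the-envelope sketch of that per-call estimate; the per-token rates below are illustrative placeholders, not any specific provider's price list:

```python
# Illustrative per-token rates (check your provider's current pricing).
input_rate = 3.00 / 1_000_000    # $ per input token
output_rate = 15.00 / 1_000_000  # $ per output token

image_tokens = 1500   # one ~1MP image, per the estimate above
prompt_tokens = 200   # instruction text
output_tokens = 1000  # extracted JSON

cost = (image_tokens + prompt_tokens) * input_rate + output_tokens * output_rate
print(f"${cost:.4f} per call")   # ~= $0.02 with these rates
```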
Run it in your browser — image-as-tokens shape
The shape — image patches as a token sequence — is the entire bridge between vision and language. A 224×224 image at patch=14 is 256 tokens.
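A standalone NumPy sketch of that arithmetic (random image and random projection weights, purely to show the shapes):

```python
import numpy as np

H = W = 224          # input image size
P = 14               # patch size (ViT-L/14 style)
C = 3                # RGB channels
d_model = 1024       # toy embedding width

n_patches = (H // P) * (W // P)          # 16 * 16 = 256 "image tokens"

img = np.random.rand(H, W, C)

# Split the image into non-overlapping P x P patches and flatten each one.
patches = (
    img.reshape(H // P, P, W // P, P, C)
       .transpose(0, 2, 1, 3, 4)
       .reshape(n_patches, P * P * C)    # (256, 588)
)

# A random linear projection stands in for the ViT + projection layer.
W_proj = np.random.rand(P * P * C, d_model)
image_tokens = patches @ W_proj          # (256, 1024): same shape as text token embeddings

print(n_patches, image_tokens.shape)
```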
Quick check
Key takeaways
- VLMs = image encoder → projection → LLM decoder. Image becomes token sequence; LLM doesn’t care.
- CLIP / SigLIP are the embedding-only foundation; modern VLMs (Llama-3.2-V, Qwen2-VL, Claude vision) are generative.
- Document understanding is the killer use case. PDF → JSON via VLM + structured output, ~99% accuracy.
- Hosted (Claude / GPT / Gemini) for highest quality; Qwen2-VL / Llama-3.2-V for self-hosted.
- Cost ~$0.005–0.02 per call for hosted; latency 2–5 s. Use async pipelines.
Go deeper
- Paper: "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). The CLIP paper. Section 2 has the contrastive loss; section 4 has the zero-shot recipe.
- Paper: "Sigmoid Loss for Language Image Pre-Training" (SigLIP). CLIP's improved successor. Section 3 explains why sigmoid loss is more training-stable than softmax.
- Paper: "Qwen2-VL Technical Report". Best open-VLM technical paper. The dynamic-resolution strategy is novel.
- Paper: "Llama 3.2 — Vision". Meta's VLM technical writeup; covers the 11B and 90B variants.
- Docs: Anthropic — Vision. How to use Claude's vision API. The cost / token-counting section is essential.
- Blog: Hugging Face — Vision-Language Model Survey. Living survey of open VLMs with benchmarks. Updated frequently.
- Repo: QwenLM/Qwen2-VL. Reference inference + fine-tuning code for the Qwen2-VL family.