Prompt Engineering Foundations
When you call client.messages.create(messages=[{"role": "user", "content": "..."}], model="claude-sonnet-4-5") — or the OpenAI-shaped client.chat.completions.create(...) — the entire surface area you have to steer the model is that messages array. No fine-tuning, no infrastructure, no extra dollars. Just words. A good prompt is often the difference between 60% and 90% task accuracy on the same model — the cheapest, fastest lever in your stack.
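To make that concrete, here is a minimal sketch of such a call using the Anthropic Python SDK; the review text and max_tokens value are placeholders, not a recommendation:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Classify the sentiment of: 'Great battery, terrible camera.'"}],
)
print(response.content[0].text)  # the model's reply is a list of content blocks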
But there’s a cottage industry of prompt advice that’s pure superstition. “Take a deep breath.” “You’ll be tipped $200.” “Pretend you’re an expert.” Most of these helped a bit on GPT-3.5 in 2023 and do nothing measurable on frontier models in 2026. The four techniques that actually move the metric — zero-shot, few-shot, chain-of-thought, and self-consistency — each have known regimes where they help and known regimes where they hurt. This lesson is the calibrated version, with measurements.
TL;DR
- Zero-shot is the default. Add complexity only when the model fails — every extra technique is more tokens, more cost, more places to fail.
- Few-shot locks in format and style. It can also lock in errors if your examples are wrong; a bad few-shot is worse than zero-shot.
- Chain-of-thought helps on multi-step problems (math, planning) and hurts on simple classification. Not a free lunch.
- Self-consistency = sample N CoT reasonings, take the majority answer. Genuinely better on reasoning tasks; N× cost.
- Structured output (JSON mode, Pydantic, Outlines) is how you build real systems. Always prefer schema-validated output over regex.
Mental model
Each technique is a tradeoff between cost (tokens, latency, $) and task fit. Here’s the order I reach for them: zero-shot first, then few-shot, then chain-of-thought, then self-consistency, stepping up only when the previous level measurably fails.
The four techniques in code
# 1. Zero-shot — the default.
prompt = "Classify the sentiment of this review: 'Great battery, terrible camera.'"
# 2. Few-shot — add 2-5 examples.
prompt = """Review: 'Loved the build, hated the price.' Sentiment: mixed
Review: 'Best phone I've owned.' Sentiment: positive
Review: 'Battery dies in 4 hours.' Sentiment: negative
Review: 'Great battery, terrible camera.' Sentiment:"""
# 3. Chain-of-thought — ask for reasoning before the answer.
prompt = """Q: A train leaves Boston at 60 mph. Another leaves NYC at 80 mph
toward each other. They're 200 miles apart. When do they meet?
Let's think step by step."""
# 4. Self-consistency — sample N reasonings, vote on the answer.
from collections import Counter
answers = [model(prompt, temperature=0.7) for _ in range(8)]  # model() wraps your LLM call
final = Counter(extract_answer(a) for a in answers).most_common(1)[0][0]  # extract_answer() parses the final answer from each reasoning trace; majority wins
Real numbers (from the GSM8K math benchmark, roughly):
| Technique | GPT-4o-mini accuracy | Tokens / Q | Relative cost @ $0.15/1M |
|---|---|---|---|
| Zero-shot | 76% | 200 | 1× |
| 5-shot | 79% | 600 | 3× |
| Zero-shot CoT | 88% | 400 | 2× |
| Self-consistency (n=5) | 92% | 2000 | 10× |
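To turn the multipliers into dollars, here is a quick back-of-envelope, treating the Tokens / Q column as total billable tokens at the table's $0.15-per-million rate (real bills differ, since input and output tokens are priced separately):
PRICE_PER_TOKEN = 0.15 / 1_000_000  # $0.15 per 1M tokens
for name, tokens in [("zero-shot", 200), ("5-shot", 600), ("zero-shot CoT", 400), ("self-consistency n=5", 2000)]:
    per_q = tokens * PRICE_PER_TOKEN
    print(f"{name}: ${per_q:.6f}/question, ${per_q * 1_000_000:,.2f} per 1M questions")
At one question, none of this matters; at millions, the 10× column is the difference between $30 and $300 per million calls.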
CoT adds 12 points over zero-shot for 2× the tokens. Self-consistency adds another 4 points at 5× the cost of CoT; sometimes worth it, often not.
Counterexample. On a simple sentiment classifier, zero-shot CoT can hurt — the model talks itself into wrong answers. CoT is for tasks where the model needs to do reasoning, not where it already knows the answer.
Run it in your browser
A direct comparison you can run right now, no API key needed: the script below is a deterministic local stand-in that prints how each technique's prompt reads. For real model calls, paste each prompt into your provider of choice and compare outputs side-by-side.
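One way to build that stand-in yourself, in a few lines (self-consistency is omitted since it needs real sampled outputs to vote over):
REVIEW = "'Great battery, terrible camera.'"
prompts = {
    "zero-shot": f"Classify the sentiment of this review: {REVIEW}",
    "few-shot": ("Review: 'Loved the build, hated the price.' Sentiment: mixed\n"
                 "Review: 'Best phone I've owned.' Sentiment: positive\n"
                 "Review: 'Battery dies in 4 hours.' Sentiment: negative\n"
                 f"Review: {REVIEW} Sentiment:"),
    "chain-of-thought": (f"Classify the sentiment of this review: {REVIEW}\n"
                         "Think step by step, then give the label."),
}
for name, p in prompts.items():  # deterministic: no model call, just the prompts side by side
    print(f"--- {name} ---\n{p}\n")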
Key takeaways
- Climb the ladder of complexity only when needed. Zero-shot → few-shot → CoT → self-consistency. Each step adds cost.
- CoT is for reasoning, not classification. Test before assuming it helps.
- Few-shot examples must be correct. A wrong example actively poisons the output.
- Always validate structured output with a schema. Don't `json.loads` and pray; see the sketch below.
- Measure on a real eval set. Vibes-based prompt tuning is how teams ship regressions.
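For the schema point, here is a minimal sketch assuming Pydantic v2; the SentimentResult fields and the raw string are illustrative stand-ins, not a fixed API:
from typing import Literal
from pydantic import BaseModel, ValidationError

class SentimentResult(BaseModel):
    label: Literal["positive", "negative", "mixed"]
    confidence: float

raw = '{"label": "mixed", "confidence": 0.82}'  # stand-in for the model's JSON reply
try:
    result = SentimentResult.model_validate_json(raw)
except ValidationError:
    result = None  # retry with the error message appended, or fall back to a safe default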
Go deeper
- Paper: "Chain-of-Thought Prompting Elicits Reasoning in LLMs". The original CoT paper. Short, foundational, still cited daily.
- Paper: "Self-Consistency Improves Chain-of-Thought Reasoning". The math behind why majority-vote CoT works.
- Docs: Anthropic — Prompt Engineering Overview. The most opinionated current guide; written by people who actually train these models.
- Video: Karpathy — Deep Dive into LLMs like ChatGPT. A 3-hour deep dive; the prompting section is the best 30 minutes you can spend on the topic.
- Blog: Prompt Engineering Guide. The most comprehensive open survey. Use as a reference, not a manifesto.