
Prompt Engineering Foundations

When you call client.messages.create(messages=[{"role": "user", "content": "..."}], model="claude-sonnet-4-5") — or the OpenAI-shaped client.chat.completions.create(...) — the entire surface area you have to steer the model is that messages array. No fine-tuning, no infrastructure, no extra dollars. Just words. A good prompt is often the difference between 60% and 90% task accuracy on the same model — the cheapest, fastest lever in your stack.
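To make that concrete, here's a minimal sketch of the Anthropic-shaped call from the paragraph above (the OpenAI-shaped client differs only in method name and response shape; the example prompt is arbitrary):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    messages=[{"role": "user",
               "content": "Classify the sentiment of this review: 'Great battery, terrible camera.'"}],
)
print(response.content[0].text)  # the model's reply; everything you steer, you steer from `messages`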

But there’s a cottage industry of prompt advice that’s pure superstition. “Take a deep breath.” “You’ll be tipped $200.” “Pretend you’re an expert.” Most of these helped a bit on GPT-3.5 in 2023 and do nothing measurable on frontier models in 2026. The four techniques that actually move the metric — zero-shot, few-shot, chain-of-thought, self-consistency — each have known regimes where they help and known regimes where they hurt. This lesson is the calibrated version, with measurements.

TL;DR

  • Zero-shot is the default. Add complexity only when the model fails — every extra technique is more tokens, more cost, more places to fail.
  • Few-shot locks in format and style. It can also lock in errors if your examples are wrong; a bad few-shot is worse than zero-shot.
  • Chain-of-thought helps on multi-step problems (math, planning) and hurts on simple classification. Not a free lunch.
  • Self-consistency = sample N CoT reasonings, take the majority answer. Genuinely better on reasoning tasks; N× cost.
  • Structured output (JSON mode, Pydantic, Outlines) is how you build real systems. Always prefer schema-validated output over regex.

Mental model

Each technique is a tradeoff between cost (tokens, latency, $) and task fit. Here’s the order I reach for them: zero-shot → few-shot → chain-of-thought → self-consistency, climbing a rung only when the cheaper one fails.

The four techniques in code

# 1. Zero-shot — the default.
prompt = "Classify the sentiment of this review: 'Great battery, terrible camera.'"

# 2. Few-shot — add 2-5 examples.
prompt = """Review: 'Loved the build, hated the price.'
Sentiment: mixed

Review: 'Best phone I've owned.'
Sentiment: positive

Review: 'Battery dies in 4 hours.'
Sentiment: negative

Review: 'Great battery, terrible camera.'
Sentiment:"""

# 3. Chain-of-thought — ask for reasoning before the answer.
prompt = """Q: A train leaves Boston at 60 mph. Another leaves NYC at 80 mph toward each other.
They're 200 miles apart. When do they meet?
Let's think step by step."""

# 4. Self-consistency — sample N reasonings, vote on the answer.
answers = [model(prompt, temperature=0.7) for _ in range(8)]
final = most_common([extract_answer(a) for a in answers])
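The #4 sketch leaves model, most_common, and extract_answer undefined. Here is one way to fill them in: call_model is a placeholder you'd wire to your provider's API (e.g. the create() call near the top of this lesson), and the last-number heuristic in extract_answer is an assumption that suits numeric benchmarks like GSM8K.

from collections import Counter
import re

def call_model(prompt: str, temperature: float = 0.7) -> str:
    # Placeholder: wire this up to your provider's completion call.
    raise NotImplementedError

def extract_answer(text: str) -> str:
    # Assumption: the last number in the response is the final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""

def self_consistency(prompt: str, n: int = 8) -> str:
    # Sample n independent reasoning chains at nonzero temperature,
    # then take the majority answer.
    answers = [extract_answer(call_model(prompt, temperature=0.7)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]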

Real numbers (from the GSM8K math benchmark, roughly):

Technique                GPT-4o-mini accuracy   Tokens / Q   Cost @ $0.15/1M
Zero-shot                76%                    200          1×
5-shot                   79%                    600          3×
Zero-shot CoT            88%                    400          2×
Self-consistency (n=5)   92%                    2000         10×

CoT halves the error rate on reasoning (24% wrong drops to 12%). Self-consistency adds another four points at 5× the cost — sometimes worth it, often not.
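To put "sometimes worth it, often not" in dollars, a back-of-envelope from the table's numbers (treating every token at the $0.15/1M input rate, which understates real output-token cost):

PRICE_PER_TOKEN = 0.15 / 1_000_000  # $0.15 per 1M tokens

techniques = {  # name: (accuracy, tokens per query)
    "zero-shot": (0.76, 200),
    "zero-shot CoT": (0.88, 400),
    "self-consistency (n=5)": (0.92, 2000),
}

for name, (acc, tokens) in techniques.items():
    dollars_per_1k = tokens * PRICE_PER_TOKEN * 1000
    print(f"{name}: {acc:.0%} accuracy at ${dollars_per_1k:.2f} per 1k queries")

# CoT buys 12 points for an extra $0.03 per 1k queries;
# self-consistency buys 4 more points for an extra $0.24 per 1k.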

Counterexample. On a simple sentiment classifier, zero-shot CoT can hurt — the model talks itself into wrong answers. CoT is for tasks where the model needs to do reasoning, not where it already knows the answer.

Try it yourself

For a direct comparison, paste each of the four prompts above into your provider of choice and compare the outputs side-by-side.


Quick check

You're building a sentiment classifier for short reviews. Zero-shot scores 87%. You add 'Let's think step by step' — accuracy drops to 82%. What's most likely happening?

Key takeaways

  1. Climb the ladder of complexity only when needed. Zero-shot → few-shot → CoT → self-consistency. Each step adds cost.
  2. CoT is for reasoning, not classification. Test before assuming it helps.
  3. Few-shot examples must be correct. A wrong example actively poisons the output.
  4. Always validate structured output with a schema (see the sketch after this list). Don’t json.loads and pray.
  5. Measure on a real eval set. Vibes-based prompt tuning is how teams ship regressions.
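Takeaway 4 in practice: a minimal sketch with Pydantic v2. The schema, field names, and fallback policy here are illustrative assumptions, not a fixed recipe.

from typing import Literal
from pydantic import BaseModel, ValidationError

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "mixed"]
    confidence: float

def parse_response(raw: str) -> SentimentResult | None:
    # Validate the model's raw JSON against the schema instead of trusting it.
    try:
        return SentimentResult.model_validate_json(raw)
    except ValidationError:
        return None  # retry, fall back, or log; never ship unvalidated output

print(parse_response('{"sentiment": "mixed", "confidence": 0.92}'))
print(parse_response('{"sentiment": "angry"}'))  # schema rejects it: prints None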

Go deeper
