
Prompt Engineering Foundations

When you call client.messages.create(messages=[{"role": "user", "content": "..."}], model="claude-sonnet-4-5") — or the OpenAI-shaped client.chat.completions.create(...) — the entire surface area you have to steer the model is that messages array. No fine-tuning, no infrastructure, no extra dollars. Just words. A good prompt is often the difference between 60% and 90% task accuracy on the same model — the cheapest, fastest lever in your stack.
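To make that concrete, here's a minimal sketch of the Anthropic-shaped call from the paragraph above (the OpenAI-shaped client differs only in method name and response shape; the example prompt is arbitrary):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    messages=[{"role": "user",
               "content": "Classify the sentiment of this review: 'Great battery, terrible camera.'"}],
)
print(response.content[0].text)  # the model's reply; everything you steer, you steer from `messages`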

But there’s a cottage industry of prompt advice that’s pure superstition. “Take a deep breath.” “You’ll be tipped $200.” “Pretend you’re an expert.” Most of these helped a bit on GPT-3.5 in 2023 and do nothing measurable on frontier models in 2026. The four techniques that actually move the metric — zero-shot, few-shot, chain-of-thought, self-consistency — each have known regimes where they help and known regimes where they hurt. This lesson is the calibrated version, with measurements.

TL;DR

  • Zero-shot is the default. Add complexity only when the model fails — every extra technique is more tokens, more cost, more places to fail.
  • Few-shot locks in format and style. It can also lock in errors if your examples are wrong; a bad few-shot is worse than zero-shot.
  • Chain-of-thought helps on multi-step problems (math, planning) and hurts on simple classification. Not a free lunch.
  • Self-consistency = sample N CoT reasonings, take the majority answer. Genuinely better on reasoning tasks; N× cost.
  • Structured output (JSON mode, Pydantic, Outlines) is how you build real systems. Always prefer schema-validated output over regex.

Mental model

Each technique is a tradeoff between cost (tokens, latency, $) and task fit. Here’s the order I reach for them: zero-shot → few-shot → chain-of-thought → self-consistency, climbing a rung only when the cheaper one fails.

The four techniques in code

# 1. Zero-shot — the default.
prompt = "Classify the sentiment of this review: 'Great battery, terrible camera.'"

# 2. Few-shot — add 2-5 examples.
prompt = """Review: 'Loved the build, hated the price.'
Sentiment: mixed

Review: 'Best phone I've owned.'
Sentiment: positive

Review: 'Battery dies in 4 hours.'
Sentiment: negative

Review: 'Great battery, terrible camera.'
Sentiment:"""

# 3. Chain-of-thought — ask for reasoning before the answer.
prompt = """Q: A train leaves Boston at 60 mph. Another leaves NYC at 80 mph toward each other.
They're 200 miles apart. When do they meet?
Let's think step by step."""

# 4. Self-consistency — sample N reasonings, vote on the answer.
answers = [model(prompt, temperature=0.7) for _ in range(8)]
final = most_common([extract_answer(a) for a in answers])
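The #4 sketch leaves model, most_common, and extract_answer undefined. Here is one way to fill them in: call_model is a placeholder you'd wire to your provider's API (e.g. the create() call near the top of this lesson), and the last-number heuristic in extract_answer is an assumption that suits numeric benchmarks like GSM8K.

from collections import Counter
import re

def call_model(prompt: str, temperature: float = 0.7) -> str:
    # Placeholder: wire this up to your provider's completion call.
    raise NotImplementedError

def extract_answer(text: str) -> str:
    # Assumption: the last number in the response is the final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""

def self_consistency(prompt: str, n: int = 8) -> str:
    # Sample n independent reasoning chains at nonzero temperature,
    # then take the majority answer.
    answers = [extract_answer(call_model(prompt, temperature=0.7)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]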

Real numbers (from the GSM8K math benchmark, roughly):

Technique                GPT-4o-mini accuracy   Tokens / Q   Cost @ $0.15/1M
Zero-shot                76%                    200          1×
5-shot                   79%                    600          3×
Zero-shot CoT            88%                    400          2×
Self-consistency (n=5)   92%                    2000         10×

CoT halves the error rate on reasoning (24% wrong drops to 12%). Self-consistency adds another four points at 5× the cost — sometimes worth it, often not.
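To put "sometimes worth it, often not" in dollars, a back-of-envelope from the table's numbers (treating every token at the $0.15/1M input rate, which understates real output-token cost):

PRICE_PER_TOKEN = 0.15 / 1_000_000  # $0.15 per 1M tokens

techniques = {  # name: (accuracy, tokens per query)
    "zero-shot": (0.76, 200),
    "zero-shot CoT": (0.88, 400),
    "self-consistency (n=5)": (0.92, 2000),
}

for name, (acc, tokens) in techniques.items():
    dollars_per_1k = tokens * PRICE_PER_TOKEN * 1000
    print(f"{name}: {acc:.0%} accuracy at ${dollars_per_1k:.2f} per 1k queries")

# CoT buys 12 points for an extra $0.03 per 1k queries;
# self-consistency buys 4 more points for an extra $0.24 per 1k.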

Counterexample. On a simple sentiment classifier, zero-shot CoT can hurt — the model talks itself into wrong answers. CoT is for tasks where the model needs to do reasoning, not where it already knows the answer.

Try it yourself

For a direct comparison, paste each of the four prompts above into your provider of choice and compare the outputs side-by-side.


Quick check

You're building a sentiment classifier for short reviews. Zero-shot scores 87%. You add 'Let's think step by step' — accuracy drops to 82%. What's most likely happening?

Key takeaways

  1. Climb the ladder of complexity only when needed. Zero-shot → few-shot → CoT → self-consistency. Each step adds cost.
  2. CoT is for reasoning, not classification. Test before assuming it helps.
  3. Few-shot examples must be correct. A wrong example actively poisons the output.
  4. Always validate structured output with a schema (see the sketch after this list). Don’t json.loads and pray.
  5. Measure on a real eval set. Vibes-based prompt tuning is how teams ship regressions.
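Takeaway 4 in practice: a minimal sketch with Pydantic v2. The schema, field names, and fallback policy here are illustrative assumptions, not a fixed recipe.

from typing import Literal
from pydantic import BaseModel, ValidationError

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "mixed"]
    confidence: float

def parse_response(raw: str) -> SentimentResult | None:
    # Validate the model's raw JSON against the schema instead of trusting it.
    try:
        return SentimentResult.model_validate_json(raw)
    except ValidationError:
        return None  # retry, fall back, or log; never ship unvalidated output

print(parse_response('{"sentiment": "mixed", "confidence": 0.92}'))
print(parse_response('{"sentiment": "angry"}'))  # schema rejects it: prints None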

Go deeper
