
Cost & Latency

Prereqs: vLLM & SGLang, KV Cache Basics, Chunked Prefill. This lesson is the playbook — the moves you make once you understand the underlying systems.

A customer-support chatbot runs a 70B model on 8 H100s, serves 1M requests a day, and the bill arrives: $8,400/day. The same workload, same model, same hardware, with five software levers pulled in the right order: $350/day. That’s 24× cheaper, no quality regression, no new hardware order, no team rewrite. The levers, in the order you pull them: prefix caching → continuous batching → model routing → quantization → hardware. Each one multiplies the next, which is why the ordering matters: pulling lever 5 (B200s) before lever 1 (prompt template fixes) means buying capacity that better software would have made unnecessary.

This lesson is the canonical 5-step playbook. LLM cost is the single biggest line item for any AI product that’s actually used, and the first instinct — “throw more GPUs at it” — is the most expensive answer. The numbers above are illustrative, but the order of magnitude and the order of operations hold across most production workloads. Latency follows the same logic under different SLOs — whether the target sits in the 200–800 ms band or the 30–80 ms band — and the same five levers move both. Run the playbook top to bottom on any new serving deployment.

TL;DR

  • Production LLM serving has five economic levers, in roughly increasing implementation cost: prefix caching → continuous batching → model routing → quantization → hardware. Each is multiplicative; doing all five well gets you 10–30× cheaper than naive serving.
  • Prefix caching is the cheapest win and most teams leave it on the table. Hit-rate-aware prompt design takes a day and pays off forever.
  • Continuous batching is on by default in vLLM v1, SGLang, TensorRT-LLM. Old TGI / HF transformers servers leave 5× on the floor — switch.
  • Model routing (small model first, escalate on uncertainty) is the highest-leverage lever for tasks where ~70% of queries are easy. Saves 3–8× on the dollar; no quality loss.
  • FP8 / INT4 quantization halves to quarters memory and bandwidth costs. Modern open weights take quantization with negligible quality regression.
  • Hardware is last because it’s a capex decision. Use the calculator below to compare B200 vs MI355X vs H100 for your workload.

Mental model

Each lever, in order, is “the cheapest move I haven’t made yet.” Don’t skip ahead — every lever is a multiplier on every later lever.

Lever 1 — Prefix caching

See Prefix & RadixAttention for the mechanics. The actionable rule:

  • Move all variable bits to the end of the prompt. Timestamps, request IDs, user IDs, A/B feature flags — all at the bottom. Anything that mutates at offset 0 destroys cache sharing.
  • Pin a stable chat template. Tokenizer drift between client and server is the silent killer of hit rate.
  • Track prefix_cache_hit_rate in Grafana. If it’s below 50% on a chat workload, the prompt template is wrong; fix that before doing anything else.

Typical impact on TTFT: 3–10× on agent / chatbot / eval workloads, where ~80% of the prompt is shared.
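The first bullet is easy to get wrong in a template engine, so here is a minimal sketch (hypothetical helper functions; character-level shared prefix as a stand-in for cacheable tokens):

```python
# Hypothetical template helpers. The only rule that matters: the long,
# stable part is byte-identical across requests; per-request bits go last.

def build_prompt_bad(system_prompt, user_msg, request_id, timestamp):
    # Cache-hostile: a timestamp at offset 0 makes every prompt unique,
    # so no two requests share a cached prefix.
    return f"[{timestamp}] [req:{request_id}]\n{system_prompt}\n\nUser: {user_msg}"

def build_prompt_good(system_prompt, user_msg, request_id, timestamp):
    # Cache-friendly: the 3K-token system prompt leads, variable bits trail,
    # so the server can reuse the system prompt's KV cache across requests.
    return f"{system_prompt}\n\nUser: {user_msg}\n[meta ts={timestamp} req={request_id}]"

def shared_prefix_len(a, b):
    # Characters shared from offset 0; a proxy for the cacheable prefix.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

With the good template, two requests that differ only in user message and metadata still share the entire system prompt; with the bad one, they diverge at the first character.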

Lever 2 — Continuous batching

The principle: do not finish one request before starting the next. The scheduler picks up new sequences mid-batch, evicts finished ones, runs one big packed forward pass per step. Combined with chunked prefill, every step is full of useful work.

This is on by default in vLLM v1, SGLang v0.4+, TensorRT-LLM. If you’re on legacy Hugging Face TGI or rolling your own with transformers, you’re paying ~5× more than you need to for decode throughput. Migrate.
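A toy decode-step simulation makes the gap concrete (all numbers hypothetical: 8 slots, 256 requests, random decode lengths; a "step" is one packed forward pass):

```python
import random

def static_batch_steps(lengths, batch_size):
    # Static batching: each batch runs until its longest sequence finishes,
    # so short sequences sit idle waiting for the stragglers.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    # Continuous batching: refill a slot the moment a sequence finishes,
    # so every decode step runs with a full batch until the queue drains.
    queue = list(lengths)
    slots = []
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))
        steps += 1                               # one packed decode step
        slots = [n - 1 for n in slots if n > 1]  # drop finished sequences
    return steps

random.seed(0)
lengths = [random.randint(10, 300) for _ in range(256)]
print(static_batch_steps(lengths, 8), "steps static vs",
      continuous_batch_steps(lengths, 8), "steps continuous")
```

The continuous scheduler takes roughly the theoretical minimum (total tokens ÷ batch size) while the static one pays for the longest sequence in every batch; the ratio grows with the variance of output lengths.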

Lever 3 — Model routing

Fact: most production traffic is easy. ~70% of customer-support questions, ~80% of code-completion calls, ~60% of summarization tasks can be answered by a 7B model that costs 1/30th of a 70B. The hard 20–40% need the big model.

The pattern:

# Pseudocode for the routing decision.
def route(query):
    small_response = small_model.complete(query, max_tokens=64)
    confidence = score(small_response, query)  # log-prob, judge model, retrieval-overlap, etc.
    if confidence > THRESHOLD:
        return small_response             # ~70% of traffic, ~3% of cost
    return big_model.complete(query)      # ~30% of traffic, ~97% of cost — but only when needed

How to score confidence (cheapest → most accurate):

Method | Cost | Accuracy
Token log-prob average | free (already computed) | medium
Self-reported confidence | one extra small-model call | medium-high
Small judge model | one small-judge call | high
Retrieval-overlap score | depends on RAG stack | high (RAG tasks)

Production stacks (RouteLLM, Anyscale, vLLM proxy plugins) wrap this. Custom logic is fine for 90% of cases.

Real-world saving: 3–8× cheaper average cost-per-query with no quality regression on a held-out eval. This is usually the biggest single win after prefix caching.
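A runnable sketch of the cheapest scorer from the table, average token log-prob. The `complete(query, max_tokens=64, logprobs=True)` call returning (text, per-token log-probs) is a hypothetical client interface, not any specific SDK, though most OpenAI-compatible servers expose an equivalent:

```python
import math

def avg_logprob_confidence(token_logprobs):
    # Mean per-token log-probability, mapped back to a (0, 1] scale.
    # Low values mean the small model was unsure somewhere in its answer.
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

THRESHOLD = 0.8  # tune on a held-out eval: sweep the threshold, plot cost vs. quality

def route(query, small_model, big_model):
    # complete() returning (text, per-token logprobs) is assumed here.
    text, logprobs = small_model.complete(query, max_tokens=64, logprobs=True)
    if avg_logprob_confidence(logprobs) > THRESHOLD:
        return text                      # easy query: small-model answer
    text, _ = big_model.complete(query, logprobs=True)
    return text                          # hard query: escalate to the big model
```

A model that is confident token by token (log-probs near 0) scores near 1.0; one that hesitated anywhere drags the average down and triggers escalation.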

Lever 4 — Quantization

The 2026 default stack:

  • Weights: FP8 (E4M3 on H100/B200) or INT4 (AWQ/GPTQ for offline). Weight memory roughly halves (FP8) or quarters (INT4); inference throughput goes up commensurately because decode is bandwidth-bound.
  • KV cache: FP8. Halves the cache. Quality regression is negligible for chat / reasoning workloads.
  • Activations: BF16 or FP8. Weights and KV at FP8 with activations at BF16 is the sweet spot for most production stacks.

Quality regression on standard benchmarks (MMLU, GSM8K, HumanEval): typically under 0.5 points for FP8, 0.5–2 points for INT4. Both are well within the noise of model variation across runs. Validate on your eval — the regression is task-specific and concentrated in long-tail edge cases.

Throughput uplift on H100: ~1.7× going BF16 → FP8 on weights alone, ~2.5× on FP8 weights + FP8 KV. Multiply by your existing batch and the absolute throughput numbers get serious.
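The memory arithmetic behind the lever, as a quick sketch (rule-of-thumb bytes per parameter, ignoring checkpoint overhead; the KV shape uses a Llama-3-70B-like config of 80 layers, 8 KV heads, head dim 128 as an assumed example):

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_billions, dtype):
    # Rule-of-thumb weight memory in GB for a dense model.
    return params_billions * BYTES_PER_PARAM[dtype]

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype):
    # Each layer stores K and V: 2 * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * BYTES_PER_PARAM[dtype]

# 70B weights: 140 GB at BF16, 70 GB at FP8, 35 GB at INT4.
for d in ("bf16", "fp8", "int4"):
    print(f"70B weights at {d}: {weight_gb(70, d):.0f} GB")

# Llama-3-70B-shaped KV cache: FP8 halves the per-token footprint.
print(kv_bytes_per_token(80, 8, 128, "bf16"),
      kv_bytes_per_token(80, 8, 128, "fp8"), "bytes/token")
```

Halving bytes moved per decode step is why throughput rises roughly in step with the memory saving: decode is bandwidth-bound.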

Lever 5 — Hardware

Once the software levers are pulled, hardware is the remaining axis. The honest 2026 comparison: B200 is ~2.5× the FP8 throughput of H100 at ~2.4× the on-demand price, so the per-token cost is roughly the same. The B200 win is consolidation: a 70B + 7B routing pair fits on a single B200 node where you’d previously need two H100 nodes — fewer hops, simpler ops. MI355X matches B200 throughput at meaningfully lower price, with 288 GB HBM that fits even a 405B with comfortable headroom. TPU v6 is competitive but locked to GCP and JAX.

The lever 5 trade isn’t “B200 makes my tokens cheaper” (it doesn’t, much, on demand); it’s “the right hardware reduces fleet count by 30–50%, which compounds the routing and quantization wins because both small and big models now share KV pools and node-local caches.” Confirm on your workload using the calculator below — vary params, tokens, MFU, GPU count.

Training cost · latency · energy (canon: 2026-04-26)

Hardware | Specs | Time | Cost | Energy | $/hr per GPU
NVIDIA B200 | 4,500 TF/s · 192 GB | 11.7 days | $1.73M | 288.1 MWh | $6.00
AMD MI355X | 5,000 TF/s · 288 GB | 10.5 days (fastest) | $1.43M | 363.0 MWh | $5.50
Google TPU v6 (Trillium) | 4,500 TF/s · 96 GB | 11.7 days | $1.15M (cheapest) | 201.6 MWh | $4.00
NVIDIA H100 (SXM5) | 1,979 TF/s · 80 GB | 26.7 days | $1.64M | 458.5 MWh | $2.50

A quick read of the table for typical 2026 frontier-class training: B200 wins on time, MI355X wins on memory headroom and often on $-per-FLOP, H100 only wins on availability. For inference workloads, the same shape applies with even bigger MI355X advantages because of HBM size.
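The "roughly the same per-token cost" claim is two lines of arithmetic, using the table's $/hr figures and the ~2.5× FP8 throughput ratio cited above:

```python
# $/hr per GPU from the table; throughput ratio from the comparison above.
h100_price, b200_price = 2.50, 6.00
throughput_ratio = 2.5                 # B200 vs. H100, FP8 decode
cost_ratio = (b200_price / h100_price) / throughput_ratio
print(f"B200 per-token cost vs H100: {cost_ratio:.2f}x")  # 0.96x: a wash
```

Per-token dollars barely move; what moves is node count, which is the real lever-5 win described above.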

Stacking the levers — a worked example

A team serves a customer-support chatbot. 1M requests/day, 3K-token system prompt, 500-token average user turn, 150-token average response. Naive baseline on a single 8×H100 node, BF16, vLLM v0.6: $8,400/day.

Step | Change | Multiplier | New cost
0 | Baseline | | $8,400/day
1 | Move timestamps to end of prompt → 80% prefix cache hit | ÷ 2.4 | $3,500/day
2 | Migrate to vLLM v1 (chunked prefill on, default settings) | ÷ 1.5 | $2,330/day
3 | Route 70% of queries to a 7B small model first | ÷ 2.8 | $830/day
4 | FP8 weights + FP8 KV on the 70B fallback | ÷ 1.6 | $520/day
5 | Consolidate: small + big on one B200 node (8 GPUs → 5) | ÷ 1.5 | $350/day
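A quick arithmetic check that the per-step dividers reproduce the headline numbers:

```python
dividers = [2.4, 1.5, 2.8, 1.6, 1.5]  # levers 1-5 from the table

total = 1.0
for d in dividers:
    total *= d
print(f"{total:.1f}x")              # 24.2x: the ~24x headline
print(f"${8400 / total:,.0f}/day")  # $347/day, rounding to the table's $350
```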

Total: ~20–30× cheaper end-to-end. Numbers are illustrative but the order of magnitude and order of operations hold across most production workloads. Step 5 is roughly cost-neutral on a per-token basis — its real benefit is fleet size and operational simplicity, which compound when you start hosting the small + big routed pair together.

The non-obvious bit is how each step enables the next: prefix caching reduces the prefill load so the smaller batch produced by routing still fills the GPU; FP8 means the 70B fallback is fast enough to keep escalation latency in budget; B200’s extra memory means you can run the small + big models on the same node without splitting fleets.

Run it in your browser — try the levers on your numbers

(Interactive widget in the original: an editable Python 5-lever calculator. Plug in your daily traffic; see the stacking effect. Ctrl+Enter to run.)
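A plain-Python sketch of such a 5-lever calculator, seeded with the illustrative multipliers from the worked example (swap in measured values for your workload):

```python
# Illustrative multipliers from the worked example; replace with measurements.
LEVERS = [
    ("prefix caching (80% hit rate)", 2.4),
    ("continuous batching (vLLM v1)", 1.5),
    ("model routing (70% to a 7B)",   2.8),
    ("FP8 weights + FP8 KV cache",    1.6),
    ("hardware consolidation (B200)", 1.5),
]

def run(baseline_per_day, levers=LEVERS):
    # Walk the levers in order, dividing cost at each step.
    cost = baseline_per_day
    print(f"{'baseline':34s} ${cost:>8,.0f}/day")
    for name, divider in levers:
        cost /= divider
        print(f"{name:34s} ${cost:>8,.0f}/day  (÷ {divider})")
    print(f"total: {baseline_per_day / cost:.1f}x cheaper")
    return cost

run(8_400)  # the worked example's baseline
```

Reorder the list and the final number doesn't change, but the intermediate costs do; the ordering argument is about implementation effort and what you learn at each step, not the algebra.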

The shape is the lesson: each lever multiplies. The full stack is 20–30× cheaper than the baseline. No team that gets this right does it by accident.

Quick check

Fill in the blank: the first lever to pull on a fresh serving deployment, before any others: ______
Hint: it's free; you already paid for the GPU; it just requires getting the prompt template right.

Quick check

A team has a 70B chatbot at $5,000/day. They've enabled FP8 + B200 and are still over budget. Which step did they likely skip?

Key takeaways

  1. Five levers, in order: prefix caching → continuous batching → routing → quantization → hardware. Each multiplies the next.
  2. Prefix caching is free; track the hit rate. If it’s below 50%, the prompt template is the bug, not the GPU.
  3. Continuous batching = vLLM v1 default. If you’re not using it, migrate.
  4. Routing is the single biggest multiplier most teams miss. Small model first, big model on escalation.
  5. Hardware is last. Don’t go shopping for B200s until the software levers are pulled.

Go deeper
