Cost & Latency
Prereqs: vLLM & SGLang, KV Cache Basics, Chunked Prefill. This lesson is the playbook — the moves you make once you understand the underlying systems.
A customer-support chatbot runs a 70B model on 8 H100s, serves 1M requests a day, and the bill arrives: $8,400/day. Run the playbook below and the same workload costs $350/day. That's 24× cheaper, no quality regression, no new hardware order, no team rewrite. The levers, in the order you pull them: prefix caching → continuous batching → model routing → quantization → hardware. Each one multiplies the next, which is why the ordering matters: pulling lever 5 (B200s) before lever 1 (prompt template fixes) means buying capacity that better software would have made unnecessary. LLM cost is the single biggest line item for any AI product that's actually used, and the first instinct, "throw more GPUs at it", is the most expensive answer.
This lesson is the canonical 5-step playbook. The numbers above are illustrative, but the order of magnitude and the order of operations hold across most production workloads. Latency follows the same logic on different SLOs (TTFT in the 200–800 ms band, per-token decode latency in the 30–80 ms band), and the same five levers move both. Run the playbook top to bottom on any new serving deployment.
TL;DR
- Production LLM serving has five economic levers, in roughly increasing implementation cost: prefix caching → continuous batching → model routing → quantization → hardware. Each is multiplicative; doing all five well gets you 10–30× cheaper than naive serving.
- Prefix caching is the cheapest win and most teams leave it on the table. Hit-rate-aware prompt design takes a day and pays off forever.
- Continuous batching is on by default in vLLM v1, SGLang, TensorRT-LLM. Old TGI / HF transformers servers leave 5× on the floor — switch.
- Model routing (small model first, escalate on uncertainty) is the highest-leverage lever for tasks where ~70% of queries are easy. Saves 3–8× on the dollar; no quality loss.
- FP8 / INT4 quantization halves to quarters memory and bandwidth costs. Modern open weights take quantization with negligible quality regression.
- Hardware is last because it’s a capex decision. Use the calculator below to compare B200 vs MI355X vs H100 for your workload.
Mental model
Think of the five levers as a chain of boxes. Each box, in order, is "the cheapest move I haven't made yet." Don't skip ahead: every box is a multiplier on every later box.
Lever 1 — Prefix caching
See Prefix & RadixAttention for the mechanics. The actionable rule:
- Move all variable bits to the end of the prompt. Timestamps, request IDs, user IDs, A/B feature flags — all at the bottom. Anything that mutates at offset 0 destroys cache sharing.
- Pin a stable chat template. Tokenizer drift between client and server is the silent killer of hit rate.
- Track `prefix_cache_hit_rate` in Grafana. If it's below 50% on a chat workload, the prompt template is wrong; fix that before doing anything else.
Typical impact on TTFT: 3–10× on agent / chatbot / eval workloads, where ~80% of the prompt is shared.
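A minimal sketch of hit-rate-aware prompt assembly. The builder and its fields are hypothetical; the structure (stable prefix first, volatile fields last) is the whole trick:

```python
# Hypothetical prompt builder. The point is ordering, not the exact fields:
# everything static (system prompt, tools, few-shot examples) goes first so
# every request shares one long cacheable prefix; everything volatile
# (timestamp, request ID, user turn) goes last.

STATIC_PREFIX = (
    "You are a support agent for AcmeCo...\n"  # system prompt (~3K tokens)
    "## Tools\n...\n"                          # tool definitions
    "## Examples\n...\n"                       # few-shot examples
)

def build_prompt(user_turn: str, request_id: str, now: str) -> str:
    # BAD:  f"[{now}] [{request_id}] {STATIC_PREFIX} {user_turn}"
    #       -> bytes differ at offset 0 on every request, ~0% cache hits.
    # GOOD: identical bytes until the final lines, so the KV cache for the
    #       static prefix is computed once and reused by every request.
    return f"{STATIC_PREFIX}\n[meta: {now} {request_id}]\nUser: {user_turn}\n"
```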
Lever 2 — Continuous batching
The principle: do not finish one request before starting the next. The scheduler picks up new sequences mid-batch, evicts finished ones, runs one big packed forward pass per step. Combined with chunked prefill, every step is full of useful work.
This is on by default in vLLM v1, SGLang v0.4+, and TensorRT-LLM. If you're on legacy Hugging Face TGI or rolling your own with transformers, you're paying ~5× more than you need to for decode throughput. Migrate.
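There is nothing to implement on a modern stack; the scheduler is the engine. A minimal sketch using vLLM's offline API (model ID and prompts are placeholders): submit the whole workload and let the scheduler pack the steps.

```python
from vllm import LLM, SamplingParams

# Placeholder model ID and synthetic traffic.
prompts = [f"Summarize support ticket #{i}: ..." for i in range(1000)]
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=150)

# Hand the engine everything at once. Its scheduler interleaves prefill and
# decode across requests at every step (continuous batching + chunked
# prefill); finished sequences leave the batch and free KV blocks immediately.
outputs = llm.generate(prompts, params)
```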
Lever 3 — Model routing
Fact: most production traffic is easy. ~70% of customer-support questions, ~80% of code-completion calls, ~60% of summarization tasks can be answered by a 7B model that costs 1/30th of a 70B. The hard 20–40% need the big model.
The pattern:
```python
# Pseudocode for the routing decision: `small_model`, `big_model`, and
# `score` stand in for your serving clients and confidence scorer.
THRESHOLD = 0.8  # tune on a held-out eval

def route(query):
    small_response = small_model.complete(query, max_tokens=64)
    # log-prob, judge model, retrieval-overlap, etc. (see the table below)
    confidence = score(small_response, query)
    if confidence > THRESHOLD:
        return small_response             # ~70% of traffic, ~3% of cost
    return big_model.complete(query)      # ~30% of traffic, ~97% of cost, but only when needed
```
How to score confidence (cheapest → most accurate):
| Method | Cost | Accuracy |
|---|---|---|
| Token log-prob average | free (already computed) | medium |
| Self-reported confidence | one extra small-model call | medium-high |
| Small judge model | one small-judge call | high |
| Retrieval-overlap score | depends on RAG stack | high (RAG tasks) |
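The first row is effectively free because the serving engine already returns per-token log-probs with the completion. A sketch of that scorer, assuming you requested logprobs from the server; it is pure arithmetic, with no extra model call:

```python
import math

def avg_logprob_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp(mean log-prob), in (0, 1].

    token_logprobs comes back with the completion when you request
    logprobs from the server; no extra call is needed.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# avg_logprob_confidence([-0.05, -0.2, -0.1])  ~= 0.89 -> return small answer
# avg_logprob_confidence([-1.8, -2.5, -3.1])   ~= 0.08 -> escalate to big model
```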
Production stacks (RouteLLM, Anyscale, vLLM proxy plugins) wrap this. Custom logic is fine for 90% of cases.
Real-world saving: 3–8× cheaper average cost-per-query with no quality regression on a held-out eval. This is usually the biggest single win after prefix caching.
Lever 4 — Quantization
The 2026 default stack:
- Weights: FP8 (E4M3 on H100/B200) or INT4 (AWQ/GPTQ for offline). Weight memory roughly halves (FP8) or quarters (INT4); inference throughput goes up commensurately because decode is bandwidth-bound.
- KV cache: FP8. Halves the cache. Quality regression is negligible for chat / reasoning workloads.
- Activations: BF16 or FP8. Weights and KV at FP8 with activations at BF16 is the sweet spot for most production stacks.
Quality regression on standard benchmarks (MMLU, GSM8K, HumanEval): typically under 0.5 points for FP8, 0.5–2 points for INT4. Both are well within the noise of model variation across runs. Validate on your eval — the regression is task-specific and concentrated in long-tail edge cases.
Throughput uplift on H100: ~1.7× going BF16 → FP8 on weights alone, ~2.5× on FP8 weights + FP8 KV. Multiply by your existing batch and the absolute throughput numbers get serious.
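On vLLM this lever is configuration rather than code. A minimal sketch of the offline API, assuming your build supports dynamic FP8; see the vLLM quantization docs in Go deeper for the exact knobs and accepted values on your version:

```python
from vllm import LLM

# FP8 weights plus FP8 KV cache, tensor-parallel across 4 GPUs.
# The server CLI exposes the same knobs (--quantization, --kv-cache-dtype).
# Model ID is a placeholder; validate on your own eval before shipping.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    quantization="fp8",
    kv_cache_dtype="fp8",
    tensor_parallel_size=4,
)
```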
Lever 5 — Hardware
Once the software levers are pulled, hardware is the remaining axis. The honest 2026 comparison: B200 is ~2.5× the FP8 throughput of H100 at ~2.4× the on-demand price, so the per-token cost is roughly the same. The B200 win is consolidation: a 70B + 7B routing pair fits on a single B200 node where you’d previously need two H100 nodes — fewer hops, simpler ops. MI355X matches B200 throughput at meaningfully lower price, with 288 GB HBM that fits even a 405B with comfortable headroom. TPU v6 is competitive but locked to GCP and JAX.
The lever 5 trade isn’t “B200 makes my tokens cheaper” (it doesn’t, much, on demand); it’s “the right hardware reduces fleet count by 30–50%, which compounds the routing and quantization wins because both small and big models now share KV pools and node-local caches.” Confirm on your workload using the calculator below — vary params, tokens, MFU, GPU count.
| Hardware | Training time | Cost | Energy | $/hr · GPU |
|---|---|---|---|---|
| NVIDIA B200 | 11.7 days | $1.73M | 288.1 MWh | $6.00 |
| AMD MI355X | 10.5 days (fastest) | $1.43M | 363.0 MWh | $5.50 |
| Google TPU v6 (Trillium) | 11.7 days | $1.15M (cheapest) | 201.6 MWh | $4.00 |
| NVIDIA H100 (SXM5) | 26.7 days | $1.64M | 458.5 MWh | $2.50 |
A quick read of the table for typical 2026 frontier-class training: B200 wins on time, MI355X wins on memory headroom and often on $-per-FLOP, H100 only wins on availability. For inference workloads, the same shape applies with even bigger MI355X advantages because of HBM size.
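The interactive calculator doesn't embed here, but its arithmetic is a few lines. A sketch under the usual dense-transformer assumption (training FLOPs ≈ 6 × params × tokens); the spec values below are placeholder inputs that happen to land near the H100 row above, not the table's actual configuration:

```python
def estimate(params, tokens, gpus, peak_flops, mfu, price_hr, watts):
    """Back-of-envelope training time/cost/energy for one hardware choice."""
    total_flops = 6 * params * tokens                  # dense-transformer rule of thumb
    seconds = total_flops / (gpus * peak_flops * mfu)  # sustained = peak * MFU
    hours = seconds / 3600
    return {
        "days": hours / 24,
        "cost_usd": hours * gpus * price_hr,
        "energy_mwh": hours * gpus * watts / 1e6,
    }

# Placeholder spec: 70B params, 4.5T tokens, 1024 H100s at 40% MFU,
# dense FP8 peak ~1.98 PFLOPS, ~700 W per GPU.
print(estimate(params=70e9, tokens=4.5e12, gpus=1024,
               peak_flops=1.98e15, mfu=0.40, price_hr=2.50, watts=700))
# -> roughly 27 days, ~$1.66M, ~460 MWh: close to the H100 row above.
```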
Stacking the levers — a worked example
A team serves a customer-support chatbot. 1M requests/day, 3K-token system prompt, 500-token average user turn, 150-token average response. Naive baseline on a single 8×H100 node, BF16, vLLM v0.6: $8,400/day.
| Step | Change | Multiplier | New cost |
|---|---|---|---|
| 0 | Baseline | 1× | $8,400/day |
| 1 | Move timestamps to end of prompt → 80% prefix cache hit | ÷ 2.4 | $3,500/day |
| 2 | Migrate to vLLM v1 (chunked prefill on, default settings) | ÷ 1.5 | $2,330/day |
| 3 | Route 70% of queries to a 7B small model first | ÷ 2.8 | $830/day |
| 4 | FP8 weights + FP8 KV on the 70B fallback | ÷ 1.6 | $520/day |
| 5 | Consolidate: small + big on one B200 node (8 GPUs → 5) | ÷ 1.5 | $350/day |
Total: ~20–30× cheaper end-to-end. Numbers are illustrative but the order of magnitude and order of operations hold across most production workloads. Step 5 is roughly cost-neutral on a per-token basis — its real benefit is fleet size and operational simplicity, which compound when you start hosting the small + big routed pair together.
The non-obvious bit is how each step enables the next: prefix caching reduces the prefill load so the smaller batch produced by routing still fills the GPU; FP8 means the 70B fallback is fast enough to keep escalation latency in budget; B200’s extra memory means you can run the small + big models on the same node without splitting fleets.
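The stacking is literal division; a few lines make the compounding concrete (dividers copied from the table above):

```python
baseline = 8400  # $/day
dividers = {
    "prefix caching": 2.4,
    "vLLM v1 migration": 1.5,
    "model routing": 2.8,
    "FP8 weights + KV": 1.6,
    "B200 consolidation": 1.5,
}

cost = baseline
for lever, d in dividers.items():
    cost /= d
    print(f"after {lever:<20} ${cost:,.0f}/day")
# Final: ~$347/day, i.e. ~24x cheaper. Reordering the levers changes the
# intermediate rows but not the product; what the order changes is how much
# capacity you buy before software would have made it unnecessary.
```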
Try the levers on your numbers
The shape is the lesson: each lever multiplies. The full stack is 20–30× cheaper than the baseline. No team that gets this right does it by accident.
Key takeaways
- Five levers, in order: prefix caching → continuous batching → routing → quantization → hardware. Each multiplies the next.
- Prefix caching is free; track the hit rate. If it’s below 50%, the prompt template is the bug, not the GPU.
- Continuous batching = vLLM v1 default. If you’re not using it, migrate.
- Routing is the single biggest multiplier most teams miss. Small model first, big model on escalation.
- Hardware is last. Don’t go shopping for B200s until the software levers are pulled.
Go deeper
- [Paper] RouteLLM: Learning to Route LLMs with Preference Data. The systematic study of model routing; Section 4 has the cost-quality Pareto curves.
- [Paper] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. INT4 weight quantization; the recipe most production stacks ship.
- [Paper] AWQ: Activation-aware Weight Quantization. The AWQ recipe; better at low-bit than GPTQ for many models.
- [Blog] vLLM v0.6 Performance Update. Real numbers on the throughput effects of the various levers, on real models.
- [Blog] Anyscale: LLM Routing in Production. Authoritative production case study on routing economics.
- [Docs] vLLM: Quantization. Production knobs for FP8, INT4, AWQ, GPTQ; the exact CLI flags.
- [Repo] lm-sys/RouteLLM. Reference router implementation; read `routellm/routers/` for the routing models.