FP8 Inference
For most of GPU history, “quantization” meant INT8 — pack each value into a signed 8-bit integer with one scale factor per tensor, hope the dynamic range was tight enough, and accept the calibration mess that came with it. INT8 worked; nobody loved it.
Then Hopper shipped FP8. Specifically two variants — E4M3 and E5M2 — that look like tiny floating-point numbers (1 sign bit, a few exponent bits, a few mantissa bits) instead of integers. The exponent gives them dynamic range without per-tensor calibration tricks, the tensor cores execute them natively at 2× the BF16 throughput, and the accuracy is meaningfully better than INT8 at the same 8 bits.
In 2026, FP8 is the default quantization for production inference. On H100, going from BF16 to FP8 weights + KV cache halves your HBM traffic, gets you ~1.7× decode throughput, and typically costs less than 0.5 points of MMLU. This lesson covers what FP8 actually is, when to pick E4M3 vs E5M2, and how per-tensor scaling makes the trick work.
TL;DR
- FP8 is an 8-bit floating-point format (sign + exponent + mantissa). Two variants: E4M3 (4 exponent bits, 3 mantissa, range ~ ±448, used for weights/activations) and E5M2 (5 exponent bits, 2 mantissa, range ~ ±57344, used for gradients during training).
- Replaced INT8 as the standard quantization format on H100+ because: it preserves exponent flexibility (no per-tensor calibration mess), tensor cores natively support it, and accuracy is meaningfully better at the same bit-width.
- Hopper added FP8 tensor cores; Blackwell added FP4 / FP6 alongside. All modern AI compilers (CUTLASS, Triton, vLLM, FlashAttention-3) support FP8 weights + activations + KV cache.
- Quality regression from BF16 → FP8 is typically under 0.5 points on standard benchmarks (MMLU, GSM8K). KV-cache FP8 quantization adds ~0 measurable regression for most models.
- Per-tensor scaling is the magic — the format itself has limited range, but each tensor (or each block) carries a scale factor so the actual values can span any magnitude. The scale is stored in higher precision (BF16 or FP32).
Mental model
The scale carries the dynamic range; FP8 carries the values. Together you get most of FP16’s precision at half the bytes.
What FP8 actually is, bit-for-bit
A float has three pieces: 1 sign bit, E exponent bits (with bias), M mantissa bits. Examples:
| Format | Sign | Exp | Mantissa | Bias | Max value | Min positive |
|---|---|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | 127 | ~3.4e38 | ~1.2e-38 |
| FP16 | 1 | 5 | 10 | 15 | 65504 | ~6e-5 |
| BF16 | 1 | 8 | 7 | 127 | ~3.4e38 | ~1.2e-38 |
| E4M3 (FN) | 1 | 4 | 3 | 7 | 448 | 2^-9 ≈ 0.002 |
| E5M2 | 1 | 5 | 2 | 15 | 57344 | 2^-16 |
Two important caveats on the FP8 row:
- E4M3 here is “E4M3FN” — the OCP / NVIDIA variant where the all-ones exponent encodes finite values + a single NaN, rather than IEEE-style ±∞. That extra encoding slot is why max = 448 (not 240, which is the strict-IEEE max). The H100, B200, and MI300 all use E4M3FN.
- E5M2 follows IEEE rules: bias 15, ±∞ at all-ones exponent, max finite = 57344. So max = 57344 excludes the inf encoding; the all-ones encoding itself is reserved.
Key observations:
- E4M3 has higher precision (3 mantissa bits → 8 representable values per power of 2) but tiny range (±448).
- E5M2 has wider range (±57344, like FP16) but lower precision (only 2 mantissa bits → 4 values per power of 2).
- Both are way coarser than BF16. Direct conversion of an arbitrary BF16 tensor would clip and round badly.
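You can verify the table rows by enumerating every encoding. This is a from-scratch sketch; the helper name `fp8_values` is ours, not a library API:

```python
def fp8_values(exp_bits, man_bits, bias, fn=False):
    """Enumerate the finite values of an FP8 format (sign / exponent / mantissa)."""
    vals = set()
    for sign in (1.0, -1.0):
        for e in range(2 ** exp_bits):
            for m in range(2 ** man_bits):
                if e == 2 ** exp_bits - 1:
                    if not fn:
                        continue                # IEEE-style: all-ones exponent = inf/NaN
                    if m == 2 ** man_bits - 1:
                        continue                # E4M3FN: only mantissa=all-ones is NaN
                if e == 0:                      # subnormal: no implicit leading 1
                    v = sign * m * 2.0 ** (1 - bias - man_bits)
                else:                           # normal: implicit leading 1
                    v = sign * (1 + m / 2 ** man_bits) * 2.0 ** (e - bias)
                vals.add(v)
    return sorted(vals)

e4m3 = fp8_values(4, 3, bias=7, fn=True)
e5m2 = fp8_values(5, 2, bias=15)
print(max(e4m3), max(e5m2))   # 448.0 57344.0
print(len(e4m3))              # 253 distinct finite values out of 256 codes
```

The E4M3FN count works out to 253: 256 codes, minus two NaN encodings, with +0 and −0 collapsing to one value.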
The fix: per-tensor (or per-block) scaling.
Per-tensor scaling
Store one BF16 (or FP32) scalar S per tensor. The actual value of a weight is S * fp8_value. To pick S:
- Find the absolute max value M across the tensor.
- Set S = M / 448 (so the max value, when divided by S, fits in E4M3's range).
- For each weight w, store quantize(w / S) as FP8.
Memory cost: 1 scalar per tensor, negligible. Quality cost: each value is rounded to its nearest E4M3 representable, but because the dynamic range now matches the tensor’s, no clipping happens and most rounding error is below 0.5%.
For very large tensors, block scaling (one scalar per 32 or 128 elements) trades a bit more memory for tighter range matching → better accuracy. CUTLASS 4 and DeepSeek-V3 both use this.
# Sketch of FP8 quantization
import torch

def quantize_fp8_e4m3(w_bf16, block_size=128):
    # w_bf16 shape: (out, in)
    out, inn = w_bf16.shape
    n_blocks = inn // block_size
    w_blocks = w_bf16.view(out, n_blocks, block_size)
    abs_max = w_blocks.abs().amax(dim=-1, keepdim=True)
    scales = abs_max / 448.0                              # shape: (out, n_blocks, 1)
    w_fp8 = (w_blocks / scales).to(torch.float8_e4m3fn)   # cast to FP8
    return w_fp8, scales

# At inference: dequantize via scales, then multiply.
def matmul_fp8(x_fp8, x_scales, w_fp8, w_scales):
    # On Hopper this is one fused wgmma; here we sketch the math in FP32
    # (FP8 tensors don't support @ directly). The blocked (out, n_blocks, block)
    # layout is flattened back to 2-D before the matmul.
    x = (x_fp8.to(torch.float32) * x_scales).flatten(1)
    w = (w_fp8.to(torch.float32) * w_scales).flatten(1)
    return x @ w.T

The Python above is a math sketch — it’s not what runs at inference. In production, FP8 matmul executes as a single hardware instruction issued from CUDA C++:
// Inside a Hopper kernel — one warpgroup matmul on FP8 inputs, FP32 accumulator.
// FP8 wgmma tiles use K=32 (m64n128k32 here). Scales applied via a fused epilogue.
asm volatile(
    "wgmma.mma_async.sync.aligned.m64n128k32.f32"
    ".e4m3.e4m3 "                  // FP8 E4M3 inputs
    "{%0,%1,%2,%3,%4,%5,%6,%7},"   // output FP32 fragments (subset shown)
    " %8, %9, p, 1, 1;\n"          // operand descriptors + scale flags
    : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3]),
      "+f"(d[4]), "+f"(d[5]), "+f"(d[6]), "+f"(d[7])
    : "l"(a_desc), "l"(b_desc), "r"(p_state)
);

CUTLASS / CUTE wraps this in a typed C++ template; you rarely write the inline PTX yourself. But it’s worth seeing once — every FP8 inference kernel ultimately reduces to instructions of this shape.
When to use E4M3 vs E5M2
| Scenario | Format | Why |
|---|---|---|
| Inference weights | E4M3 | Higher precision; weight magnitudes are bounded |
| Inference activations | E4M3 | Same |
| Inference KV cache | E4M3 (or E5M2 for very long context) | Tradeoff |
| Training: gradients (post-AdamW) | E5M2 | Wider range; gradients can have outliers |
| Training: optimizer state | BF16 / FP32 | Don’t quantize the running averages |
Frontier training (DeepSeek-V3, Llama-4 reportedly) does mixed FP8 training: weights and activations in E4M3, gradients in E5M2, master weights in FP32, AdamW state in BF16. This is the recipe for FP8 training without losing accuracy at scale.
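A toy illustration of the range tradeoff behind this table. The `saturate` helper is hypothetical and models only the clipping, not the mantissa rounding:

```python
# Why gradients get E5M2: an outlier past E4M3's ±448 range saturates hard,
# while E5M2's ±57344 range still represents it (up to rounding).
E4M3_MAX, E5M2_MAX = 448.0, 57344.0

def saturate(x, fmt_max):
    # Clip x into the format's representable range.
    return max(-fmt_max, min(fmt_max, x))

grad_outlier = 12000.0
print(saturate(grad_outlier, E4M3_MAX))  # 448.0 (over 96% of the magnitude lost)
print(saturate(grad_outlier, E5M2_MAX))  # 12000.0
```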
KV cache in FP8
The KV cache is a separate quantization opportunity. A 70B-class model (80 layers, 8 KV heads with GQA, head dim 128) at 32K context with batch 8 holds ~80 GB of KV in BF16; halve that with FP8 and you fit twice as many concurrent users in the same VRAM.
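The arithmetic behind that figure, as a quick sizing sketch. The config numbers (80 layers, 8 GQA KV heads, head dim 128, batch 8) are our assumptions for a Llama-3.1-70B-style model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # K and V each store layers * kv_heads * head_dim elements per token.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

bf16 = kv_cache_bytes(80, 8, 128, 32_768, 8, bytes_per_elem=2)
fp8  = kv_cache_bytes(80, 8, 128, 32_768, 8, bytes_per_elem=1)
print(bf16 / 2**30, fp8 / 2**30)   # 80.0 GiB vs 40.0 GiB
```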
vLLM v1 supports kv_cache_dtype="fp8" as a one-flag switch:
from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", kv_cache_dtype="fp8")

Quality regression on chat / reasoning benchmarks: typically negligible. Long-context retrieval may regress slightly more. Worth A/B-testing on your eval but rarely a blocker.
What it costs to actually run FP8 GEMMs
The peak FP8 throughput numbers from spec sheets:
| GPU | BF16 TFLOPs | FP8 TFLOPs | Effective speedup (decode) |
|---|---|---|---|
| A100 | 312 | n/a | n/a |
| H100 | 989 | 1979 | ~1.7× |
| H200 | 989 | 1979 | ~2.0× (more HBM) |
| B200 | 2250 | 4500 | ~2.5× |
| MI300X | 1307 | 2614 | ~1.7× |
| MI355X | 2400 | 5000 | ~2.4× |
Decode is bandwidth-bound, not compute-bound, so the effective speedup is closer to “bytes saved by FP8” (~2×) than the headline 2.5×. Either way the win is real.
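The bandwidth argument in numbers. This sketch assumes a 70B-parameter model whose weights are read once per decoded token, and H100-class HBM bandwidth of ~3.35 TB/s (our assumption):

```python
params = 70e9
hbm_bandwidth = 3.35e12   # bytes/s, assumed H100 SXM HBM3

times = {}
for name, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    # Lower bound: one full sweep over the weights per token.
    t = params * bytes_per_param / hbm_bandwidth
    times[name] = t
    print(f"{name}: {t * 1e3:.1f} ms/token floor, ~{1 / t:.0f} tok/s")
```

Halving the bytes halves the floor, which is why the effective decode speedup tracks ~2× rather than the headline compute ratio.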
When FP8 hurts
Three failure modes to know:
- Outlier-heavy activations. Some channels (especially in older models, or fine-tuned ones) have outlier activations that overflow E4M3’s range. The fix is rotation quantization (next lesson) or per-channel scaling.
- Long-tail accuracy. Aggregate benchmarks barely move; specific niche tasks (e.g., low-resource translation, detailed math reasoning) sometimes regress 1–2 points. Validate on your eval.
- Old hardware. Pre-Hopper GPUs don’t have FP8 tensor cores; you’d be doing FP8 → BF16 dequantization in software, which is slower than just using BF16. Not a useful target.
For most production workloads in 2026, FP8 is the default and these are edge cases. But knowing they exist lets you debug the rare regression.
Run it in your browser — FP8 quantize / dequantize
You’ll typically see per-tensor scaling at ~3–5% mean relative error and per-block scaling at ~1–2%. That extra precision is what lets FP8 maintain quality at scale.
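Since the interactive widget isn't reproduced here, the same experiment can be run offline. `snap_e4m3` below is an idealized round-to-nearest model of E4M3FN, not a library call, and exact error numbers depend on the weight distribution (a plain Gaussian without outliers understates the per-tensor vs per-block gap):

```python
import math
import random

def snap_e4m3(x):
    # Idealized E4M3FN round-to-nearest: 3 mantissa bits = 8 steps per binade,
    # min normal exponent -6 (subnormal step 2^-9), saturation at 448.
    if x == 0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 448.0)
    e = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (e - 3)
    return sign * round(mag / step) * step

def mean_rel_err(vals, block_size):
    errs = []
    for i in range(0, len(vals), block_size):
        block = vals[i:i + block_size]
        scale = max(abs(v) for v in block) / 448.0
        for v in block:
            q = snap_e4m3(v / scale) * scale   # quantize, then dequantize
            errs.append(abs(q - v) / (abs(v) + 1e-12))
    return sum(errs) / len(errs)

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(8192)]
print(mean_rel_err(w, 8192))   # per-tensor scaling
print(mean_rel_err(w, 128))    # per-block scaling
```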
Key takeaways
- FP8 = floating-point at 8 bits, with per-tensor or per-block scaling. E4M3 for weights/activations, E5M2 for gradients.
- Tensor-core native on Hopper+ — the wgmma FP8 variants (E4M3/E5M2 inputs, FP32 accumulate) are the production instructions.
- ~1.7–2× decode throughput from BF16 → FP8 weights + KV; quality regression typically below 0.5 pts on standard benchmarks.
- Block scaling beats per-tensor. Frontier models use block_size=32 or 128.
- Watch out for outlier-heavy activations — fix with rotation quantization or skip FP8 on those layers.
Go deeper
- Paper: "FP8 Formats for Deep Learning". The original NVIDIA FP8 paper. Section 3 has the format design rationale.
- Paper: "DeepSeek-V3 Technical Report". Section 3 details the production FP8 training recipe at 671B / 14.8T-token scale. The clearest worked example.
- Docs: NVIDIA Transformer Engine — FP8 Documentation. How NVIDIA does it in production. The sections on delayed scaling and the autocast API are essential.
- Docs: vLLM — FP8 Inference. Production knobs and observed throughput on real models.
- Blog: PyTorch — FP8 on Hopper. PyTorch's native FP8 support. Includes the float8_e4m3fn dtype and the autocast hooks.
- Repo: NVIDIA/TransformerEngine. Reference impl. `transformer_engine/pytorch/fp8.py` is the autocast layer; the `cpp/` side has the C++ kernels.
Why this matters
FP8 is the single biggest practical compression win in 2024–2026 production inference. Halve the bytes per weight (vs BF16), halve the HBM traffic, ~1.7× decode throughput on H100 — at near-zero quality cost. Every modern serving stack (vLLM v1, SGLang, TensorRT-LLM) defaults to FP8 weights + FP8 KV when the hardware supports it. If you’re not using FP8 in 2026 inference, you’re paying ~2× more than necessary. Knowing how it works under the hood is what lets you tune the per-tensor scaling, debug occasional accuracy regressions, and pick between E4M3 and E5M2.