
FP8 Inference

For most of GPU history, “quantization” meant INT8 — pack each value into a signed 8-bit integer with one scale factor per tensor, hope the dynamic range was tight enough, and accept the calibration mess that came with it. INT8 worked; nobody loved it.

Then Hopper shipped FP8. Specifically two variants — E4M3 and E5M2 — that look like tiny floating-point numbers (1 sign bit, a few exponent bits, a few mantissa bits) instead of integers. The exponent gives them dynamic range without per-tensor calibration tricks, the tensor cores execute them natively at 2× the BF16 throughput, and the accuracy is meaningfully better than INT8 at the same 8 bits.

Today FP8 is the default production-inference quantization. On H100, moving weights and the KV cache from BF16 to FP8 halves your HBM traffic, gets you ~1.7× decode throughput, and typically costs less than 0.5 points of MMLU. This lesson covers what FP8 actually is, when to pick E4M3 vs E5M2, and how per-tensor scaling makes the trick work.

TL;DR

  • FP8 is an 8-bit floating-point format (sign + exponent + mantissa). Two variants: E4M3 (4 exponent bits, 3 mantissa, range ~ ±448, used for weights/activations) and E5M2 (5 exponent bits, 2 mantissa, range ~ ±57344, used for gradients during training).
  • Replaced INT8 as the standard quantization format on H100+ because the exponent provides per-value dynamic range (no calibration mess), tensor cores support it natively, and accuracy is meaningfully better at the same bit-width.
  • Hopper added FP8 tensor cores; Blackwell added FP4 / FP6 alongside. All modern kernel and serving stacks (CUTLASS, Triton, vLLM, FlashAttention-3) support FP8 weights, activations, and KV cache.
  • Quality regression from BF16 → FP8 is typically under 0.5 points on standard benchmarks (MMLU, GSM8K). KV-cache FP8 quantization adds ~0 measurable regression for most models.
  • Per-tensor scaling is the magic — the format itself has limited range, but each tensor (or each block) carries a scale factor so the actual values can span any magnitude. The scale is stored in higher precision (BF16 or FP32).

Why this matters

FP8 is the single biggest practical compression win in 2024–2026 production inference. Halve the bytes per weight (vs BF16), halve the HBM traffic, ~1.7× decode throughput on H100 — at near-zero quality cost. Every modern serving stack (vLLM v1, SGLang, TensorRT-LLM) defaults to FP8 weights + FP8 KV when the hardware supports it. If you’re not using FP8 in 2026 inference, you’re paying ~2× more than necessary. Knowing how it works under the hood is what lets you tune the per-tensor scaling, debug occasional accuracy regressions, and pick between E4M3 and E5M2.

Mental model

The scale carries the dynamic range; FP8 carries the values. Together you get most of FP16’s precision at half the bytes.

Concrete walkthrough

What FP8 actually is, bit-for-bit

A float has three pieces: 1 sign bit, E exponent bits (with bias), M mantissa bits. Examples:

| Format    | Sign | Exp | Mantissa | Bias | Max value | Min positive |
|-----------|------|-----|----------|------|-----------|--------------|
| FP32      | 1    | 8   | 23       | 127  | ~3.4e38   | ~1.2e-38     |
| FP16      | 1    | 5   | 10       | 15   | 65504     | ~6e-5        |
| BF16      | 1    | 8   | 7        | 127  | ~3.4e38   | ~1.2e-38     |
| E4M3 (FN) | 1    | 4   | 3        | 7    | 448       | 2^-9 ≈ 0.002 |
| E5M2      | 1    | 5   | 2        | 15   | 57344     | 2^-16        |

Two important caveats on the FP8 rows:

  • E4M3 here is “E4M3FN”, the OCP / NVIDIA variant where the all-ones exponent encodes finite values plus a single NaN rather than IEEE-style ±∞. That extra usable exponent slot is why max = 448 (not 240, which is the strict-IEEE max). H100 and B200 use E4M3FN; AMD's MI300 actually uses the closely related E4M3FNUZ variant (bias 8, no infinities, NaN at negative zero).
  • E5M2 follows IEEE rules: bias 15, max finite = 57344, and the all-ones exponent reserved for ±∞ and NaN, exactly like a truncated FP16.
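To make the table concrete, here is a minimal sketch that decodes a raw E4M3FN byte by hand (the helper name and example bytes are ours for illustration, not a library API):

# Decode a raw E4M3FN byte: bias 7, subnormals at exp == 0,
# all-ones exponent still finite unless mantissa == 0b111 (the single NaN).
def decode_e4m3fn(byte: int) -> float:
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                     # the one NaN encoding
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6     # subnormal: 0.mmm × 2^(1-7)
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

print(decode_e4m3fn(0b0_1111_110))   # 448.0, the max finite value
print(decode_e4m3fn(0b0_0000_001))   # 2^-9 ≈ 0.00195, the min positive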

Key observations:

  • E4M3 has higher precision (3 mantissa bits → 8 representable values per power of 2) but tiny range (±448).
  • E5M2 has wider range (±57344, like FP16) but lower precision (only 2 mantissa bits → 4 values per power of 2).
  • Both are way coarser than BF16. Direct conversion of an arbitrary BF16 tensor would clip and round badly.

The fix: per-tensor (or per-block) scaling.

Per-tensor scaling

Store one BF16 (or FP32) scalar S per tensor. The actual value of a weight is S * fp8_value. To pick S:

  1. Find the absolute max value M across the tensor.
  2. Set S = M / 448 (so the max value, when divided by S, fits in E4M3’s range).
  3. For each weight w, store quantize(w / S) as FP8.

Memory cost: one scalar per tensor, negligible. Quality cost: each value rounds to its nearest E4M3 representable, but because the scale now matches the tensor’s dynamic range, nothing clips, and the rounding error is bounded by half a ULP (at most ~6% relative for E4M3, a few percent on average).
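A minimal per-tensor round trip in PyTorch makes the recipe concrete (torch.float8_e4m3fn is PyTorch’s E4M3FN dtype; the Gaussian tensor is just an illustration):

import torch

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
S = w.abs().max() / 448.0                     # steps 1 and 2: abs-max → scale
w_fp8 = (w / S).to(torch.float8_e4m3fn)       # step 3: quantize
w_hat = w_fp8.to(torch.bfloat16) * S          # dequantize

rel_err = ((w - w_hat).abs() / w.abs().clamp_min(1e-3)).float().mean()
print(rel_err)                                # typically a few percent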

For very large tensors, block scaling (one scalar per 32 or 128 elements) trades a bit more memory for tighter range matching → better accuracy. CUTLASS 4 and DeepSeek-V3 both use this.

# Sketch of FP8 block quantization (PyTorch)
import torch

def quantize_fp8_e4m3(w_bf16, block_size=128):
    # w_bf16 shape: (out_features, in_features)
    out_f, in_f = w_bf16.shape
    n_blocks = in_f // block_size
    w_blocks = w_bf16.view(out_f, n_blocks, block_size)
    abs_max = w_blocks.abs().amax(dim=-1, keepdim=True)
    scales = abs_max / 448.0                              # (out_f, n_blocks, 1), in BF16
    w_fp8 = (w_blocks / scales).to(torch.float8_e4m3fn)   # cast to FP8
    return w_fp8, scales

# At inference: dequantize via scales, then multiply.
# On Hopper this is one fused wgmma with scales in the epilogue; here, just the math.
def matmul_fp8(x_fp8, x_scales, w_fp8, w_scales):
    x = (x_fp8.to(torch.bfloat16) * x_scales).flatten(-2)   # back to (tokens, in)
    w = (w_fp8.to(torch.bfloat16) * w_scales).flatten(-2)   # back to (out, in)
    return x @ w.T

The Python above is a math sketch — it’s not what runs at inference. In production, FP8 matmul executes as a single hardware instruction issued from CUDA C++:

// Inside a Hopper kernel: one warpgroup matmul on FP8 inputs, FP32 accumulator.
// 64×128×32 tile per instruction (FP8 wgmma uses K=32); scales are applied in a
// fused epilogue. Sketch only: the real m64n128k32 form names 64 output registers.
asm volatile(
    "wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e4m3 "
    "{%0,%1,%2,%3,%4,%5,%6,%7, /* ... */},"  // FP32 output fragments (abbreviated)
    " %8, %9, p, 1, 1;\n"                    // A/B descriptors, scale-D predicate, scale-A/B
    : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3]),
      "+f"(d[4]), "+f"(d[5]), "+f"(d[6]), "+f"(d[7])
    : "l"(a_desc), "l"(b_desc), "r"(p_state));

CUTLASS / CuTe wraps this in a typed C++ template; you rarely write the inline PTX yourself. But it’s worth seeing once: every FP8 inference kernel ultimately reduces to instructions of this shape.

When to use E4M3 vs E5M2

| Scenario | Format | Why |
|----------|--------|-----|
| Inference weights | E4M3 | Higher precision; weight magnitudes are bounded |
| Inference activations | E4M3 | Same |
| Inference KV cache | E4M3 (or E5M2 for very long context) | Precision vs. range tradeoff |
| Training: gradients | E5M2 | Wider range; gradients can have outliers |
| Training: optimizer state | BF16 / FP32 | Don’t quantize the running averages |

Frontier training (DeepSeek-V3, Llama-4 reportedly) does mixed FP8 training: weights and activations in E4M3, gradients in E5M2, master weights in FP32, AdamW state in BF16. This is the recipe for FP8 training without losing accuracy at scale.
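As a sketch, that assignment looks like this (a hypothetical config dict, not any framework’s actual API):

# Mixed FP8 training recipe, mirroring the assignment described above.
FP8_TRAINING_PRECISIONS = {
    "weights":        "float8_e4m3fn",   # E4M3 GEMM operands in fwd/bwd
    "activations":    "float8_e4m3fn",
    "gradients":      "float8_e5m2",     # wider range absorbs outliers
    "master_weights": "float32",         # updates accumulate in full precision
    "adamw_moments":  "bfloat16",        # optimizer state stays un-quantized
}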

KV cache in FP8

The KV cache is a separate quantization opportunity. A Llama-3.1-70B-class model (80 layers, 8 GQA KV heads, head dim 128) at 32K context with batch 16 carries ~170 GB of KV in BF16; halve that with FP8 and you fit twice as many concurrent sequences in the same VRAM.
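The arithmetic behind that number, using the Llama-3.1-70B config above:

layers, kv_heads, head_dim = 80, 8, 128     # Llama-3.1-70B (GQA)
tokens = 32_768 * 16                        # 32K context × batch 16

elems_per_token = 2 * layers * kv_heads * head_dim   # K and V together
print(elems_per_token * 2 * tokens / 1e9)   # BF16: ≈ 172 GB
print(elems_per_token * 1 * tokens / 1e9)   # FP8:  ≈ 86 GB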

vLLM v1 supports kv_cache_dtype="fp8" as a one-flag switch:

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", kv_cache_dtype="fp8")

Quality regression on chat / reasoning benchmarks: typically negligible. Long-context retrieval may regress slightly more. Worth A/B-testing on your eval but rarely a blocker.

What it costs to actually run FP8 GEMMs

The peak FP8 throughput numbers from spec sheets:

| GPU | BF16 TFLOPs | FP8 TFLOPs | Effective speedup (decode) |
|--------|------|------|------------------------|
| A100   | 312  | n/a  | n/a                    |
| H100   | 989  | 1979 | ~1.7×                  |
| H200   | 989  | 1979 | ~2.0× (more HBM bandwidth) |
| B200   | 2250 | 4500 | ~2.5×                  |
| MI300X | 1307 | 2614 | ~1.7×                  |
| MI355X | 2400 | 5000 | ~2.4×                  |

Decode is bandwidth-bound, not compute-bound, so the effective speedup is closer to “bytes saved by FP8” (~2×) than the headline 2.5×. Either way the win is real.
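A back-of-envelope traffic model shows why (the byte counts below are purely illustrative):

# Per-token HBM traffic during decode for a 70B-class model (illustrative).
weights_gb = 140     # BF16 weights read once per token (70B × 2 bytes)
kv_gb      = 10      # KV cache read at long context
other_gb   = 8       # embeddings, norms, buffers that stay in BF16

bf16 = weights_gb + kv_gb + other_gb
fp8  = weights_gb / 2 + kv_gb / 2 + other_gb    # only the quantized parts halve
print(bf16 / fp8)    # ≈ 1.9×, before kernel overheads shave it further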

When FP8 hurts

Three failure modes to know:

  1. Outlier-heavy activations. Some channels (especially in older models, or fine-tuned ones) have outlier activations that overflow E4M3’s range. The fix is rotation quantization (next lesson) or per-channel scaling; a quick diagnostic sketch follows this list.
  2. Long-tail accuracy. Aggregate benchmarks barely move; specific niche tasks (e.g., low-resource translation, detailed math reasoning) sometimes regress 1–2 points. Validate on your eval.
  3. Old hardware. Pre-Hopper GPUs don’t have FP8 tensor cores; you’d be doing FP8 → BF16 dequantization in software, which is slower than just using BF16. Not a useful target.
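Here is a minimal sketch of such a diagnostic (the injected outlier and the “large ratio” reading are illustrative):

import torch

# Flag outlier channels: compare each channel's abs-max to the typical channel.
# A large ratio means one per-tensor scale would crush most values' precision.
def outlier_ratio(x: torch.Tensor) -> float:
    per_channel = x.abs().amax(dim=0)          # (hidden,) abs-max per channel
    return (per_channel.max() / per_channel.median()).item()

x = torch.randn(1024, 4096)
x[:, 42] *= 100.0                              # inject one outlier channel
print(outlier_ratio(x))                        # >> 10 → per-channel scales or rotation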

For most production workloads in 2026, FP8 is the default and these are edge cases. But knowing they exist lets you debug the rare regression.

Run it in your browser — FP8 quantize / dequantize

Python (editable): approximate E4M3 quantization with per-tensor scaling. See the round-trip error. Ctrl+Enter to run.

You’ll typically see per-tensor scaling at ~3–5% mean relative error and per-block scaling at ~1–2%. That extra precision is what lets FP8 maintain quality at scale.

Quick check

Fill in the blank
The two FP8 variants on H100 — E4M3 for weights/activations, E5M2 for ___
Why does E5M2 have wider range than E4M3?
Quick check
A team migrates 70B inference from BF16 to FP8 on H100, expecting 2× decode throughput. They see only 1.6×. The most likely reason:

Key takeaways

  1. FP8 = floating-point at 8 bits, with per-tensor or per-block scaling. E4M3 for weights/activations, E5M2 for gradients.
  2. Tensor-core native on Hopper and later; wgmma.mma_async with E4M3 inputs and an FP32 accumulator is the production instruction.
  3. ~1.7–2× decode throughput from BF16 → FP8 weights + KV; quality regression typically below 0.5 pts on standard benchmarks.
  4. Block scaling beats per-tensor. Frontier models use block_size=32 or 128.
  5. Watch out for outlier-heavy activations — fix with rotation quantization or skip FP8 on those layers.

Go deeper