
MXFP4 / NVFP4

Prereqs: FP8 Inference, INT4 / AWQ / GPTQ. The microscaling story builds directly on per-block FP8 and INT4 ideas.

INT4 quantization is a software trick: pack two weights per byte, store a scale per group, write a fast unpack kernel, hope the recipe (AWQ or GPTQ) recovers the accuracy. It works, but the GPU has no idea you did any of it — every multiply still happens in FP16 after a software dequant.
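For contrast, a minimal sketch of that software path, assuming a symmetric INT4 layout with two weights per byte and one FP16 scale per 128-element group (the helper is illustrative, not any particular library's kernel):

import numpy as np

# Hypothetical INT4 weight-only dequant: the GPU never sees "4-bit math".
# We unpack nibbles, sign-extend, apply the per-group scale, and hand FP16
# weights to an ordinary matmul.
def dequant_int4(packed: np.ndarray, scales: np.ndarray, group: int = 128) -> np.ndarray:
    lo = (packed & 0x0F).astype(np.int16)
    hi = ((packed >> 4) & 0x0F).astype(np.int16)
    q = np.stack([lo, hi], axis=-1).reshape(-1)   # two weights per byte, interleaved
    q = np.where(q > 7, q - 16, q)                # sign-extend to [-8, 7]
    w = q.reshape(-1, group) * scales[:, None]    # one scale per group
    return w.reshape(-1).astype(np.float16)       # the matmul then runs in FP16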

MXFP4, NVFP4, and their wider cousins MXFP6 and MXFP8 flip that. They’re 4-bit floating-point values with a tiny per-block scale, defined by an OCP industry standard, and Blackwell’s tensor cores consume them natively. No software unpack. The mma instruction takes FP4 weights and FP16 activations and produces an FP16 accumulator at full tensor-core throughput. On B200 that’s roughly 2× the FP8 rate.

The bet: block-FP4 + hardware support beats INT4 + AWQ/GPTQ in both speed and quality, and the field is converging on it. NVIDIA, AMD (MXFP4 in MI355X), and the OCP spec community are aligned. Knowing what FP4 actually is now puts you ahead of the 2026–2027 pivot.

TL;DR

  • MXFP4 (Microscaling FP4) is an OCP-standardized 4-bit floating-point format with per-block scale factors. Each 32-element block has its own E8M0 (8-bit exponent) scale; the elements themselves are 4-bit floats (E2M1: 1 sign, 2 exp, 1 mantissa).
  • NVFP4 is NVIDIA’s variant — same idea, slightly different scale factor (E4M3 instead of E8M0), block size 16 instead of 32. Higher accuracy at slightly higher metadata cost.
  • Blackwell tensor cores natively support these formats. MXFP4/NVFP4 weights × FP16 activations → FP16 accumulator at full tensor-core throughput. No software unpack-and-dequantize.
  • Quality vs INT4 AWQ/GPTQ: comparable on most benchmarks; better on long-tail outliers because the floating-point representation handles wide dynamic range that INT4 has to scale for.
  • Why this matters for 2026–2027: if Blackwell tensor cores are 2× faster on FP4 than FP8, and quality holds, FP4 becomes the default inference format. INT4 may become legacy.

Mental model

The scale is per-block (not per-tensor); the block size is small (16–32 elements); and crucially the math at runtime is one tensor-core instruction, not a software unpack.

What’s in 4 bits, exactly

Each FP4 element (E2M1):

Sign (1 bit) | Exp (2 bits)   | Mantissa (1 bit) | Representable values
0 / 1        | 00, 01, 10, 11 | 0 / 1            | 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6

That’s it: 16 encodings, 15 distinct values (E2M1 has no Inf or NaN; +0 and -0 collapse to the same value). Per-block, every weight quantizes to one of these.
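To make the encoding concrete, here is a tiny decoder for all 16 E2M1 code points (the function name is mine, for illustration only):

# E2M1: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
def decode_e2m1(code: int) -> float:
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        return sign * man * 0.5                         # subnormal: 0 or ±0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)  # normal values up to ±6

print(sorted({decode_e2m1(c) for c in range(16)}))
# 15 distinct values: -6, -4, -3, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3, 4, 6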

The block scale is then multiplied in:

  • MXFP4: scale is 2^k for k ∈ [-127, 127]. Stored as 8 bits (an E8M0 value: a bare biased exponent, no sign or mantissa bits, with one encoding reserved for NaN). Per 32-element block.
  • NVFP4: scale is an arbitrary E4M3 value (range ±448). Stored as 8 bits. Per 16-element block.
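For concreteness, a rough decode of the two scale encodings (a sketch based on the standard E8M0 / E4M3 definitions, not any library's API):

# MXFP4 scale byte: E8M0 is a bare biased exponent (bias 127) -> value = 2^(byte - 127).
def decode_e8m0(byte: int) -> float:
    return float("nan") if byte == 0xFF else 2.0 ** (byte - 127)

# NVFP4 scale byte: an ordinary FP8 E4M3 value (bias 7, max magnitude 448).
def decode_e4m3(byte: int) -> float:
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                       # E4M3 reserves S.1111.111 for NaN; no Inf
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6     # subnormals
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)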

Per-block metadata cost:

  • MXFP4: 1 byte / 32 elements = 0.25 bits per element extra (4.25 bits/weight effective)
  • NVFP4: 1 byte / 16 elements = 0.5 bits per element extra (4.5 bits/weight effective)

For a 70B model:

  • MXFP4: 70B × 4.25 / 8 = ~37 GB
  • NVFP4: 70B × 4.5 / 8 = ~39 GB

Compare INT4 with group_size=128 and a BF16 scale per group: 4.125 effective bits/weight, ~36 GB. Comparable size; much better speed on hardware that consumes FP4 natively.
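The arithmetic above in one place (a back-of-the-envelope helper; weights only, ignoring embeddings, KV cache, and any layers kept in higher precision):

def weight_gb(params_billion: float, elem_bits: float, scale_bits: float, block: int) -> float:
    # effective bits/weight = element bits + amortized scale bits
    bits_per_weight = elem_bits + scale_bits / block
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(70, 4, 8, 32))     # MXFP4:                   ~37.2 GB
print(weight_gb(70, 4, 8, 16))     # NVFP4:                   ~39.4 GB
print(weight_gb(70, 4, 16, 128))   # INT4 g128 + BF16 scale:  ~36.1 GB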

Why per-block, why FP-not-INT

Two big design choices:

Per-block scaling (vs per-tensor or per-channel) is what makes 4-bit math work without calibration data. Each block adapts to its local distribution: a 32-element chunk of a normally distributed weight matrix has an absolute-max value that varies from chunk to chunk, and a per-block scale captures that variation cheaply. INT4 with group_size=128 does the same thing, just over a larger group and with an integer rather than floating-point element representation.

Floating-point not integer: a 4-bit FP element captures wider dynamic range within the block than a 4-bit INT does. INT4 elements are evenly-spaced in [-8, 7]; FP4 elements are log-spaced (powers of 2 with 1 mantissa bit per binade). For weight distributions with heavy tails — which is most of them — log-spaced is a better match.
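A quick way to see the spacing difference at unit scale (the lists simply restate the two grids):

int4_levels = list(range(-8, 8))                      # 16 evenly spaced steps of 1
fp4_levels = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]      # positive E2M1 magnitudes
print([round(b - a, 1) for a, b in zip(fp4_levels, fp4_levels[1:])])
# [0.5, 0.5, 0.5, 1.0, 1.0, 2.0]: the gap doubles each binade, fine near zero, coarse in the tail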

The combined effect: MXFP4 / NVFP4 deliver near-INT4 size with near-FP8 accuracy, especially on long-tail benchmarks where INT4 AWQ/GPTQ tend to drop a fraction of a point that FP4 doesn’t.

Hardware support

GPU    | FP4 tensor-core support    | TFLOPS (FP4)
H100   | No                         | n/a
H200   | No                         | n/a
B200   | Yes (5th-gen tensor cores) | ~9000
GB200  | Yes                        | ~10000
MI300X | No                         | n/a
MI355X | Yes (matrix cores)         | ~10000

On B200 / MI355X, FP4 is 2× faster than FP8 at the tensor-core level. Combined with halved HBM traffic (4-bit weights are half the size of FP8 weights), end-to-end decode throughput approaches the headline 2× ratio.

Pre-Blackwell hardware can still use MXFP4 for storage (smaller models on disk and in HBM), but inference compute happens by dequantizing to FP8/FP16 first. That removes the speedup.

What the kernel actually issues

On Blackwell, FP4 inference is a single tensor-core instruction per tile. The warpgroup MMA (wgmma) family gained FP4 input variants with the 5th-gen tensor core; CUTLASS 4 wraps them in a typed C++ template. The raw PTX looks roughly like this:

// Blackwell warpgroup matmul on FP4 inputs, FP16 accumulator.
// 64 × 128 × 32 tile per instruction. Per-block scales are fed in
// alongside the operand descriptors — the tensor core applies them.
asm volatile(
    "wgmma.mma_async.sync.aligned.m64n128k32.f16"
    ".f4e2m1.f4e2m1 "                 // FP4 E2M1 inputs
    "{%0,%1,%2,%3,%4,%5,%6,%7},"      // 8 output FP16 fragments
    " %8, %9, %10, %11, p, 1, 1;\n"   // operand desc + per-block scales
    : "+r"(d[0]), "+r"(d[1]), "+r"(d[2]), "+r"(d[3]),
      "+r"(d[4]), "+r"(d[5]), "+r"(d[6]), "+r"(d[7])
    : "l"(a_desc), "l"(b_desc),
      "l"(a_scales_desc), "l"(b_scales_desc),
      "r"(p_state)
);

Compare to the FP8 instruction in the FP8 lesson: same shape, narrower elements, separate scale-descriptor operands. The whole point of the format is that the tensor core consumes the scales — there’s no software dequant in the inner loop.

You don’t write that PTX yourself. CUTLASS’s Sm100 collective MMA templates emit it for you. But it’s worth seeing the instruction once: every modern FP4 inference kernel reduces to a series of wgmma instructions of this shape.

Calling it from Python

The user surface is one config flag:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    quantization="mxfp4",  # or "nvfp4" on Blackwell
)
# Under the hood: vLLM converts weights to MXFP4, stores per-block scales,
# uses Sm100 (Blackwell) kernels that take FP4 weights + FP16 activations.

That’s the whole serving change. The vLLM weight loader handles the BF16 → MXFP4 conversion (or loads pre-quantized checkpoints from Hugging Face); the kernel selection is automatic based on the GPU SM version.

For training, FP4 is not yet a stable production target — the rounding is too aggressive for gradient updates. Some 2025 research (FP4 training papers) shows it’s possible with carefully-tuned recipes, but as of 2026 it’s not the default.

MXFP4 vs NVFP4 vs MXFP6 vs MXFP8

The microscaling family has multiple bit widths:

Format | Bits/elem | Block size | Effective bits/weight | Use case
MXFP4  | 4         | 32         | ~4.25                 | Aggressive inference
NVFP4  | 4         | 16         | ~4.5                  | Higher-quality 4-bit
MXFP6  | 6         | 32         | ~6.25                 | Quality-preferred low-bit
MXFP8  | 8         | 32         | ~8.25                 | Production training (E4M3 + per-block scale)

MXFP8 is essentially “FP8 with per-block scaling instead of per-tensor”, the same idea behind the fine-grained FP8 scaling DeepSeek-V3 uses for training stability. MXFP6 is a niche quality-preferred option; MXFP4 / NVFP4 are the production inference formats for 2025+.

Where the gotchas are

  1. Toolchain maturity. MXFP4 / NVFP4 support is younger than FP8. vLLM v1, SGLang, and TensorRT-LLM all support it as of 2025; ExecuTorch and IREE have it landed but with more rough edges. Triton-AMD’s MXFP4 support arrived in early 2025.
  2. Block-size mismatch issues. Some kernels assume a specific block size; mixing MXFP4 (block=32) and NVFP4 (block=16) data in the same model breaks them. Pin one or the other across all layers.
  3. Quality on small models. Sub-3B models tend to lose accuracy faster with 4-bit anything. Stick to FP8 for tiny models.

Run it in your browser — MXFP4 vs INT4 simulator

Quantize a synthetic weight tensor to MXFP4 (block-FP) and to INT4 (per-group), then compare reconstruction quality.
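The editable cell isn't reproduced here, but a standalone sketch of the same experiment could look like this (quantize_mxfp4, quantize_int4, and the heavy-tailed synthetic weights are illustrative choices, not the page's exact code):

import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])    # E2M1 magnitudes

def quantize_mxfp4(w: np.ndarray, block: int = 32) -> np.ndarray:
    w = w.reshape(-1, block)
    absmax = np.abs(w).max(axis=1, keepdims=True)
    absmax[absmax == 0] = 1.0
    # Power-of-two block scale (E8M0-style); the block absmax can clip slightly.
    scale = 2.0 ** (np.floor(np.log2(absmax)) - 2)
    grid = np.concatenate([-FP4_GRID[::-1], FP4_GRID])            # signed FP4 value grid
    idx = np.abs((w / scale)[..., None] - grid).argmin(axis=-1)   # round to nearest FP4 value
    return (grid[idx] * scale).reshape(-1)

def quantize_int4(w: np.ndarray, group: int = 128) -> np.ndarray:
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0            # symmetric INT4, round-to-nearest
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -8, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_t(df=4, size=1 << 16).astype(np.float32)         # heavy-tailed synthetic weights

for name, fn in [("MXFP4 (block=32)", quantize_mxfp4), ("INT4  (group=128)", quantize_int4)]:
    err = np.abs(fn(w) - w)
    print(f"{name}: mean abs err {err.mean():.4f}, max abs err {err.max():.4f}")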

You’ll typically see MXFP4 deliver lower mean error at similar bit-width, especially on outlier-heavy weights. That’s the production case for the format.

Quick check

  • Fill in the blank: the block size MXFP4 uses (per the OCP microscaling spec) is ___. (Hint: a power of 2, smaller than INT4's typical group size.)
  • A team running 70B inference on B200 wonders whether to use FP8 or MXFP4. What is the most relevant tradeoff?

Key takeaways

  1. MXFP4 / NVFP4 = 4-bit float + per-block scale. OCP-standardized; blocks of 32 (MXFP4) or 16 (NVFP4).
  2. Blackwell + MI355X tensor cores native. ~2× FP8 throughput on the new silicon; H100/MI300X store-only.
  3. Comparable size to INT4, comparable or better quality, especially on outlier-heavy distributions.
  4. 2025–2027 inflection. Expect FP4 to become the default inference format on Blackwell-class GPUs.
  5. Training in FP4 is research-only as of 2026. Stick to FP8 / MXFP8 for production training.

Go deeper
