Formats

Bit-level reference for every low-precision format breccia handles. This is the document to consult when implementing a new kernel, debugging a checkpoint, or wiring breccia into a vendor library.

For recipes (which formats compose with which scale layouts), see recipes.md. For kernels (what runs at compute time), see kernels.md.

FP8 E4M3 (Open Compute "E4M3FN")

bit:    7 | 6 5 4 3 | 2 1 0
        S |    E    |   M
Field      Bits   Bias   Notes
Sign       1             Signed
Exponent   4      7
Mantissa   3

  • Normal: value = (-1)^S * 2^(E - 7) * (1 + M / 8) for E ∈ [1, 15], excluding the NaN pattern (E = 15, M = 7)
  • Subnormal: value = (-1)^S * 2^-6 * (M / 8) for E = 0
  • NaN: bytes 0b01111111 and 0b11111111 (E = 15, M = 7, either sign). No Infinity.
  • Max normal: 448.0 (S=0, E=15, M=6), i.e. 2^8 * 1.75
  • Min subnormal: 2^-9 ≈ 1.95 × 10^-3

256 possible bytes → 254 finite encodings (+0 and -0 counted separately) + 2 NaN codes (one for each sign).
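
The field rules above can be checked by decoding all 256 bytes directly. The following is a minimal sketch in plain Python; `decode_e4m3_byte` is a hypothetical helper written for this illustration, not a breccia function.

```python
import math

def decode_e4m3_byte(b: int) -> float:
    """Decode one E4M3 byte from its S/E/M fields (illustrative helper only)."""
    sign = -1.0 if (b >> 7) & 1 else 1.0
    exp = (b >> 3) & 0xF
    man = b & 0x7
    if exp == 15 and man == 7:
        return math.nan                      # the only NaN pattern; there is no Infinity
    if exp == 0:
        return sign * 2.0 ** -6 * (man / 8)  # subnormal: no implicit leading 1
    return sign * 2.0 ** (exp - 7) * (1 + man / 8)

table = [decode_e4m3_byte(b) for b in range(256)]
nan_codes = [b for b, v in enumerate(table) if math.isnan(v)]
finite = [v for v in table if not math.isnan(v)]

print(nan_codes)                                     # [127, 255]
print(len(finite))                                   # 254
print(max(finite), min(v for v in finite if v > 0))  # 448.0 0.001953125
```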

Helpers in breccia._formats:

encode_e4m3(x: np.ndarray) -> uint8 ndarray   # round-to-nearest, saturating
decode_e4m3(b: np.ndarray) -> float32 ndarray # via 256-entry LUT
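
The signatures above state the behaviour (LUT decode, round-to-nearest saturating encode) but not the implementation. One way such helpers could work is sketched below. The names `build_e4m3_lut`, `decode_e4m3_sketch`, and `encode_e4m3_sketch` are hypothetical, the nearest-value search is brute force for clarity rather than speed, and the tie-breaking and NaN handling are assumptions, since this document does not specify them.

```python
import numpy as np

def build_e4m3_lut() -> np.ndarray:
    """Build a 256-entry float32 decode table from the E4M3 field rules."""
    b = np.arange(256, dtype=np.int32)
    sign = np.where(b & 0x80, -1.0, 1.0)
    exp = (b >> 3) & 0xF
    man = b & 0x7
    normal = np.ldexp(1.0 + man / 8.0, exp - 7)
    subnormal = (man / 8.0) * 2.0 ** -6
    lut = sign * np.where(exp == 0, subnormal, normal)
    lut[(exp == 15) & (man == 7)] = np.nan
    return lut.astype(np.float32)

E4M3_LUT = build_e4m3_lut()

def decode_e4m3_sketch(b: np.ndarray) -> np.ndarray:
    """Decode uint8 codes by table lookup."""
    return E4M3_LUT[np.asarray(b, dtype=np.uint8)]

def encode_e4m3_sketch(x: np.ndarray) -> np.ndarray:
    """Encode to the nearest E4M3 value, saturating at +/-448.

    Assumptions: exact ties go to the smaller magnitude, and NaN inputs
    map to a NaN code. Neither is confirmed for breccia.
    """
    x = np.atleast_1d(np.asarray(x, dtype=np.float32))
    mag = np.minimum(np.abs(x), np.float32(448.0))  # saturate (also maps Inf to 448)
    candidates = E4M3_LUT[:127]                     # codes 0..126: finite, ascending
    code = np.abs(mag[..., None] - candidates).argmin(axis=-1).astype(np.uint8)
    code = np.where(np.isnan(x), np.uint8(0x7F), code)
    sign_bit = np.signbit(x).astype(np.uint8) << 7
    return (code | sign_bit).astype(np.uint8)

codes = encode_e4m3_sketch(np.array([0.3, 1.0, 1000.0, -1000.0]))
print(decode_e4m3_sketch(codes))  # values 0.3125, 1.0, 448.0, -448.0
```

Decoding every finite code and re-encoding it returns the same byte, which is a quick sanity check for any LUT-based pair of this kind.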

FP8 E5M2 (IEEE 754 compatible)

bit:    7 | 6 5 4 3 2 | 1 0
        S |     E     |  M
Field      Bits   Bias   Notes
Sign       1
Exponent   5      15
Mantissa   2

  • Normal: value = (-1)^S * 2^(E - 15) * (1 + M / 4) for E ∈ [1, 30]
  • Subnormal: value = (-1)^S * 2^-14 * (M / 4) for E = 0
  • ±Infinity: E = 31, M = 0
  • NaN: E = 31, M ∈ {1, 2, 3}
  • Max normal: 57344.0 (S=0, E=30, M=3)
  • Min subnormal: 2^-16 ≈ 1.53 × 10^-5

Compared with E4M3, E5M2 trades one mantissa bit for one exponent bit: wider range, less precision. Typical use: gradient tensors in the backward pass, and weights with a very wide dynamic range.
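
E5M2 has the same sign bit, 5-bit exponent, and bias of 15 as IEEE 754 binary16, so an E5M2 byte is exactly the high byte of a float16. The sketch below uses that to decode without a table. `decode_e5m2_sketch` is a hypothetical name, and breccia itself may decode via the LUT instead.

```python
import numpy as np

def decode_e5m2_sketch(b: np.ndarray) -> np.ndarray:
    """Decode E5M2 bytes by placing each in the high byte of a binary16 word."""
    codes = np.atleast_1d(np.asarray(b, dtype=np.uint8))
    half_bits = codes.astype(np.uint16) << 8   # low 8 mantissa bits are zero
    return half_bits.view(np.float16).astype(np.float32)

# max normal, min subnormal, +Inf, -Inf, NaN
sample = np.array([0x7B, 0x01, 0x7C, 0xFC, 0x7D], dtype=np.uint8)
decoded = decode_e5m2_sketch(sample)
print(decoded)
```

The reverse direction is not a plain byte truncation: dropping the low byte of a float16 rounds toward zero, so a round-to-nearest encoder needs an explicit rounding step.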

FP4 E2M1 (NVFP4 data format)

bit:    3 | 2 1 | 0
        S |  E  | M
Field      Bits   Bias
Sign       1
Exponent   2      1
Mantissa   1

  • Normal: value = (-1)^S * 2^(E - 1) * (1 + M / 2) for E ∈ [1, 3]
  • Subnormal: value = (-1)^S * (M / 2) for E = 0
  • No Infinity, no NaN
  • Max: 6.0

The full 16-value table (8 positive + 8 negative):

0, 0.5, 1, 1.5, 2, 3, 4, 6,
-0, -0.5, -1, -1.5, -2, -3, -4, -6
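
The table follows mechanically from the normal and subnormal rules. A minimal sketch that regenerates it from the 16 nibble codes (`decode_e2m1_nibble` is a hypothetical helper for illustration):

```python
def decode_e2m1_nibble(n: int) -> float:
    """Decode one 4-bit E2M1 code from its S/E/M fields."""
    sign = -1.0 if (n >> 3) & 1 else 1.0
    exp = (n >> 1) & 0x3
    man = n & 0x1
    if exp == 0:
        return sign * (man / 2)                  # subnormal: 0 or 0.5
    return sign * 2.0 ** (exp - 1) * (1 + man / 2)

E2M1_TABLE = [decode_e2m1_nibble(n) for n in range(16)]
print(E2M1_TABLE[:8])  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(E2M1_TABLE[8:])  # [-0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
```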

In breccia's ScaledTensor.data field, FP4 values are stored one-per-byte (low 4 bits used) for simplicity. For compact storage in checkpoints, breccia.bridges packs two nibbles per byte via pack_nibbles — documented in bridges.md.
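
To make the two layouts concrete, here is a sketch of nibble packing. The function names are hypothetical and the nibble order (first value in the low nibble) is an assumption; bridges.md is the authority on what pack_nibbles actually does.

```python
import numpy as np

def pack_nibbles_sketch(codes: np.ndarray) -> np.ndarray:
    """Pack one-per-byte 4-bit codes into two per byte.

    Assumes an even number of codes and low-nibble-first order.
    """
    flat = (np.asarray(codes, dtype=np.uint8) & 0x0F).reshape(-1)
    if flat.size % 2:
        raise ValueError("need an even number of 4-bit codes")
    return (flat[0::2] | (flat[1::2] << 4)).astype(np.uint8)

def unpack_nibbles_sketch(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_nibbles_sketch: one 4-bit code per output byte."""
    packed = np.asarray(packed, dtype=np.uint8).reshape(-1)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

codes = np.array([0x1, 0xF, 0x7, 0x8], dtype=np.uint8)
packed = pack_nibbles_sketch(codes)
print(packed.nbytes, codes.nbytes)  # 2 4
```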

E8M0 (OCP MX scale format)

bit:    7 6 5 4 3 2 1 0
            E (all 8 bits)

No sign, no mantissa. Pure exponent encoding:

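value = 2 ^ (byte - 127)

Byte   Value
0      2^-127 (below the fp32 normal range, so it is representable only as an fp32 subnormal; = 0 in practice wherever subnormals are flushed)
127    2^0 = 1.0
128    2^1 = 2.0
255    2^128 (overflows fp32; treated as Inf. Note that the OCP MX specification reserves this byte for NaN.)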

Used by MXFP8BlockScaling as the per-block scale. The trade-off: power-of-two scales are less precise than fp32 but use 1 byte per block (vs 4 for fp32) — a 4× scale-buffer reduction.
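
A sketch of the decode and of the buffer-size arithmetic, using plain NumPy. `decode_e8m0_sketch` is a hypothetical name, and the edge-byte behaviour shown is what float32 arithmetic produces on its own, which may differ from how breccia special-cases bytes 0 and 255.

```python
import numpy as np

def decode_e8m0_sketch(b: np.ndarray) -> np.ndarray:
    """Compute 2 ** (byte - 127) in float32."""
    codes = np.atleast_1d(np.asarray(b, dtype=np.uint8))
    exp = codes.astype(np.int32) - 127
    ones = np.ones(exp.shape, dtype=np.float32)
    with np.errstate(over="ignore"):   # byte 255 overflows float32 to +Inf
        return np.ldexp(ones, exp)

scale_bytes = np.array([0, 127, 128, 255], dtype=np.uint8)
scales = decode_e8m0_sketch(scale_bytes)
# byte 0 -> tiny subnormal (or 0 if flushed), 127 -> 1.0, 128 -> 2.0, 255 -> inf
print(scales)

# One byte per block versus four for a float32 scale.
print(scale_bytes.nbytes, scales.nbytes)  # 4 16
```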

INT4

Two interpretations, both stored in the low 4 bits of a uint8 byte:

Mode                        Range      Bit pattern
Signed (two's complement)   [-8, 7]    0b1000 → -8, 0b0111 → 7
Unsigned                    [0, 15]    0b0000 → 0, 0b1111 → 15

encode_int4(x: ndarray, signed: bool = True) -> uint8 ndarray
decode_int4(b: ndarray, signed: bool = True) -> float32 ndarray

Compact storage (2 nibbles per byte) via pack_nibbles / unpack_nibbles.
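
The two interpretations differ only in how the top bit of the nibble is read. The sketch below shows one possible encode/decode pair. The `_sketch` names are hypothetical, and the rounding (round-half-to-even) and saturation to the range limits are assumptions, since this document does not state how encode_int4 treats out-of-range or fractional inputs. NaN inputs are not handled.

```python
import numpy as np

def encode_int4_sketch(x: np.ndarray, signed: bool = True) -> np.ndarray:
    """Round, clamp to the 4-bit range, and keep the low nibble."""
    lo, hi = (-8, 7) if signed else (0, 15)
    q = np.clip(np.rint(np.asarray(x, dtype=np.float32)), lo, hi).astype(np.int8)
    return (q & 0x0F).astype(np.uint8)   # two's complement: -8 -> 0b1000, -1 -> 0b1111

def decode_int4_sketch(b: np.ndarray, signed: bool = True) -> np.ndarray:
    """Read the low nibble as a signed or unsigned integer, returned as float32."""
    v = (np.asarray(b, dtype=np.uint8) & 0x0F).astype(np.int16)
    if signed:
        v = (v ^ 8) - 8                  # sign-extend bit 3
    return v.astype(np.float32)

nibbles = encode_int4_sketch(np.array([-8, -1, 0, 7, 100, -100]))
print(nibbles.tolist())                      # [8, 15, 0, 7, 7, 8]
print(decode_int4_sketch(nibbles).tolist())  # [-8.0, -1.0, 0.0, 7.0, 7.0, -8.0]
```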

Scale dtypes summary

Different recipes use different scale dtypes. Breccia stores each scale in its declared native dtype:

Recipe                                                     Scale dtype on disk
DelayedScaling, Float8CurrentScaling, Float8BlockScaling   float32
MXFP8BlockScaling                                          uint8 (E8M0 encoded)
NVFP4BlockScaling                                          uint8 (FP8 E4M3 encoded)
INT4Scaling                                                float16 / bfloat16 / float32 (configurable)

The dequantization function for each recipe knows how to decode its scale dtype back to float32 before multiplying with the decoded data.
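
As an illustration of that decode-then-multiply step, here is a sketch for an MXFP8-style layout: E4M3 data bytes with one E8M0 scale byte per block. Everything named here is hypothetical (`dequantize_mx_sketch`, `BLOCK`, the argument layout), and the block size of 32 is an assumption taken from the OCP MX convention rather than from this document.

```python
import numpy as np

BLOCK = 32  # assumed elements per scale block (not stated in this document)

def _e4m3_lut() -> np.ndarray:
    """256-entry float32 decode table for E4M3."""
    b = np.arange(256, dtype=np.int32)
    exp, man = (b >> 3) & 0xF, b & 0x7
    mag = np.where(exp == 0, (man / 8.0) * 2.0 ** -6,
                   np.ldexp(1.0 + man / 8.0, exp - 7))
    lut = np.where(b & 0x80, -mag, mag)
    lut[(exp == 15) & (man == 7)] = np.nan
    return lut.astype(np.float32)

E4M3_LUT = _e4m3_lut()

def dequantize_mx_sketch(data_u8: np.ndarray, scale_u8: np.ndarray) -> np.ndarray:
    """data_u8: (..., n_blocks * BLOCK) E4M3 codes; scale_u8: (..., n_blocks) E8M0 bytes."""
    values = E4M3_LUT[data_u8]                          # decode data to float32
    exp = scale_u8.astype(np.int32) - 127
    scale = np.ldexp(np.ones(exp.shape, np.float32), exp)  # decode scale to float32
    blocks = values.reshape(*scale_u8.shape, BLOCK)
    return (blocks * scale[..., None]).reshape(values.shape)

data = np.full(2 * BLOCK, 0x38, dtype=np.uint8)         # every element encodes 1.0
scale = np.array([128, 126], dtype=np.uint8)            # 2.0 for block 0, 0.5 for block 1
out = dequantize_mx_sketch(data, scale)
print(out[0], out[BLOCK])  # 2.0 0.5
```

The other recipes follow the same shape with a different scale decoder: a float32 scale needs no decoding, and an E4M3-encoded scale goes through the same LUT as the data.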

Where this is implemented

src/breccia/_formats.py contains:

  • _build_e4m3_lut(), _build_e5m2_lut(), _build_e2m1_lut() — build the decode tables at import time
  • encode_* and decode_* for each format
  • pack_nibbles / unpack_nibbles for 4-bit compact storage
  • Module constants: E4M3_MAX = 448.0, E5M2_MAX = 57344.0, E2M1_MAX = 6.0

Why these specific formats (and not others)

Format            Why included
FP8 E4M3 / E5M2   Used by TransformerEngine (NVIDIA), torchao (PyTorch), CUDA cuBLAS FP8 GEMM
FP4 E2M1          NVIDIA Blackwell NVFP4; also OCP MX FP4
E8M0              The block-scale encoding for OCP MX FP8 / FP6 / FP4
INT4              GPTQ / AWQ / SmoothQuant family; deployed in nearly every open inference engine

Formats not in v0.0.1:

  • FP6 (OCP MX FP6 E3M2 / E2M3): to be added once there is a stable kernel target.
  • INT8: covered well by torchao; bridge support is sufficient.
  • INT2 / binary: niche.