Formats¶
Bit-level reference for every low-precision format breccia handles. This is the document to consult when implementing a new kernel, debugging a checkpoint, or wiring breccia into a vendor library.
For recipes (which formats compose with which scale layouts), see recipes.md. For kernels (what runs at compute time), see kernels.md.
FP8 E4M3 (Open Compute "E4M3FN")¶
| Field | Bits | Bias | Notes |
|---|---|---|---|
| Sign | 1 | — | Signed |
| Exponent | 4 | 7 | |
| Mantissa | 3 | — | |
- Normal: `value = (-1)^S * 2^(E - 7) * (1 + M / 8)` for `E ∈ [1, 15]` (the pattern `E = 15, M = 7` is reserved for NaN)
- Subnormal: `value = (-1)^S * 2^-6 * (M / 8)` for `E = 0`
- NaN: byte `0b01111111` (`E = 15`, `M = 7`), plus `0b11111111` with the sign bit set. No Infinity.
- Max normal: `448.0` (`S=0, E=15, M=6`)
- Min subnormal: `2^-9 ≈ 1.95 × 10^-3`
256 possible bytes → 254 non-NaN encodings (counting +0 and -0 separately) + 2 NaN codes (one for each sign).
Helpers in breccia._formats:
encode_e4m3(x: np.ndarray) -> uint8 ndarray # round-to-nearest, saturating
decode_e4m3(b: np.ndarray) -> float32 ndarray # via 256-entry LUT
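The field formulas above are enough to build a 256-entry decode table by hand. The following is a minimal sketch, not breccia's actual `_build_e4m3_lut` or `encode_e4m3`: the names `build_e4m3_lut_sketch`, `decode_e4m3_sketch` and `encode_e4m3_sketch` are hypothetical, and the encoder uses a brute-force nearest-value search whose tie-breaking (lower magnitude wins) is an assumption rather than true round-half-to-even.

```python
import numpy as np

def build_e4m3_lut_sketch() -> np.ndarray:
    """Decode table for all 256 E4M3FN bytes, built from the field formulas."""
    lut = np.empty(256, dtype=np.float32)
    for byte in range(256):
        s = byte >> 7            # 1 sign bit
        e = (byte >> 3) & 0xF    # 4 exponent bits, bias 7
        m = byte & 0x7           # 3 mantissa bits
        if e == 15 and m == 7:
            value = float("nan")                   # the only NaN pattern per sign
        elif e == 0:
            value = 2.0 ** -6 * (m / 8)            # subnormal
        else:
            value = 2.0 ** (e - 7) * (1 + m / 8)   # normal, includes E = 15, M <= 6
        lut[byte] = -value if s else value
    return lut

E4M3_LUT = build_e4m3_lut_sketch()

def decode_e4m3_sketch(b) -> np.ndarray:
    # Decoding is a plain table lookup.
    return E4M3_LUT[np.asarray(b, dtype=np.uint8)]

def encode_e4m3_sketch(x) -> np.ndarray:
    """Nearest-value, saturating encode (assumption: ties go to the lower magnitude)."""
    x = np.asarray(x, dtype=np.float32)
    pos = E4M3_LUT[:127]                        # bytes 0x00..0x7E, ascending magnitudes
    mag = np.minimum(np.abs(x), pos[-1])        # saturate at 448.0
    idx = np.abs(mag[..., None] - pos).argmin(axis=-1).astype(np.uint8)
    sign = np.signbit(x).astype(np.uint8) << np.uint8(7)
    codes = idx | sign
    return np.where(np.isnan(x), np.uint8(0x7F), codes)
```

The brute-force search costs 127 comparisons per element, which is fine for checking a checkpoint by hand but not for a production kernel.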
FP8 E5M2 (IEEE 754 compatible)¶
| Field | Bits | Bias | Notes |
|---|---|---|---|
| Sign | 1 | — | |
| Exponent | 5 | 15 | |
| Mantissa | 2 | — | |
- Normal: `value = (-1)^S * 2^(E - 15) * (1 + M / 4)` for `E ∈ [1, 30]`
- Subnormal: `value = (-1)^S * 2^-14 * (M / 4)` for `E = 0`
- ±Infinity: `E = 31, M = 0`
- NaN: `E = 31, M ∈ {1, 2, 3}`
- Max normal: `57344.0` (`S=0, E=30, M=3`)
- Min subnormal: `2^-16 ≈ 1.53 × 10^-5`
Wider range than E4M3, fewer mantissa bits → less precision. Typical use: gradient tensors in backward pass, weights with very wide dynamic range.
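Because E5M2 has the same sign, exponent width and bias as IEEE 754 binary16, an E5M2 byte is simply the top 8 bits of an fp16 value. A minimal sketch of a decode that relies on this (the name `decode_e5m2_sketch` is hypothetical, and breccia's own decoder may use a lookup table instead):

```python
import numpy as np

def decode_e5m2_sketch(b) -> np.ndarray:
    """Decode E5M2 bytes by placing them in the high byte of an fp16 bit pattern."""
    # Shift each byte into bits 15..8 of a uint16; the low 8 mantissa bits stay zero.
    bits = np.asarray(b, dtype=np.uint8).astype(np.uint16) << np.uint16(8)
    # Reinterpret the 16-bit patterns as float16, then widen to float32.
    return bits.view(np.float16).astype(np.float32)
```

Going the other way by dropping the low byte of an fp16 value truncates toward zero, so a round-to-nearest encoder needs an explicit rounding step.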
FP4 E2M1 (NVFP4 data format)¶
| Field | Bits | Bias |
|---|---|---|
| Sign | 1 | — |
| Exponent | 2 | 1 |
| Mantissa | 1 | — |
- Normal: `value = (-1)^S * 2^(E - 1) * (1 + M / 2)` for `E ∈ [1, 3]`
- Subnormal: `value = (-1)^S * (M / 2)` for `E = 0`
- No Infinity, no NaN
- Max: `6.0`
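The full 16-value table (8 positive + 8 negative), with bits ordered sign, exponent, mantissa:
| Nibble | Value | Nibble | Value |
|---|---|---|---|
| 0b0000 | +0.0 | 0b1000 | -0.0 |
| 0b0001 | +0.5 | 0b1001 | -0.5 |
| 0b0010 | +1.0 | 0b1010 | -1.0 |
| 0b0011 | +1.5 | 0b1011 | -1.5 |
| 0b0100 | +2.0 | 0b1100 | -2.0 |
| 0b0101 | +3.0 | 0b1101 | -3.0 |
| 0b0110 | +4.0 | 0b1110 | -4.0 |
| 0b0111 | +6.0 | 0b1111 | -6.0 |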
In breccia's ScaledTensor.data field, FP4 values are stored one-per-byte
(low 4 bits used) for simplicity. For compact storage in checkpoints,
breccia.bridges packs two nibbles per byte via pack_nibbles —
documented in bridges.md.
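To illustrate the difference between the one-per-byte layout and the packed layout, here is a minimal sketch of nibble packing for 1-D arrays. It is not the `pack_nibbles` in `breccia.bridges`: the function names are hypothetical, and the nibble order (even-indexed element in the low nibble) and zero padding for odd lengths are assumptions; bridges.md is the authority on the real layout.

```python
import numpy as np

# Decode table for the 16 E2M1 codes: index = nibble, sign bit is bit 3.
_E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E2M1_LUT_SKETCH = np.concatenate([_E2M1_POS, -_E2M1_POS])

def pack_nibbles_sketch(codes) -> np.ndarray:
    """Pack 1-D 4-bit codes (one per byte) into two codes per byte."""
    codes = np.asarray(codes, dtype=np.uint8) & np.uint8(0x0F)
    if codes.size % 2:
        codes = np.append(codes, np.uint8(0))   # pad odd lengths with a zero nibble
    # Assumption: even-indexed element goes in the low nibble.
    return codes[0::2] | (codes[1::2] << np.uint8(4))

def unpack_nibbles_sketch(packed, count: int) -> np.ndarray:
    """Inverse of pack_nibbles_sketch; `count` drops the padding nibble if any."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & np.uint8(0x0F)
    out[1::2] = packed >> np.uint8(4)
    return out[:count]
```

The caller has to keep the original element count somewhere, since the packed buffer alone cannot distinguish a padding nibble from a real `+0.0`.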
E8M0 (OCP MX scale format)¶
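No sign, no mantissa. Pure exponent encoding with bias 127, so `value = 2^(byte - 127)`:
| Byte | Value |
|---|---|
| 0 | 2^-127 (below the smallest fp32 normal, 2^-126; it is an fp32 subnormal and becomes 0 wherever subnormals are flushed) |
| 127 | 2^0 = 1.0 |
| 128 | 2^1 = 2.0 |
| 255 | 2^128 nominally (overflows fp32; treated as Inf). Note that the OCP MX specification reserves this byte for NaN |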
Used by MXFP8BlockScaling as the per-block scale. The trade-off:
power-of-two scales are less precise than fp32 but use 1 byte per block
(vs 4 for fp32) — a 4× scale-buffer reduction.
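A minimal sketch of the byte-to-scale mapping and one possible way to pick a power-of-two scale. The function names are hypothetical, the decoder works in float64 so that both ends of the range stay representable, and rounding the scale down to the nearest power of two is an assumption (a recipe may round up instead to avoid clipping).

```python
import numpy as np

def decode_e8m0_sketch(b) -> np.ndarray:
    """value = 2^(byte - 127), computed exactly in float64."""
    exps = np.asarray(b, dtype=np.uint8).astype(np.int32) - 127
    return np.ldexp(np.ones(exps.shape, dtype=np.float64), exps)

def encode_e8m0_sketch(scale) -> np.ndarray:
    """Largest power of two <= scale, for positive finite scales (assumption)."""
    scale = np.asarray(scale, dtype=np.float64)
    # frexp returns scale = m * 2^e with m in [0.5, 1), so floor(log2(scale)) = e - 1.
    _, e = np.frexp(scale)
    # Clip to 0..254 so the top byte is never produced by this sketch.
    return np.clip(e - 1 + 127, 0, 254).astype(np.uint8)
```

Using `frexp` rather than `log2` keeps the exponent extraction exact for values that sit right at a power of two.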
INT4¶
Two interpretations, both stored in the low 4 bits of a uint8 byte:
| Mode | Range | Bit pattern |
|---|---|---|
| Signed (two's complement) | [-8, 7] | 0b1000 → -8, 0b0111 → 7 |
| Unsigned | [0, 15] | 0b0000 → 0, 0b1111 → 15 |
encode_int4(x: ndarray, signed: bool = True) -> uint8 ndarray
decode_int4(b: ndarray, signed: bool = True) -> float32 ndarray
Compact storage (2 nibbles per byte) via pack_nibbles / unpack_nibbles.
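The two interpretations differ only in how the top bit of the nibble is read. A minimal sketch of both directions, assuming the input is already divided by its scale and that out-of-range values saturate (the `_sketch` names are hypothetical and the rounding mode of breccia's own `encode_int4` is not specified here):

```python
import numpy as np

def encode_int4_sketch(x, signed: bool = True) -> np.ndarray:
    """Round to the nearest integer, clamp to the INT4 range, keep the low 4 bits."""
    lo, hi = (-8, 7) if signed else (0, 15)
    q = np.clip(np.rint(np.asarray(x, dtype=np.float32)), lo, hi).astype(np.int16)
    # Masking a negative two's-complement value leaves its 4-bit pattern.
    return (q & 0x0F).astype(np.uint8)

def decode_int4_sketch(b, signed: bool = True) -> np.ndarray:
    """Read the low nibble; sign-extend bit 3 when `signed` is set."""
    n = (np.asarray(b, dtype=np.uint8) & 0x0F).astype(np.int16)
    if signed:
        n = (n ^ 8) - 8   # maps 8..15 to -8..-1 and leaves 0..7 unchanged
    return n.astype(np.float32)
```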
Scale dtypes summary¶
Different recipes use different scale dtypes. Breccia stores each scale in its declared native dtype:
| Recipe | Scale dtype on disk |
|---|---|
| DelayedScaling, Float8CurrentScaling, Float8BlockScaling | float32 |
| MXFP8BlockScaling | uint8 (E8M0 encoded) |
| NVFP4BlockScaling | uint8 (FP8 E4M3 encoded) |
| INT4Scaling | float16 / bfloat16 / float32 (configurable) |
The dequantization function for each recipe knows how to decode its scale dtype back to float32 before multiplying with the decoded data.
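As an illustration of that decode-then-multiply step, here is a minimal sketch for a 1-D tensor with E8M0 block scales. The function name, the flat layout, and the block size of 32 are assumptions made for the example; the actual layouts are described in recipes.md.

```python
import numpy as np

def dequantize_mx_sketch(decoded, scale_bytes, block_size: int = 32) -> np.ndarray:
    """Multiply already-decoded data by per-block E8M0 scales.

    decoded     : 1-D float array, length = len(scale_bytes) * block_size
    scale_bytes : 1-D uint8 array, one E8M0 byte per block
    """
    decoded = np.asarray(decoded, dtype=np.float32)
    # Step 1: decode each scale byte to float32 as 2^(byte - 127).
    exps = np.asarray(scale_bytes, dtype=np.uint8).astype(np.int32) - 127
    scales = np.ldexp(np.ones(exps.shape, dtype=np.float32), exps)
    # Step 2: broadcast one scale across each block of data.
    blocks = decoded.reshape(-1, block_size)
    return (blocks * scales[:, None]).reshape(decoded.shape)
```

The same two-step shape applies to the other recipes; only the scale decoder changes (identity for float32 scales, an E4M3 lookup for NVFP4BlockScaling).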
Where this is implemented¶
src/breccia/_formats.py contains:
- `_build_e4m3_lut()`, `_build_e5m2_lut()`, `_build_e2m1_lut()`: build the decode tables at import time
- `encode_*` and `decode_*` for each format
- `pack_nibbles` / `unpack_nibbles` for 4-bit compact storage
- Module constants: `E4M3_MAX = 448.0`, `E5M2_MAX = 57344.0`, `E2M1_MAX = 6.0`
Why these specific formats (and not others)¶
| Format | Why included |
|---|---|
| FP8 E4M3 / E5M2 | Used by TransformerEngine (NVIDIA), torchao (PyTorch), CUDA cuBLAS FP8 GEMM |
| FP4 E2M1 | NVIDIA Blackwell NVFP4; also OCP MX FP4 |
| E8M0 | The block-scale encoding for OCP MX FP8 / FP6 / FP4 |
| INT4 | GPTQ / AWQ / SmoothQuant family; deployed in nearly every open inference engine |
Formats not in v0.0.1:
- FP6 (OCP MX FP6 E3M2 / E2M3): to be added when there's a stable kernel target.
- INT8: covered well by torchao; bridge support is sufficient.
- INT2 / binary: niche.