Recipes

breccia ships six ScalingRecipe variants in v0.0.1. Each is declarative metadata — the what, not the how. The quantization algorithm dispatches on recipe type inside breccia.cast.

At a glance

| Recipe | Format | Block / scope | Scale dtype | When to reach for it |
|---|---|---|---|---|
| DelayedScaling | FP8 E4M3 or E5M2 | per-tensor | fp32 | FP8 training with TE-style amax history |
| Float8CurrentScaling | FP8 E4M3 or E5M2 | per-tensor (synchronous) | fp32 | FP8 inference; FP8 forward in simple training loops |
| Float8BlockScaling | FP8 E4M3 or E5M2 | per-128-block along K | fp32 | FP8 weights with varying row magnitudes (DeepSeek-v3) |
| MXFP8BlockScaling | FP8 (E4M3/E5M2) + E8M0 scale | per-32-block along K | E8M0 (uint8) | Hardware microscaling on Blackwell / future MX-capable chips |
| NVFP4BlockScaling | FP4 E2M1 + FP8 E4M3 scale | per-16-block along K | E4M3 (uint8) | Blackwell-class FP4 inference; FP4 training |
| INT4Scaling | INT4 (signed or unsigned) | per-group along K | fp16 / bf16 / fp32 | INT4 weight-only inference (GPTQ / AWQ family) |
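To put the block sizes and scale dtypes from the table in numbers, the sketch below amortizes one scale over its block and reports the average storage cost per value. It is plain arithmetic, not a breccia API: `bits_per_value` is a hypothetical helper, the INT4 row assumes `group_size=128` with an fp16 scale, and padding or any extra per-tensor metadata is ignored.

```python
# (element_bits, scale_bits, block_size) for each block-scaled recipe,
# taken from the table above. The INT4 row is an example configuration.
RECIPES = {
    "Float8BlockScaling": (8, 32, 128),  # fp32 scale per 128 values
    "MXFP8BlockScaling": (8, 8, 32),     # E8M0 scale per 32 values
    "NVFP4BlockScaling": (4, 8, 16),     # E4M3 scale per 16 values
    "INT4Scaling": (4, 16, 128),         # fp16 scale per 128 values
}


def bits_per_value(element_bits, scale_bits, block_size):
    """Average storage per element when one scale is shared by a block."""
    return element_bits + scale_bits / block_size


for name, spec in RECIPES.items():
    print(f"{name:20s} {bits_per_value(*spec):.3f} bits/value")
```

Per-tensor recipes (DelayedScaling, Float8CurrentScaling) carry a single scale for the whole tensor, so their overhead is negligible and they are left out.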

DelayedScaling

from breccia import DelayedScaling

DelayedScaling(
    fp8_format="E4M3",       # "E4M3" or "E5M2"
    amax_history_len=16,     # how many previous amax values the training loop retains
    margin=0,                # power-of-two exponent margin (TE default = 0)
)

The classic TransformerEngine recipe. Uses a rolling history of recent amax values to compute the next scale, avoiding a synchronous reduction on every cast.

The history itself lives outside the recipe (in your training loop's optimizer state). The recipe is portable metadata.

Use when: you're training in FP8 and want TE-comparable throughput.
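A minimal sketch of how a training loop might turn an amax history into the next scale. This is not breccia's implementation: `delayed_scale` is a hypothetical helper, taking the maximum over the history is one common reduction choice (an assumption here), and the margin is applied as a division by `2 ** margin`.

```python
from collections import deque

# Largest finite magnitude of each FP8 format.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def delayed_scale(amax_history, fp8_format="E4M3", margin=0):
    """Scale for the next cast, using only amax values seen on earlier steps."""
    amax = max(amax_history, default=0.0)
    if amax <= 0.0:
        return 1.0  # nothing observed yet: fall back to an identity scale
    return FP8_MAX[fp8_format] / amax / (2.0 ** margin)


# The history lives in the training loop, not in the recipe.
history = deque(maxlen=16)  # mirrors amax_history_len=16
scales = []

for step_amax in [2.0, 3.5, 1.0]:
    scales.append(delayed_scale(history))  # no reduction over the current tensor
    history.append(step_amax)              # record this step's amax for later casts
```

Because each scale depends only on past steps, the cast never waits on a reduction over the tensor it is about to quantize; the cost is that a sudden jump in magnitude is only reflected one step later.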

Float8CurrentScaling

from breccia import Float8CurrentScaling

Float8CurrentScaling(fp8_format="E4M3")

The simplest FP8 recipe: compute amax(x) on every cast and multiply by fp8_max / amax. Each cast is synchronous (it needs a reduction over the whole tensor), so it is slightly slower than delayed scaling in training, but it is trivial to reason about and the scale always matches the tensor being cast.

Use when: FP8 inference; quick experiments; small-batch training where the reduction is cheap.
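The current-scaling arithmetic can be sketched in a few lines of plain Python. The helper names are hypothetical and the final rounding onto the FP8 grid is deliberately omitted; the point is only that the largest element lands exactly on the format's maximum.

```python
# Largest finite magnitude of each FP8 format.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def current_scale(values, fp8_format="E4M3"):
    """fp8_max / amax, computed from the tensor being cast right now."""
    amax = max((abs(v) for v in values), default=0.0)
    return FP8_MAX[fp8_format] / amax if amax > 0.0 else 1.0


def scale_and_clamp(values, scale, fp8_format="E4M3"):
    """Apply the scale and clamp to the representable range (no FP8 rounding)."""
    limit = FP8_MAX[fp8_format]
    return [max(-limit, min(limit, v * scale)) for v in values]


x = [0.02, -0.5, 0.125, 0.3]
scale = current_scale(x)            # 448 / 0.5
scaled = scale_and_clamp(x, scale)  # largest magnitude maps to 448
```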

Float8BlockScaling

from breccia import Float8BlockScaling

Float8BlockScaling(fp8_format="E4M3", block_k=128)

One scale per block_k-element block along the K (contraction) dim. Better dynamic range than per-tensor scaling when row magnitudes vary substantially — common in attention weights and FFN gates.

DeepSeek-v3 ships its FP8 weights with block_k=128 and E4M3 format; breccia.bridges.from_deepseek_v3 is a thin wrapper.

Use when: weight matrices where some rows are systematically larger than others (post-norm features, gated linear units).
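To illustrate why per-block scales help, the sketch below computes one scale per `block_k` elements of a single row and compares it with a per-tensor scale. `block_scales` is a hypothetical helper operating on a plain Python list; it is not how breccia.cast is implemented.

```python
FP8_E4M3_MAX = 448.0


def block_scales(row, block_k=128):
    """One fp8_max / amax scale per block_k-element block along K."""
    scales = []
    for start in range(0, len(row), block_k):
        block = row[start:start + block_k]
        amax = max(abs(v) for v in block)
        scales.append(FP8_E4M3_MAX / amax if amax > 0.0 else 1.0)
    return scales


# Two blocks of 128: the second one contains a single outlier.
row = [0.0625] * 128 + [0.0625] * 127 + [8.0]

per_block = block_scales(row)
per_tensor = FP8_E4M3_MAX / max(abs(v) for v in row)
```

With a per-tensor scale the outlier forces every element to use the small scale; with block scaling only the block that contains it does, and the first block keeps a scale 128 times larger.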

MXFP8BlockScaling

from breccia import MXFP8BlockScaling

MXFP8BlockScaling(fp8_format="E4M3", block_size=32)  # block_size fixed at 32 by spec

The OCP MX microscaling standard for FP8. Two key properties fixed by the standard:

  • block_size == 32 (hardware-locked)
  • scale is E8M0 — an 8-bit unsigned exponent encoding a power-of-two scale 2^(byte - 127)

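This is the format Blackwell-class hardware and future MX-aware silicon accelerate natively. The trade-off vs Float8BlockScaling: the scale is restricted to powers of two (less precise) but takes 1 byte per block instead of 4. With the default block sizes (32 vs 128) the amortized scale cost is the same, 0.25 bits per value, so MXFP8 buys 4× finer scale granularity rather than a smaller footprint.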

Use when: targeting MX-capable hardware where the power-of-two scale is a hardware-native operation.
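The E8M0 encoding described above can be sketched as follows. The decode rule `2^(byte - 127)` comes from this page; the encode rule (smallest power of two that keeps the block within E4M3 range) is an illustrative assumption, and the OCP specification defines its own conversion, which may round differently. Both function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0
E8M0_BIAS = 127


def e8m0_encode(block):
    """Smallest power-of-two scale such that amax / scale fits in E4M3."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return E8M0_BIAS  # scale 2**0 for an all-zero block
    # ratio == mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(amax / FP8_E4M3_MAX)
    # ceil(log2(ratio)) without relying on a floating-point logarithm
    power = exponent - 1 if mantissa == 0.5 else exponent
    return max(0, min(254, power + E8M0_BIAS))  # byte 255 encodes NaN


def e8m0_decode(byte):
    """Power-of-two scale 2**(byte - 127); x is recovered as q * scale."""
    return 2.0 ** (byte - E8M0_BIAS)


block = [0.5] * 31 + [1000.0]   # one MX block of 32 values
byte = e8m0_encode(block)
scale = e8m0_decode(byte)
```

Here the ideal scale would be 1000 / 448 ≈ 2.23, but the power-of-two constraint rounds it up to 4.0, which is the precision loss mentioned in the trade-off above.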

NVFP4BlockScaling

from breccia import NVFP4BlockScaling

NVFP4BlockScaling(
    fp4_format="E2M1",    # fixed
    block_size=16,        # fixed by NVIDIA Blackwell spec
    scale_format="E4M3",  # fixed: FP8 E4M3 scale
)

NVIDIA Blackwell's NVFP4 format. FP4 data (16 bit patterns per element, 15 distinct values because +0 and -0 coincide) with a per-16-block FP8 E4M3 scale (one 8-bit code per block).

This is the densest floating-point format breccia ships: 4 bits per value, plus one 8-bit scale per 16 values (4.5 bits per value in total). Cosine similarity vs FP32 stays above 0.95 on Gaussian-distributed inputs, and FP4 training that converges on real workloads has been reported (see the "Pretraining LLMs with NVFP4" arXiv paper).

Use when: targeting Blackwell hardware for inference; experimental FP4 training.
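To make the E2M1 grid concrete, the sketch below decodes all sixteen 4-bit codes and snaps one 16-value block onto that grid. It is a simplification under stated assumptions: the helper names are hypothetical, the block scale is kept as a Python float instead of being rounded to E4M3, and nearest-value snapping stands in for whatever rounding mode the hardware uses.

```python
def e2m1_decode(code):
    """Decode a 4-bit E2M1 code: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        return sign * 0.5 * mantissa  # subnormal: 0 or 0.5
    return sign * (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)


# +0 and -0 collapse into one entry, leaving 15 distinct values.
E2M1_VALUES = sorted({e2m1_decode(code) for code in range(16)})
FP4_MAX = 6.0


def quantize_block(block):
    """Scale a block so amax maps to 6.0, then snap to the nearest E2M1 value."""
    amax = max(abs(v) for v in block)
    scale = amax / FP4_MAX if amax > 0.0 else 1.0  # real NVFP4 stores this as E4M3
    snapped = [min(E2M1_VALUES, key=lambda g: abs(v / scale - g)) for v in block]
    return snapped, scale


block = [3.0, -1.6, 0.7, 12.0] + [0.0] * 12  # one block of 16 values
snapped, scale = quantize_block(block)
restored = [g * scale for g in snapped]
```

The positive half of the grid is 0, 0.5, 1, 1.5, 2, 3, 4, 6, so spacing grows with magnitude; the per-block scale is what keeps small blocks from collapsing to zero.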

INT4Scaling

from breccia import INT4Scaling

INT4Scaling(
    group_size=128,       # values per shared scale
    signed=True,          # -8..7 (signed) or 0..15 (unsigned)
    scale_dtype="fp16",   # "fp16" | "bf16" | "fp32"
)

INT4 quantization with one scale per group along the K dim. This is the GPTQ / AWQ family of recipes used by many open-weight models that ship 4-bit quantized weights.

v0.0.1 supports symmetric quantization only (zero-point = 0). Asymmetric quantization (with a non-zero zero-point) is on the v0.1 list.

Use when: INT4 weight-only inference; loading GPTQ/AWQ-quantized checkpoints; deploying to memory-constrained inference hardware.
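Symmetric group quantization (zero-point = 0) can be sketched as below. The helpers are hypothetical, not breccia functions; mapping amax to code 7 is one common convention and an assumption here, and the scale is kept as a Python float rather than cast to `scale_dtype`.

```python
def int4_quantize_group(group):
    """Symmetric signed INT4: codes in -8..7, zero-point fixed at 0."""
    amax = max(abs(v) for v in group)
    scale = amax / 7.0 if amax > 0.0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in group]
    return codes, scale


def int4_dequantize_group(codes, scale):
    """Recover approximate values: x is roughly code * scale."""
    return [c * scale for c in codes]


def int4_quantize_row(row, group_size=128):
    """Quantize one row along K, one (codes, scale) pair per group."""
    return [
        int4_quantize_group(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


# Tiny groups of 4 so the result is easy to read.
row = [3.5, -1.0, 0.6, 0.0, 70.0, -10.0, 6.0, 0.0]
groups = int4_quantize_row(row, group_size=4)
restored = [v for codes, scale in groups for v in int4_dequantize_group(codes, scale)]
```

Each group gets its own scale, so the large values in the second group do not cost the first group any resolution.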

Choosing a recipe

A rough decision tree:

Am I doing FP4? ──── NVFP4BlockScaling (Blackwell)

Am I doing INT4? ─── INT4Scaling

Am I doing FP8?
├─ training, want TE-style amax history → DelayedScaling
├─ simpler training loops, inference, or experimenting → Float8CurrentScaling
├─ weights with varying row magnitudes (DeepSeek-v3) → Float8BlockScaling
└─ targeting MX-capable hardware → MXFP8BlockScaling
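The same tree can be written as a small function. This is purely illustrative: `choose_recipe` and its parameters are hypothetical, it returns recipe names as strings rather than breccia objects, and the order in which the FP8 branches are checked is an assumption, since the tree above does not rank them.

```python
def choose_recipe(fmt, *, training=False, amax_history=False,
                  varying_row_magnitudes=False, mx_hardware=False):
    """Return the name of the recipe suggested by the decision tree."""
    if fmt == "fp4":
        return "NVFP4BlockScaling"
    if fmt == "int4":
        return "INT4Scaling"
    if fmt != "fp8":
        raise ValueError(f"unsupported format: {fmt!r}")
    # FP8 branches; this precedence is an assumption of the sketch.
    if mx_hardware:
        return "MXFP8BlockScaling"
    if varying_row_magnitudes:
        return "Float8BlockScaling"
    if training and amax_history:
        return "DelayedScaling"
    return "Float8CurrentScaling"


print(choose_recipe("fp8", training=True, amax_history=True))
```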

Switching recipes (requantize)

breccia.requantize(scaled, new_recipe) converts a ScaledTensor between recipes:

import breccia

# Train in MXFP8…
st_mx = breccia.cast(x, breccia.MXFP8BlockScaling())

# …ship as NVFP4 (no model rewrite required):
st_nv = breccia.requantize(st_mx, breccia.NVFP4BlockScaling())

In v0.0.1 requantize is implemented as cast(dequantize(scaled), new_recipe). Direct cross-recipe paths that avoid the FP32 round-trip are an optimization for v0.1.

Comparing recipes

See numerics.md for measured accuracy / range trade-offs and benchmarks.md for memory savings tables.