Concepts

The mental model behind breccia.ScaledTensor. Read this once and the rest of the library stops surprising you.

The data structure

A ScaledTensor is four things:

data    — low-precision bytes (FP8 native dtype, or uint8 for FP4/INT4)
scale   — scale tensor that gives the data its high-precision meaning
recipe  — a ScalingRecipe describing HOW the data was quantized
layout  — a Layout describing HOW the scale maps to data blocks

For a 2-D weight matrix of shape (M, K) = (8, 128) quantized with Float8CurrentScaling:

data:   shape (8, 128), dtype uint8       # one byte per FP8 value
scale:  shape (), dtype float32           # single scalar
recipe: Float8CurrentScaling(fp8_format="E4M3")
layout: PerTensor()

That's it. Every operation in breccia is defined on top of this quadruple.

Why this representation

Every framework today reinvents block-scaled low-precision formats in incompatible ways. The fragmentation is real:

| Approach | Pros | Cons |
| --- | --- | --- |
| NVIDIA TransformerEngine | hardware-tuned for Hopper / Blackwell | NVIDIA-only, 4 non-composable recipe classes |
| PyTorch torchao | autograd-friendly | PyTorch-only |
| DeepSeek-v3 format | proven recipe for FP8 training | private to one repo |
| HuggingFace + custom | huge ecosystem | every model rolls its own scale convention |
| FP8-Flow-MoE, COAT | each solves a specific gap | each is incompatible |

ScaledTensor is what they all converge on, exposed as one neutral type that round-trips with each of them via breccia.bridges.

Why scale is the dequantization scale

Convention varies across libraries. breccia picks the dequantization scale:

data_decoded     = decode(data)                 # decode the low-precision bytes
high_precision_x = data_decoded * scale         # multiply by stored scale to recover

So if x had amax = 10 and we're encoding to FP8 E4M3 (whose max is 448):

scale = amax / fp8_max  = 10 / 448  ≈ 0.0223
data  = encode(x / scale) = encode(x * 44.8)    # values now in FP8 range

This convention matches the OCP MX standard and is exactly how hardware scaled-matmul kernels (FP8 GEMM, NVFP4 GEMM) consume the scale tensor. NVIDIA TransformerEngine calls the same thing _scale_inv.
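
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch, not breccia's implementation: the helper names (`compute_scale`, `to_encoder_range`, `dequantize`) are hypothetical, the actual FP8 cast is left out, and the input is assumed to contain at least one nonzero value.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def compute_scale(values, fmt_max=FP8_E4M3_MAX):
    """Dequantization scale: amax / fmt_max (assumes amax > 0)."""
    amax = max(abs(v) for v in values)
    return amax / fmt_max


def to_encoder_range(values, scale):
    """Divide by the scale so values span the format's range.

    A real encoder would now cast these to FP8; that step is omitted here.
    """
    return [v / scale for v in values]


def dequantize(decoded, scale):
    """Multiply decoded values by the stored scale to recover the original range."""
    return [d * scale for d in decoded]


x = [10.0, -2.5, 0.0, 7.75]
scale = compute_scale(x)             # 10 / 448
scaled = to_encoder_range(x, scale)  # largest magnitude is now about 448
recovered = dequantize(scaled, scale)
```

Because the stored value is the dequantization scale, recovery is a single multiply, which is the form the paragraph above says scaled-matmul kernels consume.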

The invariants

ScaledTensor enforces the following invariants at construction; each violation raises the error shown:

| Invariant | Raised |
| --- | --- |
| data is array-like (has .shape, .dtype) | TypeError("data must be array-like ...") |
| scale is array-like | TypeError("scale must be array-like ...") |
| data.ndim >= 1 | ValueError("data must be at least 1-D ...") |
| recipe is not None | ValueError("recipe is required ...") |
| layout is not None | ValueError("layout is required ...") |
| layout.validate(data, scale) succeeds | ValueError from the layout |

The Layout's .validate(data, scale) method is the single source of truth for the data-vs-scale shape relationship. See recipes.md and api.md for the layouts.
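
To make the order of these checks concrete, here is a rough sketch of a dataclass that enforces the same invariants in `__post_init__`. It is an illustration only: `ScaledTensorSketch`, `PerTensorSketch`, and `fake_array` are hypothetical stand-ins, not breccia classes, and the fake arrays expose only `.shape` and `.dtype` so the example runs without NumPy or PyTorch.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


class PerTensorSketch:
    """Hypothetical per-tensor layout: the scale must be a scalar (shape ())."""

    def validate(self, data, scale):
        if tuple(scale.shape) != ():
            raise ValueError(f"per-tensor scale must have shape (), got {tuple(scale.shape)}")


@dataclass
class ScaledTensorSketch:
    data: Any
    scale: Any
    recipe: Any
    layout: Any

    def __post_init__(self):
        # 1-2: data and scale must look like arrays.
        for name in ("data", "scale"):
            value = getattr(self, name)
            if not (hasattr(value, "shape") and hasattr(value, "dtype")):
                raise TypeError(f"{name} must be array-like (needs .shape and .dtype)")
        # 3: data must have at least one dimension.
        if len(self.data.shape) < 1:
            raise ValueError("data must be at least 1-D")
        # 4-5: recipe and layout are mandatory.
        if self.recipe is None:
            raise ValueError("recipe is required")
        if self.layout is None:
            raise ValueError("layout is required")
        # 6: the layout owns the data-vs-scale shape check.
        self.layout.validate(self.data, self.scale)


def fake_array(shape, dtype):
    """Stand-in for a NumPy / torch array; carries only .shape and .dtype."""
    return SimpleNamespace(shape=shape, dtype=dtype)


ok = ScaledTensorSketch(
    data=fake_array((8, 128), "uint8"),
    scale=fake_array((), "float32"),
    recipe="example-recipe",
    layout=PerTensorSketch(),
)
```

Passing a scale of shape `(8,)` with the per-tensor layout, or a plain list as `data`, fails immediately in the constructor rather than in a later operation.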

Recipes are pure metadata

A recipe is declarative configuration — it carries the format identifier and any recipe-specific parameters (block size, amax history length, zero-point semantics) but contains no quantization behavior itself. All behavior lives in breccia.cast and dispatches on recipe type.

This means recipes are:

  • Frozen dataclasses — immutable and hashable
  • Comparable — two recipes with the same fields are equal
  • JSON-serializable — used by the HuggingFace bridge to round-trip through safetensors metadata
  • Hardware-portable — the same recipe can be implemented by any backend

Layouts are how the scale maps to data blocks

Four layouts cover today's recipe fragmentation:

| Layout | Scale shape for data (M, K) | Used by |
| --- | --- | --- |
| PerTensor | () (single scalar) | DelayedScaling, Float8CurrentScaling |
| PerBlockK(B) | (M, K // B) | Float8BlockScaling, INT4Scaling |
| PerChannel | (M,) or (M, 1) | INT4 row-wise quantization |
| PerBlockMN(Bm, Bn) | (M // Bm, K // Bn) | MXFP8 (1, 32), NVFP4 (1, 16) |

A layout's .validate(data, scale) enforces this contract. The validator is called from ScaledTensor.__post_init__, so an inconsistent (data, scale) pair fails at construction time — not at the next matmul.
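
The shape contract in the table can be written out as small functions. This is a sketch with hypothetical helper names, not breccia's validators, and it assumes that a dimension which does not divide evenly by its block size is rejected; the source does not say how breccia handles that case.

```python
def per_tensor_scale_shape():
    """PerTensor: a single scalar."""
    return ()


def per_block_k_scale_shape(m, k, block):
    """PerBlockK(B): one scale per row per block of B elements along K."""
    if k % block != 0:
        raise ValueError(f"K={k} is not divisible by block size {block}")
    return (m, k // block)


def per_channel_scale_shapes(m):
    """PerChannel: one scale per row; either accepted shape."""
    return [(m,), (m, 1)]


def per_block_mn_scale_shape(m, k, bm, bn):
    """PerBlockMN(Bm, Bn): one scale per (Bm x Bn) tile."""
    if m % bm != 0 or k % bn != 0:
        raise ValueError(f"shape ({m}, {k}) is not divisible by tile ({bm}, {bn})")
    return (m // bm, k // bn)


# For the (8, 128) weight used earlier in this page:
block_k = per_block_k_scale_shape(8, 128, block=32)     # (8, 4)
mxfp8 = per_block_mn_scale_shape(8, 128, bm=1, bn=32)   # (8, 4)
nvfp4 = per_block_mn_scale_shape(8, 128, bm=1, bn=16)   # (8, 8)
```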

The relationship to NVIDIA TransformerEngine

TransformerEngine's Float8Tensor carries _data (uint8 bytes) and _scale_inv (the dequantization scale). The mapping to breccia is direct:

from breccia.bridges import from_transformer_engine, to_transformer_engine

# TE → breccia
st = from_transformer_engine(te_tensor)
# st.data is te_tensor._data, st.scale is te_tensor._scale_inv

# breccia → TE
te_t = to_transformer_engine(st)

See bridges.md.

The relationship to OCP MX (Microscaling)

OCP MX is the open standard for block-scaled low-precision (FP8, FP6, FP4) with an E8M0 (uint8 exponent-only) scale. breccia's MXFP8BlockScaling recipe with PerBlockMN(1, 32) layout is the OCP MX MXFP8 format.

The scale stored is an E8M0 byte; multiplying by 2^(byte - 127) gives the floating-point dequantization scale.
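
As a quick illustration of that formula, the following sketch converts an E8M0 byte into a floating-point scale. The function name is hypothetical, and the sketch applies the formula to every byte value without handling any encodings the OCP MX specification reserves; consult the specification for those.

```python
E8M0_BIAS = 127  # exponent bias from the formula above


def e8m0_to_float(byte):
    """Return 2 ** (byte - 127) for an E8M0 scale byte in [0, 255]."""
    if not 0 <= byte <= 255:
        raise ValueError(f"E8M0 byte must be in [0, 255], got {byte}")
    return 2.0 ** (byte - E8M0_BIAS)


unit = e8m0_to_float(127)    # exponent 0
double = e8m0_to_float(128)  # exponent +1
half = e8m0_to_float(126)    # exponent -1

# Dequantizing one block of decoded values with its shared scale:
decoded_block = [1.5, -3.0, 0.25]
dequantized = [v * e8m0_to_float(125) for v in decoded_block]
```

Because the scale is always a power of two, applying it only shifts the exponent of each value and adds no rounding error as long as the result stays within the normal range of the target type.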

Dispatch and backend selection

ScaledTensor is backend-agnostic: data and scale can be NumPy arrays, PyTorch tensors, MLX arrays, or JAX arrays. The functions in breccia.* detect the backend at runtime and dispatch.

type(st.data).__module__   # 'numpy' | 'torch' | 'mlx.core'

This is implemented with three small helper predicates inside breccia._core:

def _is_torch(x): return type(x).__module__.startswith("torch")
def _is_mlx(x):   return type(x).__module__.startswith("mlx")
def _is_jax(x):   mod = type(x).__module__; return mod.startswith("jax") or mod.startswith("jaxlib")

Anything not torch / MLX / JAX falls into the NumPy code path. No plugin registry; the dispatch table is small enough to inline. See architecture.md for the rationale.

When NOT to use breccia

  • You have an FP16 / BF16 workload that fits in memory. breccia trades precision for memory and bandwidth. If you don't need the trade, don't pay the cost.
  • You need autograd-tracked low-precision tensors that subclass torch.Tensor. breccia's ScaledTensor is a plain dataclass, not a Tensor subclass. Autograd state lives in data (which can be a torch.Tensor); the wrapper itself does not participate in autograd graphs.
  • You only target one vendor's hardware. If you're NVIDIA-only, TransformerEngine is more tuned today. If you're PyTorch-only, torchao ships native autograd integration. breccia is the substrate for when you need both (and AMD, Trainium, TPU…).

Reading further

  • api.md — exact signatures for every public function and class
  • recipes.md — when to use each of the 6 recipes
  • bridges.md — migration paths from TE / torchao / HF
  • architecture.md — how the package is laid out internally