breccia

Block-scaled tensors as a first-class type. The Triton FP8 scaled matmul is validated on H100 (cosine similarity 0.9993 against an FP32 reference). Works on NumPy, PyTorch, MLX, and JAX.

import numpy as np
import breccia

# Quantize to FP8 with per-block-K scaling (DeepSeek-v3 style)
x = np.random.randn(8, 256).astype(np.float32)
st = breccia.cast(x, breccia.Float8BlockScaling(block_k=128))

# Scaled matmul: data stays in FP8, scales fold into the FP32 accumulator
A = breccia.cast(np.random.randn(16, 256).astype(np.float32), breccia.Float8CurrentScaling())
W = breccia.cast(np.random.randn(256, 128).astype(np.float32), breccia.Float8BlockScaling(block_k=128))
y = breccia.matmul(A, W)
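
To make the comment "scales fold into the FP32 accumulator" concrete, here is a minimal NumPy-only sketch of the arithmetic. It is not breccia's implementation: the helper names (`fake_e4m3`, `quantize_per_tensor`, `quantize_block_k`, `scaled_matmul`) are hypothetical, FP8 E4M3 is only emulated in float32 (clamp to ±448, keep 4 significant bits, ignore subnormals), and the scale layout (one scale per K-block and output column for the weight, one scale for the whole activation) is an assumption about what the two recipes above mean.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def fake_e4m3(v):
    """Emulate E4M3 rounding in float32: clamp, then keep 4 significant bits.
    Simplification: subnormals and NaN encodings are ignored."""
    v = np.clip(v, -E4M3_MAX, E4M3_MAX).astype(np.float32)
    mant, exp = np.frexp(v)  # v == mant * 2**exp with 0.5 <= |mant| < 1
    return np.ldexp(np.round(mant * 16) / 16, exp)

def quantize_per_tensor(a):
    """Assumed 'current scaling': one scale from the tensor's current amax."""
    scale = np.maximum(np.abs(a).max(), 1e-12) / E4M3_MAX
    return fake_e4m3(a / scale), scale

def quantize_block_k(w, block_k):
    """Assumed block scaling for w of shape (K, N): one scale per (K-block, column)."""
    k, n = w.shape
    blocks = w.reshape(k // block_k, block_k, n)
    amax = np.abs(blocks).max(axis=1, keepdims=True)  # (K/block_k, 1, N)
    scale = np.maximum(amax, 1e-12) / E4M3_MAX
    return fake_e4m3(blocks / scale), scale

def scaled_matmul(a_q, a_scale, w_q, w_scale):
    """Multiply quantized K-blocks, then apply both scales to each partial
    sum while accumulating in float32."""
    n_blocks, block_k, n = w_q.shape
    acc = np.zeros((a_q.shape[0], n), dtype=np.float32)
    for kb in range(n_blocks):
        a_blk = a_q[:, kb * block_k:(kb + 1) * block_k]
        acc += (a_blk @ w_q[kb]) * (a_scale * w_scale[kb])  # w_scale[kb]: (1, N)
    return acc

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 256)).astype(np.float32)
w = rng.standard_normal((256, 128)).astype(np.float32)

a_q, a_scale = quantize_per_tensor(a)
w_q, w_scale = quantize_block_k(w, block_k=128)
y = scaled_matmul(a_q, a_scale, w_q, w_scale)

# Compare against the unquantized FP32 product.
ref = a @ w
cos = float(np.sum(y * ref) / (np.linalg.norm(y) * np.linalg.norm(ref)))
print(f"cosine similarity vs FP32: {cos:.4f}")
```

The quantized values are never dequantized before the multiply. Each K-block's partial product is rescaled once as it is added to the accumulator, which is the pattern a block-scaled GEMM kernel follows.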


At a glance

0.9993 cosine similarity vs FP32 (Triton on H100)
4× memory savings vs FP32 (FP8 / FP4 / INT4)
6 recipes covering today's fragmentation
4 backends: NumPy, PyTorch, MLX, JAX
5 bridges: TE, torchao, HF, DLPack, DeepSeek
250+ tests across all backends

Why breccia

Block-scaled low-precision formats are everywhere in modern ML, but every framework carries its own incompatible representation. breccia is the typed primitive that bridges them:

NVIDIA TransformerEngine — 4 non-composable recipe classes, NVIDIA-only → breccia.bridges.from_transformer_engine
PyTorch torchao — AffineQuantizedTensor, PyTorch-only → breccia.bridges.from_torchao
DeepSeek-v3 FP8 weights — private block-scaled format → breccia.bridges.from_deepseek_v3
HuggingFace safetensors — no native scale metadata → breccia.bridges.save_safetensors with recipe + layout preserved
AMD MI355 / Trainium2 / TPU v6 — incompatible scale semantics → one type, four backends today

The cross-vendor gap is widening through 2026–2027 as FP4 arrives. No single vendor can be the neutral substrate. breccia aims to be the "safetensors of low precision."

What you can do today

Workflow	Use case	Status
FP8 inference	Quantize + scaled matmul end-to-end	native `torch.float8_e4m3fn`
FP8 training	Forward + STE for gradient flow on PyTorch / JAX	`cast_ste` shipped
DeepSeek-v3 weight loading	Bit-exact `from_deepseek_v3` round-trip	v0.1
Asymmetric INT4 (GPTQ / AWQ)	`INT4Scaling(symmetric=False)` + `zero_point`	v0.1
NVFP4 / MXFP8 quantize	Hardware-spec-locked block sizes (16 / 32)	v0.1
Triton FP8 scaled matmul on Hopper / Ada / Blackwell	DeepSeek-pattern block-scaled GEMM	H100 validated
Cross-framework prototyping	Same `ScaledTensor` on NumPy / PyTorch / MLX / JAX	250+ tests verify
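
For the asymmetric INT4 row, the role of `zero_point` can be shown with a short NumPy sketch of the usual GPTQ/AWQ-style affine scheme. This is an illustration under stated assumptions, not breccia's `INT4Scaling` code: the function names and the `group_size` parameter are hypothetical, groups run along the last axis, and codes are stored one per `uint8` rather than packed two per byte.

```python
import numpy as np

def quantize_int4_asymmetric(x, group_size):
    """Per-group affine quantization to unsigned 4-bit codes in [0, 15].
    Hypothetical helper: one (scale, zero_point) pair per group."""
    groups = x.reshape(-1, group_size)
    # Force the range to include 0 so the zero point lands in [0, 15].
    lo = np.minimum(groups.min(axis=1, keepdims=True), 0.0)
    hi = np.maximum(groups.max(axis=1, keepdims=True), 0.0)
    scale = np.maximum(hi - lo, 1e-8) / 15.0
    zero_point = np.round(-lo / scale)  # the code that represents 0.0
    q = np.clip(np.round(groups / scale) + zero_point, 0, 15).astype(np.uint8)
    return q, scale, zero_point

def dequantize_int4(q, scale, zero_point, shape):
    """Invert the affine map: x is approximately (q - zero_point) * scale."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)

rng = np.random.default_rng(0)
# Shift the data so the distribution is not centered on zero.
x = (rng.standard_normal((8, 256)) + 1.0).astype(np.float32)

q, scale, zero_point = quantize_int4_asymmetric(x, group_size=128)
x_hat = dequantize_int4(q, scale, zero_point, x.shape)

max_err = float(np.abs(x - x_hat).max())
print(f"max abs error: {max_err:.4f} (half a step is {0.5 * float(scale.max()):.4f})")
```

A symmetric scheme would spend part of its 16 codes on a negative range that shifted data barely uses. The zero point moves the code grid so that all 16 levels cover the observed range.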

Examples

01 — Quickstart: cast + scaled matmul in 15 lines
02 — Recipe-portable training: train MXFP8, ship NVFP4 (same model code)
03 — Checkpoint with scale: safetensors round-trip preserving recipe + layout
04 — TE migration: bridge TransformerEngine Float8Tensor → ScaledTensor

The name

A breccia is a sedimentary rock made of broken angular fragments held together by a cementing matrix. The library has the same structure: low-precision data are the fragments, and the scale tensor is the matrix that gives them meaning.

It's the natural geological successor to scree: loose fragments (scree) become breccia when cemented together.

v0.1.1 on PyPI. Apache-2.0. Source on GitHub · FAQ · Discussions · Issues