API reference¶

Every public function and class in breccia, with signatures and behavior.

The library is intentionally small: one data class, four operations plus the from_buffer constructor, six recipes, four layouts, five bridges, and the reference and Triton kernel modules.

`breccia.ScaledTensor`¶

@dataclass(frozen=True)
class ScaledTensor:
    data: Any        # low-precision bytes
    scale: Any       # scale tensor; shape depends on layout
    recipe: Any      # ScalingRecipe instance
    layout: Any      # Layout instance

Construction¶

ScaledTensor is a frozen dataclass, so you can construct one directly:

import numpy as np
from breccia import ScaledTensor, Float8CurrentScaling
from breccia.layouts import PerTensor

st = ScaledTensor(
    data=np.zeros((4, 8), dtype=np.uint8),
    scale=np.float32(1.0),
    recipe=Float8CurrentScaling(),
    layout=PerTensor(),
)

…but in practice you go through breccia.cast or breccia.from_buffer.

Invariants (enforced in `__post_init__`)¶

Invariant	Raised
`data` has `.shape` and `.dtype`	`TypeError("data must be array-like ...")`
`scale` has `.shape` and `.dtype`	`TypeError("scale must be array-like ...")`
`data.ndim >= 1`	`ValueError("data must be at least 1-D ...")`
`recipe is not None`	`ValueError("recipe is required ...")`
`layout is not None`	`ValueError("layout is required ...")`
`layout.validate(data, scale)` succeeds	`ValueError` from the layout
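
The checks in the table can be pictured as a frozen dataclass whose __post_init__ raises on the first violated invariant. The sketch below is illustrative only: ScaledTensorSketch and PerTensorSketch are hypothetical stand-ins rather than breccia's classes, the check order simply follows the table, and len(shape) stands in for ndim.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


class PerTensorSketch:
    """Hypothetical stand-in layout: the scale must be a scalar (shape ())."""

    def validate(self, data: Any, scale: Any) -> None:
        if tuple(scale.shape) != ():
            raise ValueError(f"per-tensor scale must be scalar, got {tuple(scale.shape)}")


@dataclass(frozen=True)
class ScaledTensorSketch:
    data: Any
    scale: Any
    recipe: Any
    layout: Any

    def __post_init__(self) -> None:
        # Array-likeness first, then rank, then required fields, then the layout.
        for name in ("data", "scale"):
            value = getattr(self, name)
            if not (hasattr(value, "shape") and hasattr(value, "dtype")):
                raise TypeError(f"{name} must be array-like")
        if len(self.data.shape) < 1:
            raise ValueError("data must be at least 1-D")
        if self.recipe is None:
            raise ValueError("recipe is required")
        if self.layout is None:
            raise ValueError("layout is required")
        self.layout.validate(self.data, self.scale)


# Minimal array-like stand-ins: only .shape and .dtype are inspected here.
data = SimpleNamespace(shape=(4, 8), dtype="uint8")
scale = SimpleNamespace(shape=(), dtype="float32")
st = ScaledTensorSketch(data, scale, recipe="any-recipe", layout=PerTensorSketch())

# A missing recipe is rejected before the layout is consulted.
try:
    ScaledTensorSketch(data, scale, recipe=None, layout=PerTensorSketch())
except ValueError as exc:
    recipe_error = str(exc)

# A plain list has no .shape, so it fails the array-like check.
try:
    ScaledTensorSketch([1, 2, 3], scale, recipe="any-recipe", layout=PerTensorSketch())
except TypeError as exc:
    type_error = str(exc)
```

Because validation runs at construction time, a ScaledTensor that exists can be assumed to have consistent data and scale shapes.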

Properties¶

Property	Type	Returns
`shape`	`tuple`	`data.shape` as a tuple
`ndim`	`int`	`data.ndim`
`data_dtype`	dtype	`data.dtype`
`scale_dtype`	dtype	`scale.dtype`

Dunder methods¶

repr(st) → "breccia.ScaledTensor(shape=..., data_dtype=..., scale_shape=..., recipe=..., layout=...)"

Core operations¶

`breccia.cast(x, recipe) -> ScaledTensor`¶

Quantize a high-precision tensor.

Parameter	Type	Notes
`x`	NumPy / PyTorch / MLX array	Any backend; the result's `data` and `scale` match the input backend.
`recipe`	ScalingRecipe	One of the six recipes.

Block-scaled recipes (Float8BlockScaling, MXFP8BlockScaling, NVFP4BlockScaling, INT4Scaling) require x.ndim >= 2 and a last dimension divisible by the recipe's block size (block_k, block_size, or group_size, depending on the recipe).
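
The shape requirement for block-scaled recipes can be checked before any quantization work starts. The helper below is a hypothetical sketch, not part of breccia; it assumes one scale per block along the last axis, giving a scale shape of (..., K // block_size), which may not match the library's actual layout.

```python
def block_scale_shape(shape: tuple, block_size: int) -> tuple:
    """Return the assumed scale shape for one scale per block along the last axis."""
    if len(shape) < 2:
        raise ValueError(f"block-scaled recipes need ndim >= 2, got shape {shape}")
    k = shape[-1]
    if k % block_size != 0:
        raise ValueError(f"last dim {k} is not divisible by block size {block_size}")
    return tuple(shape[:-1]) + (k // block_size,)


# A (4, 256) tensor with 128-element blocks needs two scales per row.
ok_shape = block_scale_shape((4, 256), 128)

# 100 is not a multiple of 32, so an MX-style block size is rejected.
try:
    block_scale_shape((4, 100), 32)
except ValueError as exc:
    shape_error = str(exc)
```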

`breccia.dequantize(scaled) -> array`¶

Recover a high-precision tensor from a ScaledTensor. The output uses the same backend as scaled.data.

`breccia.matmul(a, b, out_dtype=np.float32) -> array`¶

Scaled matmul. Each operand can be a ScaledTensor or a raw array.

Parameter	Type
`a`	`ScaledTensor` or array of shape `(..., M, K)`
`b`	`ScaledTensor` or array of shape `(..., K, N)`
`out_dtype`	NumPy dtype for the output (default `np.float32`)

Returns an array of shape (..., M, N).

The reference implementation dequantizes both operands and runs an FP32 matmul. The Triton kernel (M17) fuses dequantization for FP8.
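
The dequantize-then-multiply path described above can be sketched in plain Python. This is a toy model under stated assumptions: integer codes stand in for FP8 values, each operand has a single per-tensor scale, and the helper names are hypothetical. It also shows the identity a fused kernel can exploit for per-tensor scales, namely that the scales factor out of the product; whether the Triton kernel is organized this way is not stated in this page.

```python
def dequantize_sketch(codes, scale):
    """Multiply every stored code by the per-tensor scale."""
    return [[c * scale for c in row] for row in codes]


def matmul_plain(a, b):
    """Naive (M, K) x (K, N) matmul on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_reference(a_codes, a_scale, b_codes, b_scale):
    """Reference path: dequantize both operands, then multiply."""
    return matmul_plain(dequantize_sketch(a_codes, a_scale),
                        dequantize_sketch(b_codes, b_scale))


def matmul_scales_factored(a_codes, a_scale, b_codes, b_scale):
    """Multiply the raw codes first, then apply both scales once."""
    raw = matmul_plain(a_codes, b_codes)
    return [[v * a_scale * b_scale for v in row] for row in raw]


a_codes, a_scale = [[1, 2], [3, 4]], 0.5
b_codes, b_scale = [[2, 0], [1, 2]], 0.25

reference = matmul_reference(a_codes, a_scale, b_codes, b_scale)
factored = matmul_scales_factored(a_codes, a_scale, b_codes, b_scale)
```

The scales here are powers of two, so both paths agree exactly; with arbitrary scales the two results would differ by floating-point rounding.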

`breccia.requantize(scaled, new_recipe) -> ScaledTensor`¶

Convert a ScaledTensor between recipes. Implemented in v0.0.1 as cast(dequantize(scaled), new_recipe).

`breccia.from_buffer(data, scale, recipe, layout) -> ScaledTensor`¶

Zero-copy constructor for when you already have quantized buffers (e.g., from a checkpoint or vendor library). No reductions, no quantization — just wraps the buffers in a typed primitive.

Recipes (`breccia.*`)¶

All recipes are frozen dataclasses, hashable, and JSON-serializable.
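
These three properties follow directly from the frozen-dataclass design. The sketch below uses a hypothetical BlockRecipeSketch class with the same fields as Float8BlockScaling to show what each property allows; the JSON shape (a "type" key plus the fields) is an assumption, not breccia's serialization format.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class BlockRecipeSketch:
    fp8_format: str = "E4M3"
    block_k: int = 128


recipe = BlockRecipeSketch()

# Frozen: attribute assignment raises instead of mutating the recipe.
try:
    recipe.block_k = 64
    is_frozen = False
except FrozenInstanceError:
    is_frozen = True

# Hashable: a recipe can key a cache, e.g. of compiled kernels.
cache = {recipe: "compiled-kernel"}

# JSON-serializable: plain str/int fields round-trip through json.
payload = json.dumps({"type": type(recipe).__name__, **asdict(recipe)})
fields = {k: v for k, v in json.loads(payload).items() if k != "type"}
restored = BlockRecipeSketch(**fields)
```

Because equality and hashing are field-based, the restored recipe finds the same cache entry as the original.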

`DelayedScaling(fp8_format="E4M3", amax_history_len=16, margin=0)`¶

TE-style delayed scaling. See recipes.md → DelayedScaling.
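
As a rough illustration of what delayed scaling means, the sketch below derives each step's scale from the amax values recorded in earlier steps. It follows Transformer Engine's convention (scale = fp8_max / amax / 2**margin, with the maximum taken over the history); the class name and the choice of 1.0 before any history exists are assumptions, and breccia's own bookkeeping may differ.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 value


class DelayedScaleSketch:
    """Hypothetical amax-history tracker, not breccia's DelayedScaling."""

    def __init__(self, amax_history_len=16, margin=0, fmt_max=E4M3_MAX):
        self.history = deque(maxlen=amax_history_len)  # oldest entries fall off
        self.margin = margin
        self.fmt_max = fmt_max

    def scale_for_step(self):
        """Scale based only on previously recorded steps."""
        amax = max(self.history, default=0.0)
        if amax == 0.0:
            return 1.0
        return self.fmt_max / amax / (2 ** self.margin)

    def record(self, values):
        """Store this step's amax for use by later steps."""
        self.history.append(max(abs(v) for v in values))


state = DelayedScaleSketch(amax_history_len=2)
first = state.scale_for_step()       # no history yet
state.record([0.5, -7.0])            # amax 7.0
second = state.scale_for_step()      # 448 / 7
state.record([1.0])
state.record([2.0])                  # history is now [1.0, 2.0]; 7.0 was evicted
third = state.scale_for_step()       # 448 / 2
```

The scale lags the data by at least one step, which is what avoids a reduction over the current tensor before casting.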

`Float8CurrentScaling(fp8_format="E4M3")`¶

Per-tensor amax computed each step. See recipes.md → Float8CurrentScaling.
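
For contrast with delayed scaling, a current-scaling step computes amax from the tensor being cast. A minimal sketch, assuming the multiply-to-quantize convention (scale = fp8_max / amax) and a hypothetical helper name; whether breccia stores this value or its inverse in ScaledTensor.scale is not specified here.

```python
E4M3_MAX = 448.0  # largest finite E4M3 value


def current_scale_sketch(values, fmt_max=E4M3_MAX):
    """Per-tensor scale that maps the largest magnitude onto fmt_max."""
    amax = max(abs(v) for v in values)
    return fmt_max / amax if amax > 0 else 1.0


values = [0.5, -3.5, 2.0]
scale = current_scale_sketch(values)      # 448 / 3.5
scaled = [v * scale for v in values]      # now spans the full E4M3 range
```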

`Float8BlockScaling(fp8_format="E4M3", block_k=128)`¶

Per-K-block FP8 scaling. See recipes.md → Float8BlockScaling.

`MXFP8BlockScaling(fp8_format="E4M3", block_size=32)`¶

OCP MX microscaling FP8. block_size is fixed at 32 by the spec. See recipes.md → MXFP8BlockScaling.
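
In the OCP MX formats each 32-element block shares one power-of-two (E8M0) scale. The sketch below computes that scale as 2**(floor(log2(amax)) - emax), with emax = 8 for E4M3, which is my reading of the spec's suggested conversion; the helper name and the handling of all-zero blocks are assumptions, and breccia's cast may differ in detail.

```python
import math

E4M3_EMAX = 8   # 448 = 1.75 * 2**8, so the largest E4M3 exponent is 8
MX_BLOCK = 32   # block size fixed by the MX spec


def mx_block_scales_sketch(row):
    """Return one power-of-two scale per 32-element block of a 1-D row."""
    if len(row) % MX_BLOCK != 0:
        raise ValueError(f"row length {len(row)} is not a multiple of {MX_BLOCK}")
    scales = []
    for start in range(0, len(row), MX_BLOCK):
        amax = max(abs(v) for v in row[start:start + MX_BLOCK])
        if amax == 0.0:
            scales.append(1.0)  # assumption: any scale works for an all-zero block
            continue
        # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
        exponent = math.frexp(amax)[1] - 1 - E4M3_EMAX
        exponent = max(-127, min(127, exponent))  # stay within the E8M0 range
        scales.append(2.0 ** exponent)
    return scales


row = [6.0] + [0.0] * 31 + [0.75] * 32
scales = mx_block_scales_sketch(row)
```

Dividing each block by its scale brings the block maximum to 384 in both cases here. Because the scale is rounded down to a power of two, a scaled value can still exceed 448 for some inputs and then has to saturate.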

`NVFP4BlockScaling(fp4_format="E2M1", block_size=16, scale_format="E4M3")`¶

NVIDIA Blackwell NVFP4. All three fields are fixed by the hardware spec. See recipes.md → NVFP4BlockScaling.

`INT4Scaling(group_size=128, signed=True, scale_dtype="fp16")`¶

GPTQ / AWQ family INT4. See recipes.md → INT4Scaling.

Layouts (`breccia.*`)¶

All layouts implement .validate(data, scale) -> None, called from ScaledTensor.__post_init__.

`PerTensor()`¶

Single scalar scale. Used by DelayedScaling and Float8CurrentScaling.

`PerBlockK(block_size=128)`¶

One scale per block_size-element block along the last axis. Used by Float8BlockScaling and INT4Scaling.

`PerChannel()`¶

One scale per output row. Accepts scale.shape == (M,) or scale.shape == (..., M, 1). Used by INT4 row-wise quantization.
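
The two accepted scale shapes can be expressed as a small shape check. This is a hypothetical sketch, not breccia's PerChannel.validate: it assumes M is the second-to-last dimension of data and that the leading dimensions of a (..., M, 1) scale must match those of data.

```python
def validate_per_channel_sketch(data_shape, scale_shape):
    """Raise ValueError unless scale_shape is (M,) or (..., M, 1)."""
    data_shape, scale_shape = tuple(data_shape), tuple(scale_shape)
    if len(data_shape) < 2:
        raise ValueError(f"per-channel data needs ndim >= 2, got {data_shape}")
    m = data_shape[-2]
    accepted = {(m,), data_shape[:-2] + (m, 1)}
    if scale_shape not in accepted:
        raise ValueError(f"scale shape {scale_shape} does not match M={m}")


validate_per_channel_sketch((4, 8), (4,))          # flat, one scale per row
validate_per_channel_sketch((2, 4, 8), (2, 4, 1))  # broadcastable against data

# A scale sized to the columns (K = 8) instead of the rows is rejected.
try:
    validate_per_channel_sketch((4, 8), (8,))
    rejected = False
except ValueError:
    rejected = True
```

The (..., M, 1) form broadcasts directly against data in an elementwise multiply, while the (M,) form needs a reshape first.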

`PerBlockMN(block_m=1, block_n=32)`¶

2-D grid of scales. Used by MXFP8BlockScaling (1, 32) and NVFP4BlockScaling (1, 16).

Bridges (`breccia.bridges`)¶

See bridges.md for full docs and per-bridge constraints.

Function	Direction
`from_transformer_engine(te_t, recipe=None)`	TE → breccia
`to_transformer_engine(scaled)`	breccia → TE
`from_torchao(aqt, recipe=None)`	torchao → breccia
`to_torchao(scaled)`	breccia → torchao
`save_safetensors(tensors, path, extra_metadata=None)`	breccia → safetensors file
`load_safetensors(path)`	safetensors file → breccia
`to_dlpack(scaled)`	breccia → `(data_capsule, scale_capsule)`
`from_dlpack(scaled, framework)`	move buffers to a new framework
`from_deepseek_v3(data, scale, block_k=128, fp8_format="E4M3")`	DeepSeek-v3 → breccia
`to_deepseek_v3(scaled)`	breccia → `(data, scale)`

Kernels¶

`breccia.kernels.reference`¶

Function	What
`cast(x, recipe)`	reference (NumPy) quantize per recipe
`dequantize(scaled)`	reference dequantize
`requantize(scaled, recipe)`	`cast(dequantize(scaled), recipe)`
`matmul(a, b, out_dtype=np.float32)`	reference scaled matmul

These are also exposed at the package root: breccia.cast, etc.

`breccia.kernels.triton`¶

Import-gated. On platforms without Triton (macOS, CPU-only), this module sets TRITON_AVAILABLE = False and exports no kernel functions.

Function	What
`scaled_matmul_triton(a, b)`	FP8 scaled matmul on Hopper / Ada (M17; GPU validation pending)

See kernels.md for kernel design and validation.

`breccia.version`¶

The version string. v0.0.x is pre-alpha. The public API stabilizes at v0.1.

What's NOT in the public API¶

These exist in breccia._core but are not exported:

_is_torch(x), _is_mlx(x), _is_jax(x) — backend dispatch predicates

Anything starting with _ in any module is private and subject to change between any two commits in the v0.0 line.
