API reference

Every public function and class in breccia, with signatures and behavior.

The library is intentionally small: one data class, four operations plus the from_buffer constructor, six recipes, four layouts, five bridges, and the reference and Triton kernel modules.


breccia.ScaledTensor

@dataclass(frozen=True)
class ScaledTensor:
    data: Any        # low-precision bytes
    scale: Any       # scale tensor; shape depends on layout
    recipe: Any      # ScalingRecipe instance
    layout: Any      # Layout instance

Construction

ScaledTensor is a frozen dataclass (@dataclass(frozen=True)), so its fields cannot be reassigned after construction. You can construct one directly:

import numpy as np
from breccia import ScaledTensor, Float8CurrentScaling
from breccia.layouts import PerTensor

st = ScaledTensor(
    data=np.zeros((4, 8), dtype=np.uint8),
    scale=np.float32(1.0),
    recipe=Float8CurrentScaling(),
    layout=PerTensor(),
)

…but in practice you go through breccia.cast or breccia.from_buffer.

Invariants (enforced in __post_init__)

| Invariant | Raised |
|---|---|
| data has .shape and .dtype | TypeError("data must be array-like ...") |
| scale has .shape and .dtype | TypeError("scale must be array-like ...") |
| data.ndim >= 1 | ValueError("data must be at least 1-D ...") |
| recipe is not None | ValueError("recipe is required ...") |
| layout is not None | ValueError("layout is required ...") |
| layout.validate(data, scale) succeeds | ValueError from the layout |
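
The checks in the table can be pictured with a small standalone sketch. It does not import breccia: SketchScaledTensor, AnyLayout, and error_for are hypothetical names, and the check order and message wording are assumptions modeled on the table rather than the library's source.

```python
from dataclasses import dataclass
from typing import Any

import numpy as np


class AnyLayout:
    """Hypothetical layout that accepts every (data, scale) pair."""

    def validate(self, data: Any, scale: Any) -> None:
        return None


@dataclass(frozen=True)
class SketchScaledTensor:
    data: Any
    scale: Any
    recipe: Any
    layout: Any

    def __post_init__(self) -> None:
        # Assumed order: array-likeness first, then rank, then required fields.
        for name in ("data", "scale"):
            value = getattr(self, name)
            if not (hasattr(value, "shape") and hasattr(value, "dtype")):
                raise TypeError(f"{name} must be array-like (needs .shape and .dtype)")
        if self.data.ndim < 1:
            raise ValueError("data must be at least 1-D")
        if self.recipe is None:
            raise ValueError("recipe is required")
        if self.layout is None:
            raise ValueError("layout is required")
        # The layout raises ValueError itself when data and scale disagree.
        self.layout.validate(self.data, self.scale)


def error_for(**overrides: Any) -> str:
    """Return the exception class name raised for a set of field overrides."""
    fields = dict(
        data=np.zeros((4, 8), dtype=np.uint8),
        scale=np.float32(1.0),
        recipe="any-recipe",
        layout=AnyLayout(),
    )
    fields.update(overrides)
    try:
        SketchScaledTensor(**fields)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return "ok"
```

Calling error_for(data=[1, 2, 3]) reports a TypeError because a plain list has no .shape, while a 0-D array or a missing recipe reports a ValueError.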

Properties

| Property | Type | Returns |
|---|---|---|
| shape | tuple | data.shape as a tuple |
| ndim | int | data.ndim |
| data_dtype | dtype | data.dtype |
| scale_dtype | dtype | scale.dtype |

Dunder methods

  • repr(st) returns "breccia.ScaledTensor(shape=..., data_dtype=..., scale_shape=..., recipe=..., layout=...)"

Core operations

breccia.cast(x, recipe) -> ScaledTensor

Quantize a high-precision tensor.

| Parameter | Type | Notes |
|---|---|---|
| x | NumPy / PyTorch / MLX array | Any backend; the result's data and scale match the input backend. |
| recipe | ScalingRecipe | One of the 6 recipes. |

Block-scaled recipes (Float8BlockScaling, MXFP8BlockScaling, NVFP4BlockScaling, INT4Scaling) require x.ndim >= 2 and the last dim divisible by the block size.
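
To see why the rank and divisibility requirements exist, here is a NumPy-only sketch of per-block scaling along the last axis. It is not breccia's kernel: block_cast and block_dequantize are hypothetical names, int8 stands in for the FP8 payload (NumPy has no FP8 dtype), and the scale shape (..., K // block_k) is an assumed convention.

```python
import numpy as np

INT8_MAX = 127.0  # assumption: int8 stands in for the low-precision payload


def block_cast(x: np.ndarray, block_k: int = 128):
    """Quantize x with one scale per block_k elements of the last axis."""
    if x.ndim < 2:
        raise ValueError("block scaling needs x.ndim >= 2")
    k = x.shape[-1]
    if k % block_k != 0:
        raise ValueError(f"last dim {k} is not divisible by block size {block_k}")
    # View the last axis as (num_blocks, block_k) so each block gets its own amax.
    blocks = x.astype(np.float32).reshape(*x.shape[:-1], k // block_k, block_k)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    # All-zero blocks get scale 1.0 to avoid dividing by zero.
    scale = np.where(amax > 0, amax / INT8_MAX, 1.0).astype(np.float32)
    data = np.clip(np.rint(blocks / scale), -INT8_MAX, INT8_MAX).astype(np.int8)
    return data.reshape(x.shape), scale[..., 0]


def block_dequantize(data: np.ndarray, scale: np.ndarray, block_k: int = 128):
    """Invert block_cast up to rounding error."""
    k = data.shape[-1]
    blocks = data.astype(np.float32).reshape(*data.shape[:-1], k // block_k, block_k)
    return (blocks * scale[..., None]).reshape(data.shape)


x = np.linspace(-3.0, 3.0, 4 * 256, dtype=np.float32).reshape(4, 256)
data, scale = block_cast(x, block_k=128)
x_hat = block_dequantize(data, scale, block_k=128)
```

The reshape into (num_blocks, block_k) is what fails when the last dimension is not a multiple of the block size, which is why the sketch checks divisibility before doing any work.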

breccia.dequantize(scaled) -> array

Recover a high-precision tensor from a ScaledTensor. The output backend matches the input (ScaledTensor.data's backend).

breccia.matmul(a, b, out_dtype=np.float32) -> array

Scaled matmul. Each operand can be a ScaledTensor or a raw array.

| Parameter | Type |
|---|---|
| a | ScaledTensor or array of shape (..., M, K) |
| b | ScaledTensor or array of shape (..., K, N) |
| out_dtype | NumPy dtype for the output (default np.float32) |

Returns an array of shape (..., M, N).

The reference implementation dequantizes both operands and runs an FP32 matmul. The Triton kernel (M17) fuses dequantization for FP8.
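
The reference path described above (dequantize, then FP32 matmul) can be sketched in a few lines of NumPy. This is an illustration, not the library's code: reference_matmul and as_float32 are hypothetical names, and a quantized operand is modeled as a plain (data, scale) tuple with a per-tensor scalar scale instead of a ScaledTensor.

```python
import numpy as np


def as_float32(operand):
    """Raw arrays pass through; (data, scale) pairs are dequantized per tensor."""
    if isinstance(operand, tuple):
        data, scale = operand
        return data.astype(np.float32) * np.float32(scale)
    return np.asarray(operand, dtype=np.float32)


def reference_matmul(a, b, out_dtype=np.float32):
    """Dequantize both operands, multiply in FP32, then cast the result."""
    return np.matmul(as_float32(a), as_float32(b)).astype(out_dtype)


# a is "quantized" (int8 payload, scalar scale); b is a raw FP32 array.
a = (np.array([[2, -4], [6, 8]], dtype=np.int8), 0.5)
b = np.array([[1.0, 0.0], [0.0, 2.0]], dtype=np.float32)
out = reference_matmul(a, b)
```

np.matmul broadcasts leading dimensions, so the same sketch covers the batched (..., M, K) by (..., K, N) case.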

breccia.requantize(scaled, new_recipe) -> ScaledTensor

Convert a ScaledTensor between recipes. Implemented in v0.0.1 as cast(dequantize(scaled), new_recipe).
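
One consequence of the cast(dequantize(...)) composition is that rounding error from both quantization steps adds up. The sketch below shows this with per-tensor integer quantization; cast, dequantize, and requantize here are local stand-ins that take a qmax integer instead of a recipe object, not breccia's functions.

```python
import numpy as np


def cast(x, qmax):
    """Per-tensor symmetric quantization onto integers in [-qmax, qmax]."""
    amax = float(np.abs(x).max())
    scale = amax / qmax if amax > 0 else 1.0
    data = np.clip(np.rint(x / scale), -qmax, qmax).astype(np.int8)
    return data, np.float32(scale)


def dequantize(data, scale):
    """Map integer codes back to FP32 values."""
    return data.astype(np.float32) * scale


def requantize(data, scale, new_qmax):
    """Round trip through FP32, mirroring cast(dequantize(scaled), new_recipe)."""
    return cast(dequantize(data, scale), new_qmax)


x = np.array([-1.0, -0.25, 0.5, 1.0], dtype=np.float32)
data8, scale8 = cast(x, qmax=127)             # 8-bit stand-in
data4, scale4 = requantize(data8, scale8, 7)  # 4-bit stand-in

# Worst-case error after requantizing is bounded by half a step of each grid.
error_bound = float(scale8) / 2 + float(scale4) / 2
```

Requantizing from the original high-precision tensor, when it is still available, avoids the first term of that bound.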

breccia.from_buffer(data, scale, recipe, layout) -> ScaledTensor

Zero-copy constructor for when you already have quantized buffers (e.g., from a checkpoint or vendor library). No reductions, no quantization — just wraps the buffers in a typed primitive.
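
"Zero-copy" means the wrapper holds references to the caller's buffers. The following sketch demonstrates that property with a hypothetical WrappedBuffers dataclass and wrap_buffers helper; it does not call breccia.from_buffer, and it omits the recipe and layout arguments for brevity.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class WrappedBuffers:
    """Hypothetical stand-in for a ScaledTensor: holds references, nothing more."""

    data: np.ndarray
    scale: np.ndarray


def wrap_buffers(data: np.ndarray, scale: np.ndarray) -> WrappedBuffers:
    # No astype(), no copy(), no amax reduction: the arrays are stored as given.
    return WrappedBuffers(data=data, scale=scale)


payload = np.arange(32, dtype=np.uint8).reshape(4, 8)  # e.g. bytes read from a checkpoint
scales = np.ones((4, 1), dtype=np.float32)
wrapped = wrap_buffers(payload, scales)

# Because nothing was copied, a write to the source buffer is visible through the wrapper.
payload[0, 0] = 255
```

The flip side of sharing memory is that frozen=True only prevents reassigning the fields; it does not stop in-place writes to the underlying buffers.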


Recipes (breccia.*)

All recipes are frozen dataclasses, hashable, and JSON-serializable.
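
The three properties follow naturally from frozen dataclasses whose fields are plain strings and integers. The sketch below uses only the standard library; SketchBlockRecipe is a hypothetical class with the same fields as Float8BlockScaling, and the JSON layout (a "type" key plus the fields) is an assumption, since this page does not specify how breccia serializes recipes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SketchBlockRecipe:
    """Hypothetical recipe with the same fields as Float8BlockScaling."""

    fp8_format: str = "E4M3"
    block_k: int = 128


recipe = SketchBlockRecipe()

# Frozen dataclasses with eq=True (the default) get a field-based __hash__,
# so equal recipes collapse to one dictionary key.
cache = {recipe: "compiled-kernel"}
same = SketchBlockRecipe(fp8_format="E4M3", block_k=128)

# All fields are plain str/int, so asdict() output goes straight through json.
payload = json.dumps({"type": type(recipe).__name__, **asdict(recipe)}, sort_keys=True)
restored_fields = {k: v for k, v in json.loads(payload).items() if k != "type"}
restored = SketchBlockRecipe(**restored_fields)
```

Hashability is what makes a recipe usable as a cache key, for example when memoizing per-recipe lookup tables.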

DelayedScaling(fp8_format="E4M3", amax_history_len=16, margin=0)

TE-style delayed scaling. See recipes.md → DelayedScaling.

Float8CurrentScaling(fp8_format="E4M3")

Per-tensor amax computed each step. See recipes.md → Float8CurrentScaling.

Float8BlockScaling(fp8_format="E4M3", block_k=128)

Per-K-block FP8 scaling. See recipes.md → Float8BlockScaling.

MXFP8BlockScaling(fp8_format="E4M3", block_size=32)

OCP MX microscaling FP8. block_size is fixed at 32 by the spec. See recipes.md → MXFP8BlockScaling.

NVFP4BlockScaling(fp4_format="E2M1", block_size=16, scale_format="E4M3")

NVIDIA Blackwell NVFP4. All three fields are fixed by the hardware spec. See recipes.md → NVFP4BlockScaling.

INT4Scaling(group_size=128, signed=True, scale_dtype="fp16")

GPTQ / AWQ family INT4. See recipes.md → INT4Scaling.


Layouts (breccia.*)

All layouts implement .validate(data, scale) -> None, called from ScaledTensor.__post_init__.

PerTensor()

Single scalar scale. Used by DelayedScaling and Float8CurrentScaling.

PerBlockK(block_size=128)

One scale per block_size-element block along the last axis. Used by Float8BlockScaling and INT4Scaling.
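
A layout's validate method is essentially a shape check between data and scale. Here is a minimal sketch for the per-K-block case. SketchPerBlockK is a hypothetical class, and the expected scale shape (leading dims unchanged, last dim equal to the number of blocks) is an assumption; see the layout docs for the shape breccia actually requires.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class SketchPerBlockK:
    block_size: int = 128

    def validate(self, data, scale) -> None:
        k = data.shape[-1]
        if k % self.block_size != 0:
            raise ValueError(f"last dim {k} is not divisible by {self.block_size}")
        # Assumed convention: leading dims unchanged, last dim counts blocks.
        expected = tuple(data.shape[:-1]) + (k // self.block_size,)
        if tuple(scale.shape) != expected:
            raise ValueError(f"scale shape {tuple(scale.shape)} != expected {expected}")


layout = SketchPerBlockK(block_size=128)
data = np.zeros((4, 256), dtype=np.uint8)
layout.validate(data, np.ones((4, 2), dtype=np.float32))  # passes silently
```

Returning None on success and raising ValueError on mismatch matches the .validate(data, scale) -> None contract stated above.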

PerChannel()

One scale per output row. Accepts scale.shape == (M,) or scale.shape == (..., M, 1). Used by INT4 row-wise quantization.

PerBlockMN(block_m=1, block_n=32)

2-D grid of scales. Used by MXFP8BlockScaling (1, 32) and NVFP4BlockScaling (1, 16).


Bridges (breccia.bridges)

See bridges.md for full docs and per-bridge constraints.

| Function | Direction |
|---|---|
| from_transformer_engine(te_t, recipe=None) | TE → breccia |
| to_transformer_engine(scaled) | breccia → TE |
| from_torchao(aqt, recipe=None) | torchao → breccia |
| to_torchao(scaled) | breccia → torchao |
| save_safetensors(tensors, path, extra_metadata=None) | breccia → safetensors file |
| load_safetensors(path) | safetensors file → breccia |
| to_dlpack(scaled) | breccia → (data_capsule, scale_capsule) |
| from_dlpack(scaled, framework) | move buffers to a new framework |
| from_deepseek_v3(data, scale, block_k=128, fp8_format="E4M3") | DeepSeek-v3 → breccia |
| to_deepseek_v3(scaled) | breccia → (data, scale) |

Kernels

breccia.kernels.reference

| Function | What |
|---|---|
| cast(x, recipe) | reference (NumPy) quantize per recipe |
| dequantize(scaled) | reference dequantize |
| requantize(scaled, recipe) | cast(dequantize(scaled), recipe) |
| matmul(a, b, out_dtype=np.float32) | reference scaled matmul |

These are also exposed at the package root: breccia.cast, etc.

breccia.kernels.triton

Import-gated. On platforms without Triton (macOS, CPU-only), this module sets TRITON_AVAILABLE = False and exports no kernel functions.
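
Import gating of this kind is usually a guarded availability check at module load. The sketch below shows one common pattern using only the standard library. It is not the module's actual source: pick_matmul_backend is a hypothetical helper, and the sketch only checks whether the triton package is installed, not whether a supported GPU is present.

```python
import importlib.util

# find_spec returns None when the package is not installed, without importing it.
TRITON_AVAILABLE = importlib.util.find_spec("triton") is not None


def pick_matmul_backend() -> str:
    """Name the backend a caller would dispatch to."""
    return "triton" if TRITON_AVAILABLE else "reference"
```

Callers can branch on the flag once at startup instead of wrapping every kernel call in try/except ImportError.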

| Function | What |
|---|---|
| scaled_matmul_triton(a, b) | FP8 scaled matmul on Hopper / Ada (M17; GPU validation pending) |

See kernels.md for kernel design and validation.


breccia.__version__

The version string. v0.0.x is pre-alpha. The public API stabilizes at v0.1.


What's NOT in the public API

These exist in breccia._core but are not exported:

  • _is_torch(x), _is_mlx(x), _is_jax(x) — backend dispatch predicates

Anything starting with _ in any module is private and subject to change between any two commits in the v0.0 line.