Getting started

Install

pip install breccia                    # NumPy backend
pip install "breccia[torch]"           # + PyTorch
pip install "breccia[mlx]"             # + MLX (Apple Silicon, Metal)
pip install "breccia[torch,mlx]"       # + both
pip install "breccia[bridges]"         # + safetensors for HF bridge

breccia supports Python 3.10+. NumPy is the only required dependency. PyTorch, MLX, JAX, Triton, and safetensors are all optional and detected at runtime.
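
How breccia performs this detection is not shown on this page. A common pattern for optional backends, sketched here with only the standard library, is to probe for each package without importing it. The helper name `available_backends` and the package list are illustrative assumptions, not part of breccia's API.

```python
from importlib.util import find_spec

# Hypothetical list: import names of the optional packages mentioned above.
OPTIONAL_BACKENDS = ("torch", "mlx", "jax", "triton", "safetensors")

def available_backends(names=OPTIONAL_BACKENDS):
    """Map each top-level package name to True if it can be imported."""
    # find_spec returns None for a missing top-level package and does not
    # import the package itself, so the probe stays cheap.
    return {name: find_spec(name) is not None for name in names}

for name, present in available_backends().items():
    print(f"{name:12s} {'found' if present else 'not installed'}")
```

Probing like this keeps `import breccia` fast and lets a missing backend fail only when you actually ask for it.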

Your first breccia program

import numpy as np
import breccia

# Quantize a tensor to FP8 with per-tensor current scaling.
x = np.random.randn(8, 128).astype(np.float32)
st = breccia.cast(x, breccia.Float8CurrentScaling())

print(st)
# breccia.ScaledTensor(shape=(8, 128), data_dtype=uint8, scale_shape=(),
#                     recipe=Float8CurrentScaling, layout=PerTensor)

# Recover the high-precision tensor.
x_recovered = breccia.dequantize(st)
# FP8 error is relative (about 6% at most for E4M3), so compare with rtol.
print(np.allclose(x, x_recovered, rtol=0.07, atol=1e-3))  # expected: True

That's the core idea: one type (ScaledTensor), six recipes that say how the data was quantized, and four layouts that say how the scale maps to the data.
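
To make the "data plus scale" idea concrete, here is a NumPy-only sketch of per-tensor current scaling. It is not breccia's implementation: the helper names are hypothetical, the E4M3 rounding is simulated in float64 rather than stored as uint8, and the convention `x ≈ payload * scale` is an assumption.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in OCP FP8 E4M3

def round_to_e4m3(v):
    """Round values to the nearest E4M3-representable number (simulated)."""
    v = np.clip(np.asarray(v, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    # frexp gives |v| = m * 2**e with m in [0.5, 1), so floor(log2|v|) = e - 1.
    _, e = np.frexp(np.abs(v))
    exponent = np.maximum(e - 1, -6)       # -6 is the smallest normal exponent
    step = np.ldexp(1.0, exponent - 3)     # 3 mantissa bits: 8 steps per binade
    return np.round(v / step) * step

def quantize_per_tensor(x):
    """Return (payload, scale) such that x is approximately payload * scale."""
    x = np.asarray(x, dtype=np.float64)
    amax = float(np.max(np.abs(x)))
    # "Current" scaling: the scale comes from this tensor's own amax.
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return round_to_e4m3(x / scale), scale

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 128))
payload, scale = quantize_per_tensor(x)
x_hat = payload * scale                    # dequantize
print(f"scale = {scale:.5f}, max abs err = {np.max(np.abs(x - x_hat)):.4f}")
```

With a single scale, the largest element maps to 448 and every other element shares that scale, so the worst-case absolute error is amax / 28 (half of the 32-wide step in the top binade, times the scale).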

Three patterns you'll use constantly

1. Quantize for compute, dequantize when you need precision back

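import numpy as np
import breccia

# Placeholder inputs; substitute your own arrays.
weight = np.random.randn(256, 256).astype(np.float32)
activations = np.random.randn(8, 256).astype(np.float32)

# Block-scaled FP8 weights: better dynamic range than per-tensor scaling.
w_q = breccia.cast(weight, breccia.Float8BlockScaling(block_k=128))

# Use it in matmul (output is float32):
y = breccia.matmul(activations, w_q)
# Dequantize when you need a high-precision tensor again:
w_fp32 = breccia.dequantize(w_q)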
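
Why block scaling helps with dynamic range can be shown without breccia. The sketch below simulates E4M3 rounding in NumPy and compares one scale per tensor against one scale per 128-element block when part of the tensor holds large outliers. The helper names and the choice of 1×128 blocks along the last axis are assumptions for illustration; breccia's actual block layout may differ.

```python
import numpy as np

E4M3_MAX = 448.0

def round_to_e4m3(v):
    """Round values to the nearest E4M3-representable number (simulated)."""
    v = np.clip(np.asarray(v, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    _, e = np.frexp(np.abs(v))
    step = np.ldexp(1.0, np.maximum(e - 1, -6) - 3)
    return np.round(v / step) * step

def fake_quant(x, scale):
    """Quantize with the given scale(s), then dequantize straight away."""
    return round_to_e4m3(x / scale) * scale

def per_tensor(x):
    # One scale for everything (assumes x is not all zeros).
    return fake_quant(x, np.max(np.abs(x)) / E4M3_MAX)

def per_block(x, block_k=128):
    # One scale per (row, block of block_k columns).
    rows, cols = x.shape
    blocks = x.reshape(rows, cols // block_k, block_k)
    amax = np.max(np.abs(blocks), axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / E4M3_MAX, 1.0)
    return fake_quant(blocks, scale).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256))
w[:, :128] *= 1e5                  # the first block holds large outliers

small = np.s_[:, 128:]             # the ordinary-magnitude half
err_tensor = np.mean(np.abs(w - per_tensor(w))[small])
err_block = np.mean(np.abs(w - per_block(w))[small])
print(f"mean abs err on small values: per-tensor {err_tensor:.4f}, "
      f"per-block {err_block:.4f}")
```

With a single scale, the outliers push the ordinary values into E4M3's subnormal range, where most of their precision is lost. Per-block scales keep each block near the top of the representable range.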

2. Pick a recipe by use case

# Training with Transformer Engine (TE)-style amax history
breccia.DelayedScaling(fp8_format="E4M3")

# Inference with synchronous amax
breccia.Float8CurrentScaling(fp8_format="E4M3")

# DeepSeek-v3-style block-scaled weights
breccia.Float8BlockScaling(block_k=128)

# OCP MX microscaling (32-element blocks, E8M0 scale)
breccia.MXFP8BlockScaling()

# NVIDIA Blackwell NVFP4 (16-element blocks, FP4 + FP8 scale)
breccia.NVFP4BlockScaling()

# INT4 weight-only (GPTQ / AWQ family)
breccia.INT4Scaling(group_size=128)
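
As a rough picture of what group-wise INT4 means, the NumPy sketch below quantizes each group of 128 values with its own scale and packs two 4-bit codes per byte. It uses a symmetric scheme with no zero-points, which is a simplifying assumption: GPTQ and AWQ checkpoints usually carry zero-points as well, and breccia's storage layout is not shown here. All function names are hypothetical.

```python
import numpy as np

def quantize_int4_groups(w, group_size=128):
    """Symmetric INT4 with one float scale per group along the last axis."""
    rows, cols = w.shape
    groups = w.reshape(rows, cols // group_size, group_size)
    amax = np.max(np.abs(groups), axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / 7.0, 1.0)       # amax maps to code 7
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scale.squeeze(-1)

def pack_int4(q):
    """Pack pairs of signed 4-bit codes into one uint8, low nibble first."""
    nibbles = (q & 0x0F).astype(np.uint8)             # two's-complement nibble
    return nibbles[..., 0::2] | (nibbles[..., 1::2] << 4)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed codes as int8."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    q = np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)
    return np.where(q > 7, q - 16, q).astype(np.int8)  # sign-extend

rng = np.random.default_rng(0)
w = rng.standard_normal((2, 256))
q, scale = quantize_int4_groups(w, group_size=128)
packed = pack_int4(q)
print(q.shape, scale.shape, packed.shape, packed.dtype)
```

Smaller groups track local magnitude more closely at the cost of storing more scales.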

See recipes.md for the full guide.
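
The difference between the first two recipes is where the amax comes from. The sketch below contrasts them in plain NumPy: a delayed scale is derived from amax values recorded on earlier steps, while a current scale reads the tensor being quantized. The class and function names, the history length, and the max-over-history rule are assumptions chosen for illustration, not breccia's internals.

```python
from collections import deque
import numpy as np

E4M3_MAX = 448.0

class DelayedScale:
    """Derive the FP8 scale from amax values seen on earlier steps."""

    def __init__(self, history_len=16, initial_scale=1.0):
        self.history = deque(maxlen=history_len)   # oldest entries fall off
        self.initial_scale = initial_scale

    def scale(self):
        # Uses only past statistics, so it is known before the tensor is read.
        amax = max(self.history, default=0.0)
        return amax / E4M3_MAX if amax > 0 else self.initial_scale

    def update(self, x):
        # Record this step's amax for use by later steps.
        self.history.append(float(np.max(np.abs(x))))

def current_scale(x):
    """Synchronous scaling: compute amax from the tensor itself."""
    return float(np.max(np.abs(x))) / E4M3_MAX

rng = np.random.default_rng(0)
delayed = DelayedScale(history_len=4)
for step in range(6):
    x = rng.standard_normal(1024) * (1.0 + step)   # magnitudes grow over time
    s_delayed, s_current = delayed.scale(), current_scale(x)
    delayed.update(x)
    print(f"step {step}: delayed={s_delayed:.5f} current={s_current:.5f}")
```

The delayed scale lags when magnitudes change quickly, but it avoids an extra pass over the tensor before quantizing. That trade-off is why delayed scaling is usually associated with training loops.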

3. Bridge to/from your existing pipeline

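import breccia.bridges as bridges

# data, scale, st2, st_numpy, and te_float8tensor below are placeholders
# for objects from your own pipeline.
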
# DeepSeek-v3 style: raw (data, scale) buffers → ScaledTensor
st = bridges.from_deepseek_v3(data, scale, block_k=128)

# Save to safetensors with scale metadata
bridges.save_safetensors({"w0": st, "w1": st2}, "weights.safetensors")
loaded = bridges.load_safetensors("weights.safetensors")

# Zero-copy cross-framework via DLPack
st_torch = bridges.from_dlpack(st_numpy, framework="torch")

# TransformerEngine round-trip (Linux + CUDA)
st = bridges.from_transformer_engine(te_float8tensor)
te_t = bridges.to_transformer_engine(st)

See bridges.md for every bridge and its constraints.
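
If "zero-copy via DLPack" is unfamiliar, the mechanism can be seen with NumPy alone (version 1.22 or newer). This sketch does not use breccia's bridge; it only shows that a DLPack import shares memory with the producer rather than copying it.

```python
import numpy as np

src = np.arange(6, dtype=np.float32)

# np.from_dlpack consumes any object that implements __dlpack__,
# which includes NumPy arrays themselves.
view = np.from_dlpack(src)

src[0] = 42.0                        # write through the original array
print(view[0])                       # the write is visible through the view
print(np.shares_memory(src, view))   # both names refer to one buffer
```

`torch.from_dlpack` plays the same role on the PyTorch side. Because no copy is made, writes on either side are visible on the other.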

Five-line round-trip demo

import numpy as np, breccia
x = np.random.randn(4, 256).astype(np.float32)
st = breccia.cast(x, breccia.Float8BlockScaling(block_k=128))
y = breccia.dequantize(st)
print(f"max abs err: {np.max(np.abs(x - y)):.4f}")
# e.g. max abs err: 0.0421   (≈ E4M3 precision; the exact value varies with the random input)
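
To see where "E4M3 precision" comes from, the sketch below decodes all 256 byte patterns of the OCP FP8 E4M3 format and measures the spacing between neighbouring values. It assumes breccia's `"E4M3"` follows the OCP definition (bias 7, maximum 448, a single NaN pattern, no infinities). The function name is hypothetical.

```python
import math

def decode_e4m3(byte):
    """Decode one E4M3 byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan                      # the only NaN pattern; no infinities
    if exponent == 0:
        return sign * mantissa * 2.0 ** -9   # subnormal: (mantissa / 8) * 2**-6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)

finite = sorted(v for v in map(decode_e4m3, range(256)) if not math.isnan(v))
largest = max(finite)

# Largest relative gap between neighbouring positive normal values.
worst_gap = max((b - a) / a for a, b in zip(finite, finite[1:]) if a >= 2.0 ** -6)
print(f"largest finite value: {largest}")
print(f"worst relative gap: {worst_gap}, worst rounding error: {worst_gap / 2}")
```

Round-to-nearest lands at most half a gap away, so a value in the normal range is off by at most about 6% of its magnitude. The absolute error in the demo above therefore depends on how large the values in each block are.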