
Rotation Quantization

Prereqs: INT4 / AWQ / GPTQ (and ideally FP8 Inference). This lesson is the trick that fixes the outlier problem the previous quantization recipes have to work around.

LLM activations are not normally distributed. A few hundred channels — usually around 1% of them — carry magnitudes 50× to 100× the rest. AWQ handles this by building per-channel scales that protect those channels. GPTQ handles it by spreading the rounding error using a Hessian. Both are fighting the same battle: the outliers exist, so we work around them.

Rotation quantization asks a different question. What if we could just make the outliers go away before quantization sees them?

The answer is yes, and it’s a beautiful piece of linear algebra. Pick an orthogonal matrix R. Apply it to the activation; fold R⁻¹ into the next layer’s weights offline (W → WR⁻¹). The math is identical (WR⁻¹ × Rx = Wx), but the intermediate Rx is what gets quantized — and Rx has its mass spread evenly across all coordinates, with no outlier channels at all. Now naive 4-bit quantization works: no single outlier pulls the per-tensor scale wide, so every value keeps useful precision.

This lesson is that trick — what it is, why it works, and why it’s the production default for sub-FP8 quantization in 2026.

TL;DR

  • LLM activations have outliers — a few channels with magnitudes 10–100× larger than typical. They wreck quantization (the scale gets pulled too wide; non-outlier values lose precision).
  • Rotation quantization applies an orthogonal matrix R to the activations and a compensating R⁻¹ to the next layer’s weights. The math is unchanged, but the rotated activations have no outlier channels — their magnitudes are spread evenly.
  • Rotated activations quantize cleanly to INT4 / FP4 / FP8 without outlier-protection tricks like AWQ. QuaRot (Ashkboos et al., 2024) was the first published version; SpinQuant (Liu et al., 2024) learns the rotation; both deliver near-FP16 accuracy at INT4.
  • The rotation cost: one extra small matmul per attention block at runtime — typically under 2% throughput hit. The accuracy gain at low bit-widths can be 2–5 points on MMLU.
  • Where it matters most: aggressive 4-bit and below. For FP8, outliers usually don’t break things. For 4-bit (MXFP4, INT4) and below, rotation is increasingly default — DeepSeek-V3 uses a related trick during training; production INT4 stacks like vLLM v1 ship rotation as an option.

Why this matters

Outlier channels are the single most common reason “INT4 / FP4 looked great on small models but broke our 70B.” Pre-rotation, the workaround is per-channel calibration (AWQ-style); post-rotation, the problem just goes away. It’s the kind of trick that, once you understand it, makes the whole quantization stack feel less fragile. It’s also a beautiful piece of linear algebra — exactly the kind of insight that distinguishes the deep ML systems engineer from the recipe-follower.

Mental model

WR⁻¹ × Rx = W × x. Same math, but the intermediate Rx is what gets quantized, and Rx has no outliers.
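A quick numerical sanity check of that identity, as a minimal NumPy sketch with a random orthogonal R from a QR decomposition (everything here is illustrative; any orthogonal R works):

import numpy as np

rng = np.random.default_rng(0)
d = 8

# Random weight matrix and activation vector
W = rng.normal(size=(d, d))
x = rng.normal(size=d)

# Random orthogonal R via QR decomposition (so R^-1 = R^T)
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

y_original = W @ x                 # the un-rotated computation
y_rotated  = (W @ R.T) @ (R @ x)   # fold R^-1 = R^T into W offline, rotate x at runtime

print(np.allclose(y_original, y_rotated))  # True: same math, different intermediate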

Concrete walkthrough

The outlier problem

Modern transformer training produces activations with a handful of “fat” channels. Studies from Anthropic and various academic groups report that, for many models, roughly 1% of activation channels carry magnitudes around 50× the median. Geometrically: the activation vector lives in a high-dimensional space but has most of its mass concentrated along a few axes.

When you quantize this:

  • Per-tensor scaling sets the scale from the absolute max → all the non-outlier values become tiny relative to the scale → they collapse into the lowest one or two quantization levels → most of your precision is wasted capturing the outliers (see the sketch after this list).
  • Per-channel scaling (AWQ) protects the outlier channels at the cost of metadata; works but adds complexity.
  • Rotation removes the outliers altogether by spreading their magnitude across all channels.
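To make the per-tensor failure concrete, here is a minimal NumPy sketch (symmetric per-tensor 4-bit, illustrative values): it quantizes the same vector with and without a single outlier and measures the error on the ordinary channels only.

import numpy as np

def quantize_int4_per_tensor(x):
    # Symmetric per-tensor quantization to 4 bits: integer levels in [-7, 7]
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -7, 7)
    return q * scale

rng = np.random.default_rng(0)
x = rng.normal(size=4096)          # well-behaved activations, magnitudes around 1

x_outlier = x.copy()
x_outlier[0] = 100.0               # one fat channel, roughly 100x the rest

for name, v in [("no outlier", x), ("one outlier", x_outlier)]:
    v_hat = quantize_int4_per_tensor(v)
    err = np.abs(v_hat[1:] - v[1:]).mean()   # error on the ordinary channels only
    print(f"{name:12s} mean abs error on normal channels: {err:.4f}")

The outlier stretches the per-tensor scale by roughly 30×, so the ordinary values land in the bottom one or two quantization levels and their error jumps by several times.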

Why a rotation works

If R is an orthogonal matrix (R⁻¹ = Rᵀ, ‖Rx‖ = ‖x‖), then Rx has the same Euclidean norm as x but its mass is redistributed across coordinates.

Specifically, if x has one fat channel of magnitude 100 and 4095 others of magnitude 1, then Rx (for a “well-mixing” R like a Hadamard transform) has every coordinate of roughly the same magnitude — the fat-channel mass got smeared across all coordinates.

The math:

  • ‖Rx‖² = xᵀRᵀRx = xᵀx = ‖x‖² (norm preserved)
  • (Rx)ᵢ² ≈ ‖x‖²/d on average over coordinates for a uniform-mixing R (each coordinate gets a balanced sum of all input coordinates)

A 4096-dim vector with ‖x‖² ≈ 100² (dominated by one outlier) becomes a vector where every coordinate has magnitude ≈ 100/√4096 ≈ 1.5. The 100× ratio between fat and thin channels collapses to roughly 1.

Hadamard rotations

The standard “well-mixing” orthogonal matrix is the Hadamard transform: a 2^k × 2^k matrix with entries ±1/√d (d = 2^k) whose product with x is computable in O(d log d) time (like an FFT). It’s not learned — it’s a fixed structured matrix.
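The O(d log d) claim comes from the same butterfly structure as the FFT. A minimal sketch of an in-place fast Walsh–Hadamard transform (illustrative, not a production kernel; assumes d is a power of two):

import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform, normalized so the matrix is orthogonal.
    # Each pass combines pairs of blocks (a butterfly), giving O(d log d) total work.
    x = x.astype(np.float64).copy()
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)   # 1/sqrt(d) makes it norm-preserving

x = np.zeros(4096)
x[7] = 100.0                       # one fat channel
x_rot = fwht(x)
print(np.abs(x_rot).max(), np.abs(x_rot).min())   # every coordinate is now 100/sqrt(4096) = 1.5625

This reproduces the arithmetic from the previous section: the single 100-magnitude channel gets smeared into 4096 coordinates of magnitude about 1.5.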

Hadamard rotations are:

  • Cheap to apply at inference (O(d log d) instead of O(d²) for a generic rotation).
  • No-parameters — you don’t need to store an extra weight matrix.
  • Excellent at outlier flattening in practice; QuaRot uses Hadamard with strong results.

QuaRot’s recipe:

  1. Insert a Hadamard rotation H between every attention block’s input and the projection it feeds into.
  2. Fold H⁻¹ into the next layer’s weights, W → WH⁻¹ (offline; one-time).
  3. At inference: applying H to the activations is essentially free; quantize Hx; WH⁻¹ is the new weight.
  4. Quantize WH⁻¹ using your favorite recipe (GPTQ, AWQ, or just symmetric per-block).

End result: 4-bit quantization with quality nearly indistinguishable from FP16.

Learned rotations — SpinQuant

QuaRot fixes R to a Hadamard. SpinQuant asks: can we learn a better rotation per layer? The answer is yes, and the learned rotations outperform Hadamard by ~0.5 pt on aggressive (3-bit) quantization. The cost is a small calibration step (~30 minutes of training on a tiny calibration set per layer). For 4-bit, Hadamard is usually good enough; for 3-bit and below, SpinQuant pulls ahead.

SpinQuant parameterizes R as a product of Givens rotations (each rotates a pair of coordinates), making the optimization tractable. The rotation matrices are stored alongside the model — small overhead.
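The parameterization idea, sketched (this illustrates Givens composition in general, not SpinQuant’s actual code): each Givens rotation touches only one coordinate pair, so a product of them stays exactly orthogonal while exposing a small set of learnable angles.

import numpy as np

def givens(d, i, j, theta):
    # Orthogonal matrix that rotates the (i, j) coordinate plane by angle theta
    G = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i], G[j, j] = c, c
    G[i, j], G[j, i] = -s, s
    return G

d = 8

# Compose a few Givens rotations; the angles would be the learned parameters
R = np.eye(d)
for (i, j, theta) in [(0, 1, 0.3), (2, 5, 1.1), (3, 7, -0.7)]:
    R = givens(d, i, j, theta) @ R

print(np.allclose(R @ R.T, np.eye(d)))   # True: the product is still orthogonal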

KV-cache rotations

Rotation works equally well for KV-cache quantization. The K and V tensors have outliers; rotating before storage flattens them; the rotation is undone (or absorbed into the next attention’s rotation) on read. This is what enables FP8 KV-cache quantization with negligible quality regression on most models.
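A minimal sketch of the absorption variant for keys (illustrative shapes, not any particular engine’s cache layout): if queries and cached keys are rotated by the same orthogonal H, the attention logits are unchanged, while the stored keys have their channel magnitudes flattened.

import numpy as np
from scipy.linalg import hadamard

d_head = 128
n_tokens = 64
rng = np.random.default_rng(0)

H = hadamard(d_head) / np.sqrt(d_head)        # orthogonal head-dim rotation

Q = rng.normal(size=(n_tokens, d_head))
K = rng.normal(size=(n_tokens, d_head))
K[:, 5] *= 40.0                               # a fat channel in the keys

K_cached = K @ H                              # rotate before writing to the KV cache
Q_rot = Q @ H                                 # apply the same rotation to the queries

scores = Q @ K.T                              # original attention logits
scores_rot = Q_rot @ K_cached.T               # logits computed from the rotated cache
print(np.allclose(scores, scores_rot))        # True: (QH)(KH)^T = Q K^T

def channel_ratio(A):
    per_channel_max = np.abs(A).max(axis=0)
    return per_channel_max.max() / np.median(per_channel_max)

print(f"key channel max/median: {channel_ratio(K):.1f}x -> {channel_ratio(K_cached):.1f}x")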

DeepSeek-V3’s MLA (Multi-head Latent Attention) is conceptually adjacent — the latent compression includes a rotation in the head dimension that has a similar outlier-flattening effect.

Calling it from Python

QuaRot ships as a one-shot offline transform on top of an existing checkpoint:

# QuaRot reference repo — sketched
from transformers import AutoModelForCausalLM
from quarot import rotate_model, quantize_to_int4

# Load the original BF16 model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Insert Hadamard rotations and absorb the inverses into adjacent weights
rotate_model(model)  # offline; runs in minutes

# Now naive INT4 quantization works — no more outliers to protect
quantize_to_int4(model, group_size=128)

model.save_pretrained("./llama-3.1-8b-quarot-int4")

vLLM picks up the rotated checkpoint automatically and uses an INT4 kernel:

from vllm import LLM

llm = LLM(model="./llama-3.1-8b-quarot-int4", quantization="quarot")

The runtime cost of rotation is one extra Hadamard per attention block — implemented as a fast O(d log d) kernel that fuses with the surrounding linear projection. Sub-2% throughput hit, multi-point quality win at INT4. Free lunch.

Production picture (April 2026)

  • vLLM v1 ships QuaRot-style rotation as an opt-in for INT4 weight quantization.
  • TensorRT-LLM has its own rotation-aware INT4 path.
  • ExecuTorch + Hexagon (Qualcomm) uses learned rotations as part of QNN quantization.
  • Llama-4 reportedly uses a related rotation in its long-context variants.

The takeaway: by mid-2026, rotation is no longer a “research trick” — it’s the production default for sub-FP8 quantization. Knowing what it does is essential for anyone working with edge or aggressive-server quantization.

Where rotation doesn’t help

  • FP8 quantization rarely needs it — the format’s range is wide enough to handle outliers directly.
  • Models without outliers — some recently trained models with careful normalization (e.g., applying RMSNorm in a way that keeps activations bounded) have flatter activation distributions and gain less from rotation. Check the magnitude distribution of your model’s activations before rotating (a quick way to do this is sketched after this list).
  • At very low precision (1-bit, ternary) — rotation alone isn’t enough; specialized recipes (BitNet, ternary quantization) handle this differently.
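A minimal sketch of that check using a forward hook on a Hugging Face causal LM. The model name and the layer path (model.model.layers[i].mlp.down_proj, which matches Llama-style architectures) are illustrative; adapt them to your model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"   # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

stats = []

def hook(module, inputs, output):
    # Per-channel max magnitude of this layer's input activations
    x = inputs[0].detach().float()            # (batch, seq, hidden)
    per_channel = x.abs().amax(dim=(0, 1))
    stats.append(per_channel.max() / per_channel.median())

# Hook the down-projection inputs, a common home for outlier channels
handles = [layer.mlp.down_proj.register_forward_hook(hook)
           for layer in model.model.layers]

batch = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    model(**batch)

for h in handles:
    h.remove()

print(f"median layer max/median channel ratio: {torch.stack(stats).median():.1f}x")

If that ratio is already in the single digits across layers, rotation will buy little; ratios in the tens are the signature of the outlier problem this lesson targets.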

Run it in your browser — outlier flattening with Hadamard

Python (editable): generate an outlier-heavy activation, apply a Hadamard rotation, and watch the channel magnitudes equalize.
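Since the interactive cell isn’t reproduced here, this is a minimal offline equivalent (NumPy plus scipy.linalg.hadamard; sizes and the number of fat channels are illustrative, and the exact ratios vary with the seed). It reports a collapse from roughly 80× to near 1×, the same qualitative flattening the interactive cell shows.

import numpy as np
from scipy.linalg import hadamard

d = 1024          # hidden size (power of two, so the Hadamard construction applies)
n_tokens = 512
rng = np.random.default_rng(0)

# Simulated activations: most channels ~N(0, 1), a few fat channels ~80x larger
X = rng.normal(size=(n_tokens, d))
fat = rng.choice(d, size=8, replace=False)
X[:, fat] *= 80.0

# Normalized Hadamard matrix: orthogonal, entries +/- 1/sqrt(d)
H = hadamard(d).astype(np.float64) / np.sqrt(d)
X_rot = X @ H     # rotate every activation vector; the weights would absorb H.T offline

def channel_ratio(A):
    rms = np.sqrt((A ** 2).mean(axis=0))      # per-channel RMS magnitude
    return rms.max() / np.median(rms)

print(f"max/median channel magnitude before rotation: {channel_ratio(X):.1f}x")
print(f"max/median channel magnitude after rotation:  {channel_ratio(X_rot):.1f}x")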

The output shows the rotation flattens the channel-magnitude ratio from ~80× to ~2× — exactly the structural change that makes 4-bit / 3-bit quantization tractable.

Quick check

Fill in the blank
The key mathematical property of the rotation matrix R that makes the trick work:
Norm-preserving, R⁻¹ = Rᵀ.
Quick check
A team is quantizing Llama-4-405B to INT4 and seeing 4 pt MMLU regression — way more than the usual 1 pt. The most likely fix:
Apply a QuaRot / SpinQuant-style rotation before quantizing, so the activation outliers are flattened instead of stretching the quantization scale.

Key takeaways

  1. LLM activations have outliers; rotation flattens them. That’s the whole game.
  2. R orthogonal → WR⁻¹ × Rx = Wx. The compensating R⁻¹ gets absorbed into the weight matrix offline.
  3. Hadamard transforms are the canonical fast R. O(d log d) to apply; no learned parameters; near-optimal mixing.
  4. QuaRot uses fixed Hadamard; SpinQuant learns better rotations for 3-bit and below.
  5. Production default for sub-FP8 quantization in 2026. vLLM, TensorRT-LLM, ExecuTorch all support it; expect every aggressive-INT4 / FP4 stack to bake it in.

Go deeper
