Rotation Quantization
Prereqs: INT4 / AWQ / GPTQ (and ideally FP8 Inference). This lesson is the trick that fixes the outlier problem the previous quantization recipes have to work around.
LLM activations are not normally distributed. A few hundred channels — usually around 1% of them — carry magnitudes 50× to 100× the rest. AWQ handles this by building per-channel scales that protect those channels. GPTQ handles it by spreading the rounding error using a Hessian. Both are fighting the same battle: the outliers exist, so we work around them.
Rotation quantization asks a different question. What if we could just make the outliers go away before quantization sees them?
The answer is yes, and it’s a beautiful piece of linear algebra. Pick an orthogonal matrix R. Apply it to the activation; fold R⁻¹ into the next layer’s weights offline. The math is identical (WR⁻¹ × Rx = Wx), but the intermediate Rx is what gets quantized — and Rx has its mass spread evenly across all coordinates, with no outlier channels at all. Now naive 4-bit quantization works: every value gets full precision because the per-tensor scale isn’t pulled wide by any single channel.
This lesson is that trick — what it is, why it works, and why it’s the production default for sub-FP8 quantization in 2026.
TL;DR
- LLM activations have outliers — a few channels with magnitudes 10–100× larger than typical. They wreck quantization (the scale gets pulled too wide; non-outlier values lose precision).
- Rotation quantization applies an orthogonal matrix R to the activations and a compensating R⁻¹ to the next layer’s weights. The math is unchanged, but the rotated activations have no outlier channels — their magnitudes are spread evenly.
- Rotated activations quantize cleanly to INT4 / FP4 / FP8 without outlier-protection tricks like AWQ. QuaRot (Ashkboos et al., 2024) was the first published version; SpinQuant (Liu et al., 2024) learns the rotation; both deliver near-FP16 accuracy at INT4.
- The rotation cost: one extra small matmul per attention block at runtime — typically under 2% throughput hit. The accuracy gain at low bit-widths can be 2–5 points on MMLU.
- Where it matters most: aggressive 4-bit and below. For FP8, outliers usually don’t break things. For 4-bit (MXFP4, INT4) and below, rotation is increasingly default — DeepSeek-V3 uses a related trick during training; production INT4 stacks like vLLM v1 ship rotation as an option.
Mental model
WR⁻¹ × Rx = W × x. Same math, but the intermediate Rx is what gets quantized, and Rx has no outliers.
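A quick NumPy sanity check that folding the inverse rotation into the weights changes nothing (illustrative sketch, not from any rotation-quantization codebase):

```python
# Verify: folding R⁻¹ into the weights leaves the layer output unchanged.
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(size=(d, d))                   # layer weight
x = rng.normal(size=d)                        # activation (column-vector convention)

R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal R (so R⁻¹ = Rᵀ)
W_folded = W @ R.T                            # offline: W → W R⁻¹

# (W R⁻¹)(R x) == W x, up to float rounding
assert np.allclose(W_folded @ (R @ x), W @ x)
```

The rotated activation Rx is what the quantizer sees; the folded weight costs nothing at runtime.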
The outlier problem
Modern transformer training produces activations with a handful of “fat” channels. Studies going back to LLM.int8() (Dettmers et al., 2022) show that for many models, ~1% of activation channels carry magnitudes ~50× the median. Geometrically: the activation vector lives in a high-dimensional space but has most of its mass concentrated along a few axes.
When you quantize this:
- Per-tensor scaling sets the scale based on the absolute max → all the non-outlier values become tiny relative to the scale → they collapse onto just one or two quantization levels → most of your precision is wasted on capturing outliers.
- Per-channel scaling (AWQ) protects the outlier channels at the cost of metadata; works but adds complexity.
- Rotation removes the outliers altogether by spreading their magnitude across all channels.
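A tiny NumPy illustration of the per-tensor failure mode (synthetic numbers; `quantize_int4` is a toy symmetric quantizer, not a production kernel):

```python
# Toy demo: one outlier channel stretches the per-tensor scale and
# starves every ordinary value of quantization levels.
import numpy as np

def quantize_int4(v):
    scale = np.abs(v).max() / 7            # symmetric INT4: levels -8..7
    return np.clip(np.round(v / scale), -8, 7) * scale

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 4096)             # well-behaved activations
x_out = x.copy()
x_out[0] = 100.0                           # inject one fat channel

mean_err = lambda v: np.abs(quantize_int4(v) - v).mean()
print(f"no outlier:   mean abs error {mean_err(x):.3f}")
print(f"with outlier: mean abs error {mean_err(x_out):.3f}")
```

With the outlier present, nearly every ordinary value rounds to zero; the mean error jumps by several times.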
Why a rotation works
If R is an orthogonal matrix (R⁻¹ = Rᵀ, ‖Rx‖ = ‖x‖), then Rx has the same Euclidean norm as x but its mass is redistributed across coordinates.
Specifically, if x has one fat channel of magnitude 100 and 4095 others of magnitude 1, then Rx (for a “well-mixing” R like a Hadamard transform) has every coordinate of roughly the same magnitude — the fat-channel mass got smeared across all coordinates.
The math:
‖Rx‖² = xᵀRᵀRx = xᵀx = ‖x‖² (norm preserved)
Var(Rx[i]) ≈ ‖x‖²/d for a uniform-mixing R (each coordinate gets a balanced sum of all input coordinates)
A 4096-dim vector with ‖x‖² ≈ 100² (dominated by one outlier) becomes a vector where every coordinate has magnitude ≈ 100/√4096 ≈ 1.5. The 100× ratio between fat and thin channels collapses to roughly 1.
Hadamard rotations
The standard “well-mixing” orthogonal matrix is the Hadamard transform: a 2ᵏ × 2ᵏ matrix with entries ±1/√d whose product with x is computable in O(d log d) time (like an FFT). It’s not learned — it’s a fixed structured matrix.
Hadamard rotations are:
- Cheap to apply at inference: O(d log d) instead of O(d²) for a generic rotation.
- Parameter-free — you don’t need to store an extra weight matrix.
- Excellent at outlier flattening in practice; QuaRot uses Hadamard with strong results.
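The structure is easy to see from the Sylvester construction (small-scale NumPy sketch):

```python
# Sylvester construction: H_{2d} = [[H_d, H_d], [H_d, -H_d]], entries ±1.
import numpy as np

def hadamard(k):
    H = np.array([[1.0]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

d = 2 ** 5
H = hadamard(5) / np.sqrt(d)                   # normalize → orthogonal rotation
assert np.allclose(H @ H.T, np.eye(d))         # H⁻¹ = Hᵀ
assert np.allclose(np.abs(H), 1 / np.sqrt(d))  # every entry is ±1/√d
```

The recursive block structure is what makes the O(d log d) butterfly evaluation possible.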
QuaRot’s recipe:
- Insert a Hadamard rotation H between every attention block’s input and the projection it feeds into.
- Fold H⁻¹ into the next layer’s weights offline (one-time): W → WH⁻¹.
- At inference: applying H to activations is essentially free; quantize Hx; WH⁻¹ is the new weight.
- Quantize WH⁻¹ using your favorite recipe (GPTQ, AWQ, or just symmetric per-block).
End result: 4-bit quantization with quality nearly indistinguishable from FP16.
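The “symmetric per-block” option in the last step can be sketched like this (toy NumPy version, not the QuaRot kernel; each group of `group_size` values shares one scale):

```python
# Sketch: symmetric per-group INT4 quantization of a weight vector.
import numpy as np

def quantize_groupwise_int4(w, group_size=128):
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7   # INT4 range -8..7
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, scale = quantize_groupwise_int4(w)

w_hat = (q * scale).reshape(-1)                # dequantize for error check
print(np.abs(w_hat - w).max())                 # bounded by half a step per group
```

Each group’s error is bounded by half its own quantization step, which is why flattening outliers first (so no group has a huge max) matters so much.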
Learned rotations — SpinQuant
QuaRot fixes R to a Hadamard. SpinQuant asks: can we learn a better rotation per layer? The answer is yes, and the learned rotations outperform Hadamard by ~0.5 pt on aggressive (3-bit) quantization. The cost is a small calibration step (~30 minutes of training on a tiny calibration set per layer). For 4-bit, Hadamard is usually good enough; for 3-bit and below, SpinQuant pulls ahead.
SpinQuant parameterizes R as a product of Givens rotations (each rotates a pair of coordinates), making the optimization tractable. The rotation matrices are stored alongside the model — small overhead.
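To see why the parameterization is convenient: each Givens rotation mixes exactly one coordinate pair, and any product of them is still orthogonal, so gradient steps on the angles can never leave the space of valid rotations. A toy NumPy sketch (illustrative, not SpinQuant’s training loop):

```python
# A Givens rotation touches one coordinate pair; products stay orthogonal.
import numpy as np

def givens(d, i, j, theta):
    G = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = c; G[j, j] = c
    G[i, j] = -s; G[j, i] = s
    return G

d = 8
rng = np.random.default_rng(0)
R = np.eye(d)
for i in range(d):
    for j in range(i + 1, d):                      # one angle per coordinate pair
        R = R @ givens(d, i, j, rng.uniform(0, 2 * np.pi))

assert np.allclose(R @ R.T, np.eye(d))             # still a rotation
```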
KV-cache rotations
Rotation works equally well for KV-cache quantization. The K and V tensors have outliers; rotating before storage flattens them; the rotation is undone (or absorbed into the next attention’s rotation) on read. This is what enables FP8 KV cache quantization with negligible quality regression on most models.
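A toy sketch of the store/read flow (assumed names; a crude uniform quantizer stands in for FP8):

```python
# Toy KV-cache flow: rotate K before quantized storage, undo on read.
import numpy as np

rng = np.random.default_rng(0)
d = 64
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # fixed orthogonal rotation

def quantize_uniform(v, levels=256):
    """Crude symmetric quantizer standing in for FP8 storage."""
    scale = np.abs(v).max() / (levels // 2 - 1)
    return np.round(v / scale) * scale

k = rng.normal(0.0, 1.0, d)
k[3] = 50.0                                    # outlier channel in the key

k_stored = quantize_uniform(R @ k)             # rotate, then quantize into cache
k_read = R.T @ k_stored                        # undo the rotation on read

err_rot = np.abs(k_read - k).max()
err_raw = np.abs(quantize_uniform(k) - k).max()
print(err_rot, err_raw)
```

The rotated path reconstructs the key with less error than quantizing the raw outlier-heavy vector, because the quantizer’s scale is no longer dominated by one channel.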
DeepSeek-V3’s MLA (Multi-head Latent Attention) is conceptually adjacent — the latent compression includes a rotation in the head dimension that has a similar outlier-flattening effect.
Calling it from Python
QuaRot ships as a one-shot offline transform on top of an existing checkpoint:
# QuaRot reference repo — sketched; exact entry points may differ by version
from transformers import AutoModelForCausalLM
from quarot import rotate_model, quantize_to_int4

# Load the original BF16 model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Insert Hadamard rotations and absorb the inverses into adjacent weights
rotate_model(model)  # offline; runs in minutes

# Now naive INT4 quantization works — no more outliers to protect
quantize_to_int4(model, group_size=128)
model.save_pretrained("./llama-3.1-8b-quarot-int4")

vLLM picks up the rotated checkpoint automatically and uses an INT4 kernel:
from vllm import LLM

llm = LLM(model="./llama-3.1-8b-quarot-int4", quantization="quarot")

The runtime cost of rotation is one extra Hadamard per attention block — implemented as a fast O(d log d) kernel that fuses with the surrounding linear projection. Sub-2% throughput hit, multi-point quality win at INT4. Free lunch.
Production picture (April 2026)
- vLLM v1 ships QuaRot-style rotation as an opt-in for INT4 weight quantization.
- TensorRT-LLM has its own rotation-aware INT4 path.
- ExecuTorch + Hexagon (Qualcomm) uses learned rotations as part of QNN quantization.
- Llama-4 reportedly uses a related rotation in its long-context variants.
The takeaway: by mid-2026, rotation is no longer a “research trick” — it’s the production default for sub-FP8 quantization. Knowing what it does is essential for anyone working with edge or aggressive-server quantization.
Where rotation doesn’t help
- FP8 quantization rarely needs it — the format’s range is wide enough to handle outliers directly.
- Models without outliers — some recently-trained models with careful normalization (e.g., applying RMSNorm in a way that keeps activations bounded) have flatter activation distributions and gain less from rotation. Check the magnitude distribution of your model’s activations before rotating.
- At very low precision (1-bit, ternary) — rotation alone isn’t enough; specialized recipes (BitNet, ternary quantization) handle this differently.
Outlier flattening with Hadamard
A Hadamard rotation collapses the channel-magnitude ratio by one to two orders of magnitude — exactly the structural change that makes 4-bit / 3-bit quantization tractable.
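A NumPy stand-in for the demo (fast Walsh–Hadamard transform; the activation vector and its outlier are synthetic):

```python
# Demo: a Hadamard rotation flattens outlier channels.
import numpy as np

def hadamard_transform(x):
    """Fast Walsh-Hadamard transform, normalized to be orthogonal. O(d log d)."""
    d = len(x)
    h = x.astype(np.float64).copy()
    step = 1
    while step < d:
        for i in range(0, d, step * 2):
            a = h[i:i + step].copy()
            b = h[i + step:i + step * 2].copy()
            h[i:i + step] = a + b
            h[i + step:i + step * 2] = a - b
        step *= 2
    return h / np.sqrt(d)

rng = np.random.default_rng(0)
d = 4096
x = rng.normal(0.0, 1.0, d)            # "thin" channels, magnitude ~1
x[7] = 80.0                            # one fat outlier channel

rx = hadamard_transform(x)

ratio = lambda v: np.abs(v).max() / np.median(np.abs(v))
print(f"before: max/median ratio ~{ratio(x):.0f}x")
print(f"after:  max/median ratio ~{ratio(rx):.0f}x")
print("norm preserved:", np.isclose(np.linalg.norm(x), np.linalg.norm(rx)))
```

The outlier’s mass gets smeared across all 4096 coordinates while the norm stays exactly the same — the ratio drops from roughly two orders of magnitude to low single digits.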
Key takeaways
- LLM activations have outliers; rotation flattens them. That’s the whole game.
- R orthogonal → WR⁻¹ × Rx = Wx. The compensating R⁻¹ gets absorbed into the weight matrix offline.
- Hadamard transforms are the canonical fast R: O(d log d) to apply; no learned parameters; near-optimal mixing.
- QuaRot uses fixed Hadamard; SpinQuant learns better rotations for 3-bit and below.
- Production default for sub-FP8 quantization in 2026. vLLM, TensorRT-LLM, ExecuTorch all support it; expect every aggressive-INT4 / FP4 stack to bake it in.
Go deeper
- Paper: “QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs”. The QuaRot paper; Section 3 has the math, Section 5 the empirical comparison with AWQ/GPTQ.
- Paper: “SpinQuant: LLM Quantization with Learned Rotations”. The learned-rotation paper; pulls ahead at 3-bit and below.
- Paper: “SmoothQuant: Accurate and Efficient Post-Training Quantization for LLMs”. The pre-rotation per-channel-scaling alternative; useful background.
- Paper: “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”. The paper that first identified the outlier problem in detail; foundational for everything since.
- Docs: vLLM — Quark / Rotation-Aware Quantization. Production knobs for rotation-based INT4 quantization in vLLM v1.
- Repo: spcl/QuaRot. Reference QuaRot implementation; the Hadamard injection points map directly to the paper figures.
- Repo: facebookresearch/SpinQuant. Reference SpinQuant; the learning loop that finds optimal rotations is in `optimize_rotation/`.
Why this matters
Outlier channels are the single most common reason “INT4 / FP4 looked great on small models but broke our 70B.” Pre-rotation, the workaround is per-channel calibration (AWQ-style); post-rotation, the problem just goes away. It’s the kind of trick that, once you understand it, makes the whole quantization stack feel less fragile. It’s also a beautiful piece of linear algebra — exactly the kind of insight that distinguishes the deep ML systems engineer from the recipe-follower.