GGUF & i-matrix
Prereqs: llama.cpp Internals, INT4 / AWQ / GPTQ. This lesson is about what makes the local-LLM 4-bit experience actually high-quality.
In a cloud-serving stack, “the model” is a directory of FP16 or BF16 safetensors with a config.json next to it. The runtime loads the whole thing into HBM and runs it as-is. There’s no quantization step on the deployment path — quantization, if any, lives somewhere upstream in your training pipeline.
On a phone or laptop, FP16 isn’t on the table — a 70B model in FP16 is 140 GB, a 7B model is 14 GB, neither fits in user-grade RAM. So the local-LLM ecosystem has spent three years building a quantization story that gets you 4-bit weights without making the model feel measurably stupider. The format is GGUF; the recipe is the K-quants; and the secret sauce that pushes Q4_K_M from “okay” to “indistinguishable from FP16” is a per-channel calibration vector called the i-matrix.
The shift from cloud thinking is subtle but important. In the cloud, the file is the weights. On the device, the file carries weights plus the recipe that quantized them — and a careful enough recipe is the difference between a 3B model that feels stupid and one that feels like the API. This lesson is about that recipe.
TL;DR
- GGUF carries quantized weights plus the recipe that quantized them. The K-quant variants (Q4_K_M, Q5_K_M, etc.) are baked into the format spec.
- K-quants store per-super-block (256 weights) statistics plus per-sub-block (16/32 weights) refinements. More precision than naive INT4, no calibration data needed for “OK” results.
- The i-matrix (importance matrix) is a calibration-driven addition. Compute activation statistics on a calibration dataset; weight the quantization-error metric by importance per channel; feed back into the K-quant chooser. Bumps quality of Q4_K_M / Q3_K from “OK” to “indistinguishable from FP16.”
- The i-matrix is optional but essentially free to compute (~5 minutes on a calibration set). Most modern GGUF releases on Hugging Face ship i-matrix-quantized variants.
- Reading per-quant size and quality numbers fluently — Q4_K_M ≈ 4.5 bits, Q5_K_M ≈ 5.5, Q6_K ≈ 6.6, Q8_0 ≈ 8.5 — is the price of admission for any local-LLM conversation in 2026.
Why this matters
The local-LLM ecosystem (llama.cpp, Ollama, LM Studio, Jan, the entire Hugging Face GGUF tag) lives on K-quants and i-matrix calibration. The difference between “free Llama-3.2-3B feels stupid” and “feels like the API” is almost entirely the quantization recipe, not the model. Engineers who ship to consumer-facing local AI need to know which quant to ship, why, and how the i-matrix flow works.
Mental model
The i-matrix is what makes “important” channels get more bits and “unimportant” ones get fewer — same total bit budget, less perplexity loss.
Concrete walkthrough
K-quant structure recap
A K-quant lays out weights as super-blocks of 256. Within each super-block:
- A super-block header — a few bytes of shared statistics (for Q4_K, two FP16 super-scales: one for the sub-block scales, one for the sub-block mins).
- Sub-blocks of 16 or 32 weights, each with their own refined parameters.
- The 4-bit (or 3-bit, 5-bit) weight indices, packed.
For Q4_K_M:
- Super-block: 256 weights.
- Each super-block: 4 bytes of header (two FP16 values: the super-scale for sub-block scales and the super-scale for sub-block mins).
- Sub-blocks: 8 sub-blocks of 32 weights, each with a 6-bit scale and a 6-bit min, packed into 12 bytes (8 × 12 bits = 96 bits).
- Weight bits: 4 per weight × 256 = 128 bytes per super-block.
- Total: 4 (header) + 12 (packed scales/mins) + 128 (weights) = 144 bytes per 256 weights = 4.5 bits/weight effective.
Compare to bare INT4 with one BF16 scale per group of 128: 4 + 16/128 = 4.125 bits/weight. The K-quants pay an extra ~0.4 bits/weight for the sub-block refinement — and that 0.4 bits buys back most of the accuracy.
The “M” variant (Q4_K_M, Q5_K_M) goes further: it stores some “important” weights — typically the FFN down-projection — at 6 bits while the rest stay at 4 bits. Adaptive precision per layer.
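To make the bit accounting concrete, here is a minimal calculator for the layout above. It is a sketch of the arithmetic, not a reimplementation of the format; the constants mirror the Q4_K numbers in this lesson.

```python
# Effective bits/weight for a Q4_K super-block vs. bare INT4 grouping.
SUPER_BLOCK = 256                      # weights per super-block
header_bytes = 4                       # two FP16 super-scales (d, dmin)
scales_bytes = 8 * 12 // 8             # 8 sub-blocks x (6-bit scale + 6-bit min) = 12 bytes
weight_bytes = SUPER_BLOCK * 4 // 8    # 4 bits per weight = 128 bytes

q4k_bytes = header_bytes + scales_bytes + weight_bytes   # 144 bytes
q4k_bpw = q4k_bytes * 8 / SUPER_BLOCK                    # 4.5 bits/weight

# Bare INT4: one BF16 scale per group of 128 weights.
int4_bpw = 4 + 16 / 128                                  # 4.125 bits/weight

print(f"Q4_K: {q4k_bytes} bytes / {SUPER_BLOCK} weights = {q4k_bpw} bpw")
print(f"bare INT4 (group=128): {int4_bpw} bpw")
```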
Why i-matrix?
Without an i-matrix, K-quants pick which weights to give more bits via static heuristics (this layer is FFN, that one is attention, etc.). Heuristics are decent but blind to the actual weight-to-activation interaction in the model.
The i-matrix is a per-input-channel weight (literally: a vector per layer) that captures how much each channel matters based on calibration activations. The math:

$$I_i = \sum_{x \in \mathcal{D}} a_i(x)^2$$

where $a_i(x)$ is the i-th input channel of the activations on the calibration set $\mathcal{D}$. Channels that consistently have large activations are “important” — quantizing their weights badly causes large output errors.
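A minimal sketch of that statistic, assuming you can capture the input activations of each linear layer during a calibration run (the real imatrix tool hooks every matmul inside ggml; names here are illustrative):

```python
import numpy as np

def accumulate_importance(batches):
    """Per-input-channel importance: a running sum of squared activations.

    batches: iterable of [n_tokens, n_channels] activation arrays captured
    at the input of one weight matrix while running calibration text.
    """
    importance = None
    for a in batches:
        sq = (a.astype(np.float64) ** 2).sum(axis=0)   # sum of a_i^2 over tokens
        importance = sq if importance is None else importance + sq
    return importance                                   # shape: [n_channels]
```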
The K-quant chooser, given an i-matrix, scales the per-block error metric:

$$E = \sum_i I_i \,\big(w_i - \hat{w}_i\big)^2$$

where $\hat{w}_i$ is the weight after quantize-then-dequantize, and picks scales/zero-points that minimize this weighted error. Important weights get tighter scales (less rounding error); unimportant weights tolerate looser scales.
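As a toy illustration of the weighted objective (the real K-quant chooser also searches per-sub-block mins and nests everything under the super-block scales; this grid search is a deliberate simplification):

```python
import numpy as np

def best_scale_weighted(w, importance, bits=4, n_grid=64):
    """Pick a symmetric block scale minimizing importance-weighted MSE."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit symmetric
    s_max = np.abs(w).max() / qmax
    best_s, best_err = s_max, np.inf
    for k in range(1, n_grid + 1):
        s = s_max * k / n_grid                 # candidate scales up to |w|max / qmax
        q = np.clip(np.round(w / s), -qmax - 1, qmax)
        err = float((importance * (w - q * s) ** 2).sum())  # weighted error
        if err < best_err:
            best_s, best_err = s, err
    return best_s
```

With importance set to all ones this collapses to plain unweighted MSE; the i-matrix only changes which rounding errors the search is willing to tolerate.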
The shape is similar to AWQ’s per-channel scaling but the implementation is simpler — no offline weight rescaling, just a different inner-loop metric during the same K-quant procedure.
Computing the i-matrix in practice
The imatrix CLI is part of llama.cpp — written in C/C++, it runs on the same ggml core that does inference. Calibration is a one-time offline step:
# 1. Convert your model to a base GGUF (this Python script ships with llama.cpp)
python convert_hf_to_gguf.py meta-llama/Llama-3.2-3B-Instruct
# 2. Compute the i-matrix on a calibration set
./imatrix -m llama-3.2-3b-f16.gguf -f calibration.txt -o llama-3.2-3b.imatrix
# 3. Quantize using the i-matrix
./quantize --imatrix llama-3.2-3b.imatrix llama-3.2-3b-f16.gguf llama-3.2-3b-q4_k_m.gguf Q4_K_M

Calibration set: 100–500 sequences from the same domain as your target use case. WikiText is a generic default; for code-heavy use cases, calibrate on code; for chat, on chat.
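If you do build your own calibration file, something as simple as the following works. The corpus directory, the 300-sequence budget, and the truncation length are all illustrative choices, not llama.cpp requirements:

```python
# Assemble a domain-matched calibration.txt for the imatrix step.
import pathlib
import random

docs = sorted(pathlib.Path("my_domain_corpus").glob("*.txt"))  # hypothetical corpus dir
random.seed(0)
random.shuffle(docs)

with open("calibration.txt", "w", encoding="utf-8") as out:
    for doc in docs[:300]:                      # 100-500 sequences is the usual advice
        text = doc.read_text(encoding="utf-8").strip()
        out.write(text[:4000] + "\n\n")         # keep sequences short; imatrix chunks the file
```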
The i-matrix file is small (~MB scale). It’s computed once per model and reused across quant levels — the same i-matrix can produce Q4_K_M, Q5_K_M, Q3_K, etc.
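Because one i-matrix serves every quant level, release pipelines typically loop over targets; a sketch using the binaries from the commands above:

```python
import subprocess

IMATRIX = "llama-3.2-3b.imatrix"
BASE = "llama-3.2-3b-f16.gguf"

# One i-matrix, many quant levels: the calibration cost is paid once.
for level in ["Q4_K_M", "Q5_K_M", "Q3_K_M"]:
    out = f"llama-3.2-3b-{level.lower()}.gguf"
    subprocess.run(["./quantize", "--imatrix", IMATRIX, BASE, out, level], check=True)
```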
Quality impact, with numbers
Approximate MMLU drops vs FP16, in points (median across recent Llama / Qwen / DeepSeek-distill checkpoints):
| Quant | Without i-matrix | With i-matrix |
|---|---|---|
| Q8_0 | -0.1 | -0.1 |
| Q6_K | -0.3 | -0.2 |
| Q5_K_M | -0.7 | -0.4 |
| Q4_K_M | -1.5 | -0.8 |
| Q3_K_M | -3.5 | -2.0 |
| Q2_K | -8.0 | -5.5 |
The i-matrix gives back roughly half the regression. For Q4_K_M, the production sweet spot, that’s the difference between “noticeable” and “imperceptible” in chat use.
Picking a quant in 2026
- Q4_K_M with i-matrix: the universal default for local LLMs ≥ 3B.
- Q5_K_M with i-matrix: when you have headroom (memory or disk) and want lossless-feeling quality.
- Q6_K: near-lossless; rarely worth it vs Q5_K_M.
- Q8_0: a “sanity baseline” — basically as good as FP16, twice the file size of Q4_K_M.
- Q3_K_M: only when you can’t fit Q4 (typically low-RAM devices). With i-matrix, Q3_K is usable.
- Q2_K: aggressive edge; only when the device can’t afford anything else. Visible quality drop.
Hugging Face GGUF ecosystem
The bartowski and LoneStriker accounts on Hugging Face publish thousands of quantized GGUF repositories — typically every K-quant level, i-matrix-calibrated, for each popular base model. The naming convention:
- Llama-3.2-3B-Instruct-Q4_K_M.gguf — the pre-computed, i-matrix-quantized model.
- Llama-3.2-3B-Instruct.imatrix — the i-matrix file itself, published separately.
For most engineers in 2026, “computing your own i-matrix” is unnecessary — the community has done it for every model that matters. Knowing the format and the recipe is what lets you debug or recompute when you need to.
Run it in your browser — bit-budget calculator with i-matrix
The output is the same memory-fit table local-LLM enthusiasts memorize. Knowing it cold (or having this calculator handy) is half of “ship to a phone” engineering.
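A minimal sketch of that calculator. The bits-per-weight figures are the ones quoted in this lesson; the overhead reserved for KV cache and runtime buffers is a rough assumption to tune per device:

```python
# Bit-budget calculator: GGUF file size per quant level, and RAM fit.
BPW = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

def fit_table(params_billions: float, ram_gb: float, overhead_gb: float = 1.5):
    """Print file size per quant and whether model + overhead fits in RAM.

    overhead_gb reserves space for KV cache and runtime buffers; a rough
    assumption, since real headroom depends on context length and backend.
    """
    for name, bpw in BPW.items():
        size_gb = params_billions * bpw / 8     # 1e9 params * bpw bits / 8 bits-per-byte / 1e9
        verdict = "fits" if size_gb + overhead_gb <= ram_gb else "too big"
        print(f"{name:>7}: {size_gb:5.2f} GB -> {verdict} in {ram_gb:g} GB RAM")

fit_table(params_billions=3.2, ram_gb=8)   # e.g. Llama-3.2-3B on an 8 GB device
```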
Key takeaways
- GGUF carries the quantization recipe inside the file. K-quants are baked into the format.
- Q4_K_M ≈ 4.5 effective bits/weight; M variants store some weights at 6-bit. Memorize the bits-per-weight table.
- The i-matrix is calibration-driven importance weighting. Cuts Q4_K_M’s MMLU regression roughly in half.
- Computing an i-matrix is cheap (~5 min) and usually pre-computed by the Hugging Face community per model.
- Q4_K_M with i-matrix is the 2026 universal default for local LLMs ≥ 3B.
Go deeper
- Docs: llama.cpp — Quantization Guide. Authoritative. The K-quant table + i-matrix flow + recommendations per model size.
- Docs: llama.cpp — imatrix tool. The CLI that computes i-matrices. The README has the calibration-set advice.
- Blog: llama.cpp — i-matrix Quantization Discussion. The original discussion thread that introduced the i-matrix to the codebase. Useful for the "why" rather than the "what."
- Docs: Hugging Face — GGUF Documentation. How GGUF integrates with transformers (load GGUF directly into a HF pipeline).
- Blog: Hugging Face Blog — llamafile. Mozilla's single-binary llama.cpp distribution. The most ergonomic way to ship GGUFs to non-developer users.
- Repo: bartowski on Hugging Face. The most prolific GGUF re-quantizer; high-quality i-matrix-calibrated variants for nearly every popular base model. The de facto index.