Quantization
Memory bandwidth, not compute, is the bottleneck for most LLM inference. Quantization is how you move less data — store weights at 4 or 8 bits and dequantize them only when you need to compute. The tradeoff: a small accuracy hit for a large speedup.
4 lessons · ~58 min total
Module capstone — build it
A 4-bit Llama running on your phone
Llama-3.2-3B — quantized to 2 GB, decoded at 12+ tok/sec on the phone in your pocket. Offline. No cloud.
Intermediate · One focused Saturday (~8 h) · Runs on your phone
A working chat app on your phone running Llama-3.2-3B-Instruct as Q4_K_M GGUF via llama.cpp. The artifact is the app + a benchmark report (steady-state tokens/sec under thermal throttling).
Build it — step by step
01 Get the model + convert to GGUF 30 min
Download Llama-3.2-3B-Instruct from Hugging Face. Run `convert_hf_to_gguf.py` from llama.cpp to produce `model.gguf` (FP16, ~6 GB). A scripted sketch of both follows this step.
checkpoint You can run `./main -m model.gguf -p "Hello"` on your laptop and get sensible output.
watch out Download requires HF auth — `huggingface-cli login` first. Some chat templates need an explicit `--chat-template llama-3` flag.
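A minimal sketch of this step in Python, assuming `huggingface_hub` is installed, llama.cpp is cloned alongside, and you've already run `huggingface-cli login`; the converter's `--outfile`/`--outtype` flags may differ slightly between llama.cpp versions:

```python
# Download the HF checkpoint and convert it to an FP16 GGUF.
import subprocess
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.2-3B-Instruct",   # gated: accept the license on HF first
    local_dir="Llama-3.2-3B-Instruct",
)

# Invoke llama.cpp's converter; adjust the path to wherever you cloned llama.cpp.
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
        "--outfile", "model.gguf",
        "--outtype", "f16",
    ],
    check=True,
)
```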
02 Quantize to Q4_K_M 15 min
Run `./quantize model.gguf model-q4.gguf Q4_K_M`. The output is ~2 GB. Verify it generates sensible text via `./main`.
checkpoint ~2 GB GGUF; output looks like the FP16 version on the same prompts.
03 Build llama.cpp for your phone 90 min
iOS: build the Metal-enabled XCFramework via the project's CMake. Android: NDK build with Vulkan or NEON. Verify binaries exist for the target arch.
checkpoint A small test program on your device loads the model and prints one token.
watch out iOS Metal needs to initialize on the main thread; on Android the NDK toolchain version matters (use 26+).
04 Wrap in a 50-line chat UI 180 min
iOS SwiftUI ChatView (or Android Compose equivalent) with TextField + List of messages. Stream tokens via the llama.cpp C API; update UI per token.
checkpoint Type, see tokens stream in, no cloud. Airplane mode works.
05 Benchmark under thermal throttling 45 min
Generate 500 tokens 10 times back-to-back. Plot tokens/sec per run (a plotting sketch follows this step). Note the steady-state (post-throttle) tok/s. Recent iPhone Pro: 10–20 tok/s for 3B Q4_K_M; Snapdragon 8 Gen 3: 8–15 tok/s.
checkpoint You have a thermal-throttle plot. Steady state hits the documented range.
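One way to produce the plot, assuming you logged each run to a `runs.csv` (the file name and columns are placeholders for whatever your on-device logging writes):

```python
# Plot tokens/sec across back-to-back runs to expose thermal throttling.
# Assumed log format: one line per run as "run_index,tokens,seconds".
import csv
import matplotlib.pyplot as plt

runs, toks_per_sec = [], []
with open("runs.csv") as f:
    for row in csv.DictReader(f):
        runs.append(int(row["run_index"]))
        toks_per_sec.append(float(row["tokens"]) / float(row["seconds"]))

plt.plot(runs, toks_per_sec, marker="o")
plt.xlabel("run (500 tokens each, back-to-back)")
plt.ylabel("tokens/sec")
plt.title("Decode throughput under thermal throttling")
plt.savefig("thermal_throttle.png", dpi=150)
```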
06 README + push 60 min
Repo with the source, the GGUF download script (don't commit the 2GB file), the benchmark plot, and a 30s demo video of the app in airplane mode.
checkpoint Reader can clone, build, and have a 4-bit Llama running on their phone in <1 hour.
You walk away with
A working LLM chat app on your phone — fully offline, fully local
Fluency with GGUF, llama.cpp's C API, native mobile build pipelines
A benchmark methodology for thermal throttling on mobile silicon
A demo video that's genuinely fun to show people
Tools you'll use
llama.cpp · GGUF format · Q4_K_M K-quants · Android NDK / iOS Metal · Llama-3.2-3B-Instruct
Going for an inference-systems role instead of edge? Same module, different artifact — implement the quantization itself, on cloud GPU + your Mac:
Module capstone — build it
W4 quantization on Llama-3-8B — from scratch, cloud + Mac
Llama-3-8B quantized to 4 bits with your own kernel — benchmarked against bitsandbytes, AWQ, and GPTQ on H100, then ported to MLX and run on your MacBook. The portfolio piece every inference team wants to see.
Frontier · Five focused weeks (~80 h)
A repo with: (1) your from-scratch W4A16 group-wise quantization in PyTorch, (2) a `QuantLinear` that drops into Llama-3-8B, (3) benchmarks vs bitsandbytes / AWQ / GPTQ on memory, throughput, and MMLU, (4) a port of the same weights to MLX with Mac M-series numbers.
Build it — step by step
01 Read the prior art — one page of notes per scheme 6 h
Read LLM.int8(), GPTQ, AWQ, SmoothQuant. One page of notes per paper: what they quantize (W vs A), how they pick scales, what their accuracy claim is, what their runtime cost is.
checkpoint Four one-page notes. You can answer why AWQ beats RTN, why GPTQ needs calibration data, and why SmoothQuant migrates activation outliers into the weights.
02 RTN + group-wise INT4 in pure PyTorch 8 h
Round-to-nearest with a per-group (group_size=128) scale and zero-point. Pack two int4 values per int8 byte. Write `quantize(weight) -> (qweight, scale, zero)` and `dequantize(...) -> weight` (a sketch follows this step). Target: round-trip relative L2 error < 1%.
checkpoint A 1024×1024 FP16 weight survives quant→dequant with relative L2 error under 1%.
watch out Off-by-one in group boundaries silently corrupts the last group. Test with a non-divisible row size (e.g. 1023) — most bugs surface there.
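A minimal PyTorch sketch of the round-trip, with one deviation from the signature above: `dequantize` also takes the original `in_features` so padded groups (the non-divisible case in the watch-out) can be trimmed off:

```python
# Group-wise round-to-nearest INT4 (asymmetric: per-group scale + zero-point),
# two int4 values packed per uint8 byte. Illustrative layout, not a fixed format.
import torch

GROUP_SIZE = 128

def quantize(weight: torch.Tensor):
    out_features, in_features = weight.shape
    # Pad the input dim so every group is full; the padding is trimmed on dequant.
    pad = (-in_features) % GROUP_SIZE
    w = torch.nn.functional.pad(weight.float(), (0, pad))
    groups = w.reshape(out_features, -1, GROUP_SIZE)            # [out, n_groups, group]

    w_min = groups.amin(dim=-1, keepdim=True)
    w_max = groups.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0              # int4 codes: 0..15
    zero = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(groups / scale) + zero, 0, 15).to(torch.uint8)
    q = q.reshape(out_features, -1)
    # Pack: even columns in the low nibble, odd columns in the high nibble.
    qweight = (q[:, 0::2] | (q[:, 1::2] << 4)).contiguous()
    return qweight, scale.squeeze(-1), zero.squeeze(-1)

def dequantize(qweight, scale, zero, in_features):
    low = qweight & 0x0F
    high = (qweight >> 4) & 0x0F
    q = torch.stack([low, high], dim=-1).reshape(qweight.shape[0], -1).float()
    groups = q.reshape(q.shape[0], -1, GROUP_SIZE)
    w = (groups - zero.unsqueeze(-1)) * scale.unsqueeze(-1)
    return w.reshape(q.shape[0], -1)[:, :in_features].half()

# Round-trip check, on a non-divisible size per the watch-out above:
# w = torch.randn(1024, 1023, dtype=torch.float16)
# w_hat = dequantize(*quantize(w), w.shape[1])
# rel_err = ((w.float() - w_hat.float()).norm() / w.float().norm()).item()
```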
03 QuantLinear that drops into Llama-3-8B 10 h
Write a `QuantLinear(nn.Module)` that holds packed int4 weights + scales + zeros, dequantizes to FP16 inside `forward`, and matmuls (a sketch follows this step). Replace every `nn.Linear` in the attention and MLP blocks. Verify generation reads coherently on 5 prompts.
checkpoint Quantized model peak GPU memory ~5 GB (vs ~16 GB FP16). Sample generations on 5 prompts read fluently.
watch out Quantizing `lm_head` and the embeddings tanks quality. Skip them deliberately — keep them FP16.
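A sketch of the drop-in module, reusing the `quantize`/`dequantize` helpers from the previous step; `replace_linears` and its `skip` list are illustrative names, not an established API:

```python
# Packed-int4 QuantLinear that dequantizes to FP16 inside forward(), plus a recursive
# swap that skips lm_head (embeddings are not nn.Linear, so they are never touched).
import torch
import torch.nn as nn

class QuantLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        qweight, scale, zero = quantize(linear.weight.data)
        self.register_buffer("qweight", qweight)
        self.register_buffer("scale", scale)
        self.register_buffer("zero", zero)
        self.in_features = linear.in_features
        self.bias = linear.bias   # Llama linears are bias-free, but keep it if present

    def forward(self, x):
        w = dequantize(self.qweight, self.scale, self.zero, self.in_features).to(x.dtype)
        return nn.functional.linear(x, w, self.bias)

def replace_linears(module: nn.Module, skip=("lm_head",)):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name not in skip:
            setattr(module, name, QuantLinear(child))
        else:
            replace_linears(child, skip)

# Usage (hypothetical variable names):
# model = AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.float16)
# replace_linears(model)
```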
04 Benchmark harness — memory, throughput, accuracy 12 h
Three measurements: (1) `torch.cuda.max_memory_allocated()` after the first forward, (2) tokens/sec at batch 1 / 8 / 32 with seq_len 2048, (3) MMLU 5-shot via lm-evaluation-harness (a harness sketch follows this step). Five rows: FP16, your INT4, NF4 (bitsandbytes), AWQ, GPTQ.
checkpoint Five rows × three columns of honest numbers. Your INT4 will likely lose 1–3 MMLU points to AWQ — report it, do not hide it.
watch out A tiny eval subset (`limit=10`) gives noisy MMLU. Use the full dev set or do not report a number.
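A sketch of the memory + throughput half of the harness (MMLU comes from lm-evaluation-harness separately), assuming a Hugging Face-style causal LM and tokenizer already loaded on a CUDA device:

```python
# Measurements (1) and (2) for one model variant.
import time
import torch

@torch.inference_mode()
def measure(model, tokenizer, batch_size, seq_len=2048, gen_tokens=128):
    torch.cuda.reset_peak_memory_stats()
    input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, seq_len), device="cuda")

    model(input_ids)                                   # warm-up forward
    peak_gb = torch.cuda.max_memory_allocated() / 1e9  # measurement (1)

    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=gen_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return peak_gb, batch_size * gen_tokens / elapsed  # measurement (2): tokens/sec

# One table row per model variant:
# for bs in (1, 8, 32):
#     print(bs, measure(model, tokenizer, bs))
```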
05 Port to MLX and run on your Mac 12 h
Convert your packed INT4 weights to `mlx.core.array`. Build the same `QuantLinear` in MLX. Generate on the Mac. Measure tokens/sec and active memory (a sketch follows this step).
checkpoint Llama-3-8B-Instruct INT4 generates on a 24 GB Mac at ~10–25 tok/s. Same model on H100 at ~80–120 tok/s. Both numbers in the README.
watch out MLX uses unified memory — CUDA-style `max_memory_allocated` does not apply. Use `mlx.core.metal.get_active_memory()`.
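A sketch of the weight port plus the memory check, assuming the `(qweight, scale, zero)` layout from the step-02 code; the nibble unpacking happens in NumPy, the dequant arithmetic and memory read happen in MLX:

```python
# Port packed INT4 weights to MLX and read active unified memory.
import mlx.core as mx
import numpy as np

GROUP_SIZE = 128

def to_mlx_fp16_weight(qweight, scale, zero, in_features):
    q8 = qweight.cpu().numpy()                                    # packed uint8, two int4s per byte
    low, high = q8 & 0x0F, q8 >> 4
    q = np.stack([low, high], axis=-1).reshape(q8.shape[0], -1)   # unpacked codes 0..15
    q = q.reshape(q8.shape[0], -1, GROUP_SIZE).astype(np.float16)

    s = scale.cpu().numpy().astype(np.float16)[..., None]
    z = zero.cpu().numpy().astype(np.float16)[..., None]

    w = (mx.array(q) - mx.array(z)) * mx.array(s)                 # dequant runs via Metal
    return w.reshape(w.shape[0], -1)[:, :in_features]

w = to_mlx_fp16_weight(qweight, scale, zero, in_features)
mx.eval(w)  # MLX is lazy: force evaluation before reading memory
# Unified memory: read Metal's active allocation instead of CUDA counters.
print(f"active memory: {mx.metal.get_active_memory() / 1e9:.2f} GB")
```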
06 Plots + blog + ship 12 h
Four plots: (a) memory bar, (b) tokens/sec at batch 1/8/32, (c) MMLU bar, (d) accuracy-vs-memory Pareto (a plotting sketch follows this step). README with reproduction commands. Blog post (~2000 words) walking through one surprising finding. Cross-post on the Hugging Face blog. Twitter thread with one chart per tweet.
checkpoint Repo public, blog live, four plots embedded, reproduction works on a fresh H100 box and on a fresh Mac.
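One way to draw plot (d), assuming you dumped your benchmark table to a `results.csv`; the file name and column names are placeholders for your own harness output:

```python
# Accuracy-vs-memory Pareto scatter, one point per method.
# Assumed columns: "method,peak_gb,mmlu".
import csv
import matplotlib.pyplot as plt

rows = list(csv.DictReader(open("results.csv")))
mem = [float(r["peak_gb"]) for r in rows]
acc = [float(r["mmlu"]) for r in rows]

plt.scatter(mem, acc)
for r, x, y in zip(rows, mem, acc):
    plt.annotate(r["method"], (x, y), textcoords="offset points", xytext=(5, 5))
plt.xlabel("peak GPU memory (GB)")
plt.ylabel("MMLU 5-shot accuracy")
plt.title("Accuracy vs memory (upper-left is the win)")
plt.savefig("pareto.png", dpi=150)
```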
You walk away with
A from-scratch W4 quantization that drops into a real model — not a wrapper around bitsandbytes
Honest, side-by-side numbers vs the three production-grade alternatives — the artifact every inference recruiter actually reads
Cross-platform numbers (cloud H100 + Mac) in one repo — a differentiator most candidates do not have
A reproducible benchmark methodology (memory, throughput, accuracy) you can reuse for any future kernel work
Tools you'll use
PyTorch 2.x · Llama-3-8B-Instruct · bitsandbytes (NF4) · AWQ · GPTQ · MLX (Apple Silicon) · lm-evaluation-harness for MMLU