Quantization
Memory bandwidth, not compute, is the bottleneck for most LLM inference. Quantization is how you move less data — store weights at 4 or 8 bits and dequantize them only when you need to compute. The tradeoff: a small accuracy hit for a large speedup.
4 lessons · ~58 min total
Module capstone — build it
A 4-bit Llama running on your phone
Llama-3.2-3B — quantized to 2 GB, decoded at 12+ tok/sec on the phone in your pocket. Offline. No cloud.
Intermediate · One focused Saturday (~8 h) · Runs on your phone
A working chat app on your phone running Llama-3.2-3B-Instruct as Q4_K_M GGUF via llama.cpp. The artifact is the app + a benchmark report (steady-state tokens/sec under thermal throttling).
Build it — step by step
01 Get the model + convert to GGUF 30 min
Download Llama-3.2-3B-Instruct from Hugging Face. Run `convert_hf_to_gguf.py` from llama.cpp to produce `model.gguf` (FP16, ~6 GB). A scripted sketch of both follows this step.
checkpoint You can run `./main -m model.gguf -p "Hello"` on your laptop and get sensible output.
watch out Download requires HF auth — `huggingface-cli login` first. Some chat templates need an explicit `--chat-template llama-3` flag.
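A minimal sketch of this step in Python, assuming `huggingface_hub` is installed, llama.cpp is cloned alongside, and you've already run `huggingface-cli login`; the converter's `--outfile`/`--outtype` flags may differ slightly between llama.cpp versions:

```python
# Download the HF checkpoint and convert it to an FP16 GGUF.
import subprocess
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.2-3B-Instruct",   # gated: accept the license on HF first
    local_dir="Llama-3.2-3B-Instruct",
)

# Invoke llama.cpp's converter; adjust the path to wherever you cloned llama.cpp.
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
        "--outfile", "model.gguf",
        "--outtype", "f16",
    ],
    check=True,
)
```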
02 Quantize to Q4_K_M 15 min
Run `./quantize model.gguf model-q4.gguf Q4_K_M`. The output is ~2 GB. Verify it generates sensible text via `./main`.
checkpoint ~2 GB GGUF; output looks like the FP16 version on the same prompts.
03 Build llama.cpp for your phone 90 min
iOS: build the Metal-enabled XCFramework via the project's CMake. Android: NDK build with Vulkan or NEON. Verify binaries exist for the target arch.
checkpoint A small test program on your device loads the model and prints one token.
watch out iOS Metal needs to initialize on the main thread; on Android the NDK toolchain version matters (use 26+).
04 Wrap in a 50-line chat UI 180 min
iOS SwiftUI ChatView (or Android Compose equivalent) with TextField + List of messages. Stream tokens via the llama.cpp C API; update UI per token.
checkpoint Type, see tokens stream in, no cloud. Airplane mode works.
05 Benchmark under thermal throttling 45 min
Generate 500 tokens 10 times back-to-back. Plot tokens/sec per run (a plotting sketch follows this step). Note the steady-state (post-throttle) tok/s. Recent iPhone Pro: 10–20 tok/s for 3B Q4_K_M; Snapdragon 8 Gen 3: 8–15 tok/s.
checkpoint You have a thermal-throttle plot. Steady state hits the documented range.
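One way to produce the plot, assuming you logged each run to a `runs.csv` (the file name and columns are placeholders for whatever your on-device logging writes):

```python
# Plot tokens/sec across back-to-back runs to expose thermal throttling.
# Assumed log format: one line per run as "run_index,tokens,seconds".
import csv
import matplotlib.pyplot as plt

runs, toks_per_sec = [], []
with open("runs.csv") as f:
    for row in csv.DictReader(f):
        runs.append(int(row["run_index"]))
        toks_per_sec.append(float(row["tokens"]) / float(row["seconds"]))

plt.plot(runs, toks_per_sec, marker="o")
plt.xlabel("run (500 tokens each, back-to-back)")
plt.ylabel("tokens/sec")
plt.title("Decode throughput under thermal throttling")
plt.savefig("thermal_throttle.png", dpi=150)
```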
06 README + push 60 min
Repo with the source, the GGUF download script (don't commit the 2GB file), the benchmark plot, and a 30s demo video of the app in airplane mode.
checkpoint Reader can clone, build, and have a 4-bit Llama running on their phone in <1 hour.
You walk away with
A working LLM chat app on your phone — fully offline, fully local
Fluency with GGUF, llama.cpp's C API, native mobile build pipelines
A benchmark methodology for thermal throttling on mobile silicon
A demo video that's genuinely fun to show people
Tools you'll use
llama.cpp · GGUF format · Q4_K_M K-quants · Android NDK / iOS Metal · Llama-3.2-3B-Instruct
Going for an inference-systems role instead of edge? Same module, different artifact — implement the quantization itself, on cloud GPU + your Mac:
Module capstone — build it
W4 quantization on Llama-3-8B — from scratch, cloud + Mac
Llama-3-8B quantized to 4 bits with your own kernel — benchmarked against bitsandbytes, AWQ, and GPTQ on H100, then ported to MLX and run on your MacBook. The portfolio piece every inference team wants to see.
Frontier · Five focused weeks (~80 h)
A repo with: (1) your from-scratch W4A16 group-wise quantization in PyTorch, (2) a `QuantLinear` that drops into Llama-3-8B, (3) benchmarks vs bitsandbytes / AWQ / GPTQ on memory, throughput, and MMLU, (4) a port of the same weights to MLX with Mac M-series numbers.
Build it — step by step
01 Read the prior art — one page of notes per scheme 6 h
Read LLM.int8(), GPTQ, AWQ, SmoothQuant. One page of notes per paper: what they quantize (W vs A), how they pick scales, what their accuracy claim is, what their runtime cost is.
checkpoint Four one-page notes. You can answer why AWQ beats RTN, why GPTQ needs calibration data, and why SmoothQuant migrates activation outliers into the weights.
02 RTN + group-wise INT4 in pure PyTorch 8 h
Round-to-nearest with a per-group (group_size=128) scale and zero-point. Pack two int4 values per int8 byte. Write `quantize(weight) -> (qweight, scale, zero)` and `dequantize(...) -> weight` (a sketch follows this step). Target: round-trip relative L2 error < 1%.
checkpoint A 1024×1024 FP16 weight survives quant→dequant with relative L2 error under 1%.
watch out Off-by-one in group boundaries silently corrupts the last group. Test with a non-divisible row size (e.g. 1023) — most bugs surface there.
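A minimal PyTorch sketch of the round-trip, with one deviation from the signature above: `dequantize` also takes the original `in_features` so padded groups (the non-divisible case in the watch-out) can be trimmed off:

```python
# Group-wise round-to-nearest INT4 (asymmetric: per-group scale + zero-point),
# two int4 values packed per uint8 byte. Illustrative layout, not a fixed format.
import torch

GROUP_SIZE = 128

def quantize(weight: torch.Tensor):
    out_features, in_features = weight.shape
    # Pad the input dim so every group is full; the padding is trimmed on dequant.
    pad = (-in_features) % GROUP_SIZE
    w = torch.nn.functional.pad(weight.float(), (0, pad))
    groups = w.reshape(out_features, -1, GROUP_SIZE)            # [out, n_groups, group]

    w_min = groups.amin(dim=-1, keepdim=True)
    w_max = groups.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0              # int4 codes: 0..15
    zero = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(groups / scale) + zero, 0, 15).to(torch.uint8)
    q = q.reshape(out_features, -1)
    # Pack: even columns in the low nibble, odd columns in the high nibble.
    qweight = (q[:, 0::2] | (q[:, 1::2] << 4)).contiguous()
    return qweight, scale.squeeze(-1), zero.squeeze(-1)

def dequantize(qweight, scale, zero, in_features):
    low = qweight & 0x0F
    high = (qweight >> 4) & 0x0F
    q = torch.stack([low, high], dim=-1).reshape(qweight.shape[0], -1).float()
    groups = q.reshape(q.shape[0], -1, GROUP_SIZE)
    w = (groups - zero.unsqueeze(-1)) * scale.unsqueeze(-1)
    return w.reshape(q.shape[0], -1)[:, :in_features].half()

# Round-trip check, on a non-divisible size per the watch-out above:
# w = torch.randn(1024, 1023, dtype=torch.float16)
# w_hat = dequantize(*quantize(w), w.shape[1])
# rel_err = ((w.float() - w_hat.float()).norm() / w.float().norm()).item()
```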
03 QuantLinear that drops into Llama-3-8B 10 h
Write a `QuantLinear(nn.Module)` that holds packed int4 weights + scales + zeros, dequantizes to FP16 inside `forward`, and matmuls (a sketch follows this step). Replace every `nn.Linear` in the attention and MLP blocks. Verify generation reads coherently on 5 prompts.
checkpoint Quantized model peak GPU memory ~5 GB (vs ~16 GB FP16). Sample generations on 5 prompts read fluently.
watch out Quantizing `lm_head` and the embeddings tanks quality. Skip them deliberately — keep them FP16.
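A sketch of the drop-in module, reusing the `quantize`/`dequantize` helpers from the previous step; `replace_linears` and its `skip` list are illustrative names, not an established API:

```python
# Packed-int4 QuantLinear that dequantizes to FP16 inside forward(), plus a recursive
# swap that skips lm_head (embeddings are not nn.Linear, so they are never touched).
import torch
import torch.nn as nn

class QuantLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        qweight, scale, zero = quantize(linear.weight.data)
        self.register_buffer("qweight", qweight)
        self.register_buffer("scale", scale)
        self.register_buffer("zero", zero)
        self.in_features = linear.in_features
        self.bias = linear.bias   # Llama linears are bias-free, but keep it if present

    def forward(self, x):
        w = dequantize(self.qweight, self.scale, self.zero, self.in_features).to(x.dtype)
        return nn.functional.linear(x, w, self.bias)

def replace_linears(module: nn.Module, skip=("lm_head",)):
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name not in skip:
            setattr(module, name, QuantLinear(child))
        else:
            replace_linears(child, skip)

# Usage (hypothetical variable names):
# model = AutoModelForCausalLM.from_pretrained(..., torch_dtype=torch.float16)
# replace_linears(model)
```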
04 Benchmark harness — memory, throughput, accuracy 12 h
Three measurements: (1) `torch.cuda.max_memory_allocated()` after the first forward, (2) tokens/sec at batch 1 / 8 / 32 with seq_len 2048, (3) MMLU 5-shot via lm-evaluation-harness (a harness sketch follows this step). Five rows: FP16, your INT4, NF4 (bitsandbytes), AWQ, GPTQ.
checkpoint Five rows × three columns of honest numbers. Your INT4 will likely lose 1–3 MMLU points to AWQ — report it, do not hide it.
watch out A tiny eval subset (`limit=10`) gives noisy MMLU. Use the full dev set or do not report a number.
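A sketch of the memory + throughput half of the harness (MMLU comes from lm-evaluation-harness separately), assuming a Hugging Face-style causal LM and tokenizer already loaded on a CUDA device:

```python
# Measurements (1) and (2) for one model variant.
import time
import torch

@torch.inference_mode()
def measure(model, tokenizer, batch_size, seq_len=2048, gen_tokens=128):
    torch.cuda.reset_peak_memory_stats()
    input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, seq_len), device="cuda")

    model(input_ids)                                   # warm-up forward
    peak_gb = torch.cuda.max_memory_allocated() / 1e9  # measurement (1)

    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=gen_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    return peak_gb, batch_size * gen_tokens / elapsed  # measurement (2): tokens/sec

# One table row per model variant:
# for bs in (1, 8, 32):
#     print(bs, measure(model, tokenizer, bs))
```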
05 Port to MLX and run on your Mac 12 h
Convert your packed INT4 weights to `mlx.core.array`. Build the same `QuantLinear` in MLX. Generate on the Mac. Measure tokens/sec and active memory (a sketch follows this step).
checkpoint Llama-3-8B-Instruct INT4 generates on a 24 GB Mac at ~10–25 tok/s. Same model on H100 at ~80–120 tok/s. Both numbers in the README.
watch out MLX uses unified memory — CUDA-style `max_memory_allocated` does not apply. Use `mlx.core.metal.get_active_memory()`.
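A sketch of the weight port plus the memory check, assuming the `(qweight, scale, zero)` layout from the step-02 code; the nibble unpacking happens in NumPy, the dequant arithmetic and memory read happen in MLX:

```python
# Port packed INT4 weights to MLX and read active unified memory.
import mlx.core as mx
import numpy as np

GROUP_SIZE = 128

def to_mlx_fp16_weight(qweight, scale, zero, in_features):
    q8 = qweight.cpu().numpy()                                    # packed uint8, two int4s per byte
    low, high = q8 & 0x0F, q8 >> 4
    q = np.stack([low, high], axis=-1).reshape(q8.shape[0], -1)   # unpacked codes 0..15
    q = q.reshape(q8.shape[0], -1, GROUP_SIZE).astype(np.float16)

    s = scale.cpu().numpy().astype(np.float16)[..., None]
    z = zero.cpu().numpy().astype(np.float16)[..., None]

    w = (mx.array(q) - mx.array(z)) * mx.array(s)                 # dequant runs via Metal
    return w.reshape(w.shape[0], -1)[:, :in_features]

w = to_mlx_fp16_weight(qweight, scale, zero, in_features)
mx.eval(w)  # MLX is lazy: force evaluation before reading memory
# Unified memory: read Metal's active allocation instead of CUDA counters.
print(f"active memory: {mx.metal.get_active_memory() / 1e9:.2f} GB")
```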
06 Plots + blog + ship 12 h
Four plots: (a) memory bar, (b) tokens/sec at batch 1/8/32, (c) MMLU bar, (d) accuracy-vs-memory Pareto (a plotting sketch follows this step). README with reproduction commands. Blog post (~2000 words) walking through one surprising finding. Cross-post on the Hugging Face blog. Twitter thread with one chart per tweet.
checkpoint Repo public, blog live, four plots embedded, reproduction works on a fresh H100 box and on a fresh Mac.
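One way to draw plot (d), assuming you dumped your benchmark table to a `results.csv`; the file name and column names are placeholders for your own harness output:

```python
# Accuracy-vs-memory Pareto scatter, one point per method.
# Assumed columns: "method,peak_gb,mmlu".
import csv
import matplotlib.pyplot as plt

rows = list(csv.DictReader(open("results.csv")))
mem = [float(r["peak_gb"]) for r in rows]
acc = [float(r["mmlu"]) for r in rows]

plt.scatter(mem, acc)
for r, x, y in zip(rows, mem, acc):
    plt.annotate(r["method"], (x, y), textcoords="offset points", xytext=(5, 5))
plt.xlabel("peak GPU memory (GB)")
plt.ylabel("MMLU 5-shot accuracy")
plt.title("Accuracy vs memory (upper-left is the win)")
plt.savefig("pareto.png", dpi=150)
```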
You walk away with
A from-scratch W4 quantization that drops into a real model — not a wrapper around bitsandbytes
Honest, side-by-side numbers vs the three production-grade alternatives — the artifact every inference recruiter actually reads
Cross-platform numbers (cloud H100 + Mac) in one repo — a differentiator most candidates do not have
A reproducible benchmark methodology (memory, throughput, accuracy) you can reuse for any future kernel work
Tools you'll use
PyTorch 2.x · Llama-3-8B-Instruct · bitsandbytes (NF4) · AWQ · GPTQ · MLX (Apple Silicon) · lm-evaluation-harness for MMLU