Quantization

Memory bandwidth, not compute, is the bottleneck for most LLM inference: during autoregressive decoding, every generated token requires reading the full set of weights from memory, so time per token is roughly model size in bytes divided by memory bandwidth. Quantization is how you move less data: store weights at 4 or 8 bits and dequantize them only at compute time. The tradeoff is a small accuracy hit for a large speedup.
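To make the mechanics concrete, here is a minimal sketch of symmetric per-channel int8 weight quantization in NumPy (illustrative only; the names `quantize_int8` and `dequant_matmul` are made up, not from this module): quantize once offline, store the int8 weights plus one float scale per output channel, and expand back to float only at matmul time.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-output-channel int8 quantization.

    Each row of w gets its own scale so that the largest-magnitude
    weight in that row maps to +/-127.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # (out, 1)
    scale = np.where(scale == 0, 1.0, scale)               # avoid divide-by-zero
    q = np.round(w / scale).astype(np.int8)                # (out, in) int8
    return q, scale.astype(np.float32)

def dequant_matmul(q: np.ndarray, scale: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W @ x, expanding int8 weights to float only at compute time."""
    w = q.astype(np.float32) * scale                       # dequantize
    return w @ x

# Quick check: the quantized matmul tracks the float baseline.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)
x = rng.normal(size=(512,)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequant_matmul(q, s, x) - w @ x).max()
print(f"int8: {q.nbytes} bytes vs fp32: {w.nbytes} bytes, max err {err:.4f}")
```

This sketch materializes the full float matrix for clarity; a real inference kernel fuses the dequantization into the matmul so the expanded weights never touch DRAM, which is where the bandwidth saving comes from.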


Going for an inference-systems role instead of edge? Same module, different artifact — implement the quantization itself, on cloud GPU + your Mac: