Track 02 · ML Execution & Quantization
What an ML workload actually does on the metal.
Before reasoning about ML compilers, you need to understand what an ML workload actually looks like at the hardware level. This track covers the primitives every model is built from: tensors, matrix multiplication, GPU programming, and the numerics tricks that make models fit on real machines.
- Tensor internals — strides, contiguous vs. non-contiguous layouts, building a tensor library from scratch
- GPU programming — SM architecture, thread hierarchy, shared memory, GEMM
- Quantization — INT8/INT4 weight packing, dequantization kernels, KV-cache quantization
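The strides idea from the first chapter fits in a few lines of Python. This is a hypothetical toy `Tensor` class (not the track's actual library) showing that a stride is just "how many elements to skip per dimension", which is why a transpose can be free, and why walking a non-contiguous tensor can be slow: the reads stop being sequential in memory.

```python
# Toy sketch: strides map a multi-dimensional index onto flat storage.
class Tensor:
    def __init__(self, data, shape, strides):
        self.data = data          # flat list of values
        self.shape = shape
        self.strides = strides    # elements (not bytes) to skip per dimension

    @classmethod
    def from_shape(cls, data, shape):
        # Row-major (C-contiguous) strides: the last dimension varies fastest.
        strides, acc = [], 1
        for dim in reversed(shape):
            strides.append(acc)
            acc *= dim
        return cls(data, shape, tuple(reversed(strides)))

    def __getitem__(self, idx):
        # offset = sum(index_i * stride_i) -- no data is ever copied or moved
        return self.data[sum(i * s for i, s in zip(idx, self.strides))]

    def transpose(self):
        # Transpose is pure metadata: swap shape and strides, reuse the data.
        return Tensor(self.data, self.shape[::-1], self.strides[::-1])

t = Tensor.from_shape(list(range(6)), (2, 3))   # logically [[0,1,2],[3,4,5]]
tt = t.transpose()                               # logically [[0,3],[1,4],[2,5]]
print(t[0, 2], tt[2, 0])                         # same flat element both times
```

Note that `tt` has strides `(1, 3)`: iterating its rows jumps around the flat buffer instead of streaming through it, which is the root of the "non-contiguous is slower" effect.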
By the end of the track, you should be able to:
- Reason about why a "non-contiguous" tensor is sometimes 10× slower
- Read a CUDA kernel and predict its memory traffic
- Quantize a model from FP16 to INT4 and understand the tradeoffs you're making
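The tradeoff in the last bullet is easiest to see in the simpler INT8 case. Below is a minimal sketch assuming per-tensor symmetric quantization (the function names are mine, not the track's); real INT4 schemes add per-group scales and bit-packing on top of the same idea.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-tensor symmetric quantization: map the largest |weight| to 127."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Storage is 8-bit; compute happens after widening back to float.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)

# The tradeoff made explicit: 4x smaller storage than FP32, at the cost of a
# rounding error bounded by scale/2 per weight.
err = float(np.abs(dequantize(q, scale) - w).max())
print(f"scale={scale:.4f}, max abs error={err:.4f}")
```

A single outlier weight inflates `scale` and with it the error on every other weight, which is why practical INT4 schemes quantize in small groups with one scale each.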