Track 02 · ML Execution & Quantization
What an ML workload actually does on the metal.
Before reasoning about ML compilers, you need to understand what an ML workload actually looks like at the hardware level. This track covers the primitives every model is built from: tensors, matrix multiplication, GPU programming, and the numerics tricks that make models fit on real machines.
- Tensor internals — strides, contiguous vs. non-contiguous layouts, building a tensor library from scratch
- GPU programming — SM architecture, thread hierarchy, shared memory, GEMM
- Quantization — INT8/INT4 weight packing, dequantization kernels, KV-cache quantization
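The strides idea from the first chapter fits in a few lines of Python. This is a hypothetical toy `Tensor` class (not the track's actual library) showing that a stride is just "how many elements to skip per dimension", which is why a transpose can be free, and why walking a non-contiguous tensor can be slow: the reads stop being sequential in memory.

```python
# Toy sketch: strides map a multi-dimensional index onto flat storage.
class Tensor:
    def __init__(self, data, shape, strides):
        self.data = data          # flat list of values
        self.shape = shape
        self.strides = strides    # elements (not bytes) to skip per dimension

    @classmethod
    def from_shape(cls, data, shape):
        # Row-major (C-contiguous) strides: the last dimension varies fastest.
        strides, acc = [], 1
        for dim in reversed(shape):
            strides.append(acc)
            acc *= dim
        return cls(data, shape, tuple(reversed(strides)))

    def __getitem__(self, idx):
        # offset = sum(index_i * stride_i) -- no data is ever copied or moved
        return self.data[sum(i * s for i, s in zip(idx, self.strides))]

    def transpose(self):
        # Transpose is pure metadata: swap shape and strides, reuse the data.
        return Tensor(self.data, self.shape[::-1], self.strides[::-1])

t = Tensor.from_shape(list(range(6)), (2, 3))   # logically [[0,1,2],[3,4,5]]
tt = t.transpose()                               # logically [[0,3],[1,4],[2,5]]
print(t[0, 2], tt[2, 0])                         # same flat element both times
```

Note that `tt` has strides `(1, 3)`: iterating its rows jumps around the flat buffer instead of streaming through it, which is the root of the "non-contiguous is slower" effect.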
By the end of the track, you should be able to:
- Reason about why a "non-contiguous" tensor is sometimes 10× slower
- Read a CUDA kernel and predict its memory traffic
- Quantize a model from FP16 to INT4 and understand the tradeoffs you're making
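The tradeoff in the last bullet is easiest to see in the simpler INT8 case. Below is a minimal sketch assuming per-tensor symmetric quantization (the function names are mine, not the track's); real INT4 schemes add per-group scales and bit-packing on top of the same idea.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-tensor symmetric quantization: map the largest |weight| to 127."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Storage is 8-bit; compute happens after widening back to float.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)

# The tradeoff made explicit: 4x smaller storage than FP32, at the cost of a
# rounding error bounded by scale/2 per weight.
err = float(np.abs(dequantize(q, scale) - w).max())
print(f"scale={scale:.4f}, max abs error={err:.4f}")
```

A single outlier weight inflates `scale` and with it the error on every other weight, which is why practical INT4 schemes quantize in small groups with one scale each.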