GPU Fundamentals
GPUs are not just “fast CPUs”. They’re a different machine with a different cost model: enormous bandwidth, tiny per-thread state, and painfully slow scalar code. To write fast kernels, you have to think in their terms.
Module capstone — build it
Beat cuBLAS at small-N GEMM in Triton
A tiled matmul kernel that outperforms cuBLAS on the shapes that matter for LLM decode.
Advanced · One focused weekend (~10 h) · Free Colab T4
A single-file Triton kernel that hits 80–95% of cuBLAS at N=4096 and beats it at N=512–1024 (the small-N regime cuBLAS doesn't tune for). The artifact is the kernel + an ncu profile showing tensor-core utilization + a benchmark plot vs cuBLAS + a one-page README explaining the tile choice.
Build it — step by step
01 Set up Colab + verify GPU access 15 min
Open a free T4 Colab. Triton ships with PyTorch 2.x, so no separate install is needed. Run `nvidia-smi`, then run a hello-world Triton kernel that doubles a vector.
checkpoint You see "Tesla T4" or "L4" in nvidia-smi output and your hello-world kernel prints the doubled tensor.
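The hello-world step can look something like the sketch below (names like `double_kernel` and the `BLOCK=1024` choice are illustrative, not prescribed by the lesson). It shows the three moves every Triton kernel repeats: compute per-program offsets, load with a mask, store with the same mask.

```python
# Minimal "double a vector" Triton kernel -- assumes a CUDA GPU is visible.
import torch
import triton
import triton.language as tl

@triton.jit
def double_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements          # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, 2 * x, mask=mask)

if torch.cuda.is_available():
    x = torch.arange(10, device="cuda", dtype=torch.float32)
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)   # one program per BLOCK of elements
    double_kernel[grid](x, out, x.numel(), BLOCK=1024)
    print(out)  # 2 * x
```

If this runs and prints the doubled tensor, the toolchain is working end to end.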
02 Write a naive matmul kernel — block per output tile 60 min
Write the canonical Triton matmul: each program computes one BLOCK_M × BLOCK_N tile of C, looping over K in chunks of BLOCK_K. Use BLOCK_M = BLOCK_N = 64, BLOCK_K = 32. Start with FP16 inputs, FP32 accumulator.
checkpoint Output matches `torch.matmul(A, B)` to atol=1e-2 for random N=512 matrices.
watch out Forgetting `mask=` on `tl.load` for the boundary tiles when M, N, K are not multiples of the block size — silent garbage values.
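The tile-and-accumulate structure, and why the boundary handling matters, can be modeled in plain NumPy. This is a sketch of the index arithmetic the Triton kernel performs, not the kernel itself; NumPy's silent slice clipping plays the role that `mask=` on `tl.load` plays on the GPU.

```python
# NumPy model of the blocked matmul: each (i, j) "program" owns one
# BLOCK_M x BLOCK_N tile of C and accumulates over K in BLOCK_K chunks,
# with FP16 inputs and an FP32 accumulator, as in step 02.
import numpy as np

def tiled_matmul(A, B, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, BLOCK_M):
        for j in range(0, N, BLOCK_N):
            # Boundary tiles are smaller than BLOCK_M x BLOCK_N.
            acc = np.zeros((min(BLOCK_M, M - i), min(BLOCK_N, N - j)),
                           dtype=np.float32)
            for k in range(0, K, BLOCK_K):
                a = A[i:i + BLOCK_M, k:k + BLOCK_K].astype(np.float32)
                b = B[k:k + BLOCK_K, j:j + BLOCK_N].astype(np.float32)
                acc += a @ b
            C[i:i + BLOCK_M, j:j + BLOCK_N] = acc
    return C

# Deliberately non-multiple-of-block sizes to exercise the boundary tiles.
A = np.random.rand(130, 70).astype(np.float16)
B = np.random.rand(70, 90).astype(np.float16)
assert np.allclose(tiled_matmul(A, B),
                   A.astype(np.float32) @ B.astype(np.float32), atol=1e-2)
```

In the real kernel, the `min(...)` clipping becomes a boolean mask passed to `tl.load`/`tl.store`; forget it and those loads read garbage instead of clipping.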
03 Add Triton autotune across tile shapes 45 min
Wrap the kernel in `@triton.autotune` with 6–8 configs varying (BLOCK_M, BLOCK_N, BLOCK_K, num_warps, num_stages). Triton picks the winner per shape automatically on first run.
checkpoint Re-run benchmark; throughput jumps 1.3–2× without changing the kernel body. Print the chosen config to confirm autotune is working.
watch out Autotune cache is per-shape — vary M/N/K in the benchmark loop or you’ll get one config compiled for one shape.
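The decorator shape looks roughly like this; the config values here are illustrative starting points, not tuned winners, and the kernel body is elided. Note `key=["M", "N", "K"]`: Triton re-benchmarks the configs for each new (M, N, K) it sees, which is why the benchmark loop has to actually vary the shapes.

```python
import triton
import triton.language as tl

# Example search space: vary tile shape, warp count, and pipeline depth.
configs = [
    triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=4, num_stages=3),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=8, num_stages=3),
    triton.Config({"BLOCK_M": 64,  "BLOCK_N": 128, "BLOCK_K": 32}, num_warps=8, num_stages=4),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32}, num_warps=8, num_stages=4),
    triton.Config({"BLOCK_M": 32,  "BLOCK_N": 32,  "BLOCK_K": 64}, num_warps=2, num_stages=2),
    triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64,  "BLOCK_K": 64}, num_warps=4, num_stages=2),
]

@triton.autotune(configs=configs, key=["M", "N", "K"])
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr,
                  M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    # ... same tiled body as step 02, now parameterized by the config ...
    pass
```

The first launch for a given key is slow (it times every config); exclude it from your measurements.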
04 Benchmark vs cuBLAS at N = 256, 512, 1024, 2048, 4096 60 min
Use `torch.cuda.Event` for accurate timing. Run each shape 50× after a 10-iter warmup. Plot tokens/s (or TFLOPs) of your kernel vs `torch.matmul` (which dispatches to cuBLAS).
checkpoint You have a matplotlib plot with two curves and clear axis labels. Your kernel beats cuBLAS at N ≤ 1024.
watch out Forgetting to call `torch.cuda.synchronize()` before stopping the timer: kernel launches are asynchronous, so you'll measure launch latency instead of execution time.
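A minimal timing harness along these lines works (the function name and the CPU fallback are illustrative additions; the lesson itself only needs the CUDA-event path):

```python
import time
import torch

def bench(fn, warmup=10, iters=50):
    """Return mean milliseconds per call of fn()."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        # CUDA events time GPU execution; synchronize before reading
        # elapsed_time, otherwise you measure launch latency.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters
    # CPU fallback so the harness also runs without a GPU.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1e3 / iters

a = torch.ones(64, 64)
ms = bench(lambda: torch.matmul(a, a))
print(f"{ms:.3f} ms/iter")
```

Run the same harness over your kernel and `torch.matmul` for each N, then convert ms to TFLOPs (2·M·N·K / time) for the plot.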
05 Profile with ncu and confirm tensor cores 60 min
Run `ncu --set full --kernel-name your_kernel python bench.py`. Look for `sm__pipe_tensor_op_cycles_active.avg.pct_of_peak_sustained_active` — should be 50%+ at large N. Note the L1/L2 hit rates and SMEM bank conflicts.
checkpoint You can point at the ncu line that proves tensor cores are running. Bank conflicts are zero or near-zero.
watch out ncu needs permission to read GPU performance counters, which Colab and other managed environments may restrict (on bare metal this typically means running under sudo). Use the `--launch-skip` flag if your kernel runs many times before the launch you care about.
06 Push to GitHub with a clean README 60 min
Single-file repo: `kernel.py` (Triton kernel + autotune), `bench.py` (benchmark + plot), `README.md` (one paragraph explaining the win and the ncu screenshot).
checkpoint A stranger can clone, run `pip install triton torch matplotlib && python bench.py`, and reproduce your plot in 5 minutes.
You walk away with
- A working Triton kernel that beats cuBLAS on a real shape range
- Fluency reading ncu output: tensor-core utilization, bank conflicts, occupancy
- A reproducible benchmarking workflow for future kernel work
- A clean GitHub repo someone else can clone, run, and reproduce in 5 minutes

Tools you'll use
- Triton 3.x (autotune)
- CUDA 12+
- ncu (Nsight Compute)
- CUTLASS reference
- tensor cores (mma.sync)
- matplotlib for the benchmark plot