Kernels & Hardware
When the compiler isn’t enough, you write the kernel. This module covers the kernel DSLs that win in practice — Triton (the daily driver), CUTLASS / CuTe (the NVIDIA reference), ThunderKittens / TileLang (the post-Triton frontier) — and a coming lesson on the 2026 hardware landscape that frames which DSL fits which chip.
0 / 4 lessons ~60 min total
Module capstone — build it
Three matmuls, three DSLs — same kernel, three implementations. The fastest way to feel each DSL's voice is to write the same kernel in all three: Triton, CUTLASS, and ThunderKittens (or TileLang).
Frontier · Two focused weekends (~30 h) · Free Colab T4
One repo with three implementations of a 16-bit GEMM (M=N=2048, K=4096): a Triton autotune kernel, a CUTLASS-builder GEMM, and a ThunderKittens or TileLang kernel. The artifact is the three sources side by side, a benchmark plot showing all three vs cuBLAS, and a one-page README discussing the lines-of-code-vs-perf tradeoff each picked.
Build it — step by step
01 Set up a benchmarking harness 45 min
A `bench.py` that takes a kernel function, warms it up, then times 50 runs on M=N=2048, K=4096 FP16 inputs. Reports TFLOPs/s and compares against `torch.matmul` (cuBLAS).
checkpoint You can plug in any kernel function and get a TFLOPs/s number.
watch out If you forget `torch.cuda.synchronize()` before stopping the timer, you measure launch latency, not kernel time.
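A minimal sketch of such a harness, using CUDA events for timing (the function and parameter names here are illustrative, not prescribed by the capstone):

```python
def tflops_per_s(ms: float, M: int, N: int, K: int) -> float:
    """A GEMM performs 2*M*N*K floating-point ops; ms is mean time per call."""
    return 2 * M * N * K / (ms * 1e-3) / 1e12

def bench(fn, M=2048, N=2048, K=4096, warmup=10, iters=50):
    import torch  # imported here so the pure TFLOPs helper works without torch

    a = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b = torch.randn(K, N, device="cuda", dtype=torch.float16)
    for _ in range(warmup):
        fn(a, b)
    torch.cuda.synchronize()  # drain warmup work before timing starts

    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(a, b)
    stop.record()
    torch.cuda.synchronize()  # wait for the kernels, not just their launches
    ms = start.elapsed_time(stop) / iters
    return tflops_per_s(ms, M, N, K)
```

Usage: `bench(torch.matmul)` gives the cuBLAS baseline; plug each of your three kernels into the same call to get comparable numbers.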
02 Implementation 1 — Triton with autotune 90 min
Write the matmul kernel from this lesson, with 6–8 autotune configs spanning (BLOCK_M, BLOCK_N, BLOCK_K, num_warps, num_stages). Verify correctness vs torch.matmul, run bench.
checkpoint Triton kernel hits 75–90% of cuBLAS at M=N=2048. Lines of code: ~50.
watch out Forgetting to mask the load when M, N, or K aren't multiples of the block size. Random nonsense values poison your accumulator silently.
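A sketch of the shape such a kernel takes — a standard Triton tiled matmul with an autotune decorator and masked loads (config values and helper names are illustrative; tune them on your GPU):

```python
import torch
import triton
import triton.language as tl

# Six configs spanning tile shape, warp count, and pipeline depth.
configs = [
    triton.Config({"BLOCK_M": bm, "BLOCK_N": bn, "BLOCK_K": bk},
                  num_warps=w, num_stages=s)
    for bm, bn, bk, w, s in [
        (128, 128, 32, 8, 3), (128, 64, 32, 4, 4), (64, 128, 32, 4, 4),
        (128, 128, 64, 8, 3), (64, 64, 32, 4, 5), (128, 64, 64, 4, 4),
    ]
]

@triton.autotune(configs=configs, key=["M", "N", "K"])
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Mask every load: out-of-bounds garbage would silently poison
        # the accumulator when M/N/K aren't multiples of the block size.
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc = tl.dot(a, b, acc)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16),
             mask=(rm[:, None] < M) & (rn[None, :] < N))

def matmul(a, b):
    M, K = a.shape
    K, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_M"]),
                         triton.cdiv(N, meta["BLOCK_N"]))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1))
    return c
```

Verify with `torch.testing.assert_close(matmul(a, b), a @ b, rtol=1e-2, atol=1e-2)` before benchmarking — FP16 accumulation differences mean you want loose tolerances, not exact equality.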
03 Implementation 2 — CUTLASS using CollectiveBuilder 180 min
Use the Hopper CollectiveBuilder pattern from the CuTe lesson. Pick the `KernelTmaWarpSpecializedPingpong` schedule, a 128×128×64 tile, and a 2-CTA cluster. Compile (slow). Bench.
checkpoint CUTLASS kernel hits 85–95% of cuBLAS. Lines of code: ~150 (most of which is template parameters).
watch out Wrong precision combination → opaque template-instantiation error. Pin the exact (ElementA, LayoutA, ElementB, LayoutB, ElementAcc, ElementOutput) tuple from a known-working CUTLASS example.
04 Implementation 3 — ThunderKittens (or TileLang) 120 min
Pick whichever DSL's build environment you can stand up faster. Implement the same 64×64 register-tile GEMM with TMA loads and a two-stage pipeline. Bench.
checkpoint TK/TileLang kernel within 5% of CUTLASS, in ~30–60 lines of source.
watch out TK template errors are walls of text. When stuck, copy one of the `kernels/matmul/` examples from the TK repo verbatim, then modify.
05 Profile all three with ncu 60 min
Capture a profile of each kernel at the same shape: `ncu --set full --kernel-name <kernel_name> python bench.py`. Look at tensor-core utilization, SMEM bank conflicts, occupancy. Note the differences.
checkpoint You have a 3-row table comparing tensor-core %, SMEM conflicts, and TFLOPs/s for the three.
06 README + plot 60 min
Repo with `triton_gemm.py`, `cutlass_gemm.cu`, `tk_gemm.cu` (or `tilelang_gemm.py`), `bench.py`, `README.md` that summarizes the LOC, peak TFLOPs/s, time-to-write, and which one you'd reach for next time.
checkpoint A reader who clones can install dependencies, run `python bench.py`, and reproduce your plot.
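The plot itself can be a few lines of matplotlib — a sketch with placeholder numbers (the 1.0 values and the output filename are stand-ins; substitute what your `bench.py` reports):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render straight to a file
import matplotlib.pyplot as plt

# Placeholder values -- replace with the TFLOPs/s your harness measured.
results = {"cuBLAS": 1.0, "Triton": 1.0, "CUTLASS": 1.0, "TK/TileLang": 1.0}

fig, ax = plt.subplots()
ax.bar(results.keys(), results.values())
ax.set_ylabel("TFLOPs/s")
ax.set_title("FP16 GEMM, M=N=2048, K=4096")
fig.savefig("gemm_bench.png", dpi=150)
```

Commit the PNG alongside the README so a reader sees the comparison without running anything.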
You walk away with
- Working implementations of the same kernel in three different DSLs — the canonical career portfolio piece for a kernel engineer
- A felt sense of which DSL costs what (lines of code, time, the last 5% of perf)
- Fluency reading ncu output across kernel styles — what a Triton kernel's profile looks like vs CUTLASS vs TK
- A repo whose three sources let any future-you (or hire) immediately see how each DSL expresses the same idea

Tools you'll use
- Triton 3.x (autotune)
- CUTLASS 4 (Hopper builder)
- ThunderKittens or TileLang
- cuBLAS as reference
- ncu for the kernel-level profile
- matplotlib for the plot