GEMM (Hopper / Blackwell)
When you read a transformer’s source code, almost every layer that matters compiles down to one operation: a matrix multiplication. Every Linear, every QKV projection, every FFN, every embedding lookup at scale — all GEMM. Roughly 80% of LLM training and inference flops live here: a 70B inference run at long context is ~80% GEMM time, and a frontier pretraining run is ~85%. Make GEMM faster and the entire model gets cheaper.
GEMM is also one of the most-studied algorithms in computing. The naive triple-nested-loop version is catastrophically slow on modern hardware — it reads each element from HBM hundreds of times and never touches the tensor cores. Production kernels (cuBLAS, CUTLASS, hipBLASLt, ROCm Composable Kernel, Triton autotuned matmul) layer optimization on optimization: cache blocking at three levels (registers, shared memory, L2), tensor-core MMA instructions in the inner loop, asynchronous TMA copies pipelined with the compute, and warp specialization to overlap producers and consumers. The job is to land at >70% of theoretical peak FLOPs on H100/B200 across the shapes your workload actually uses.
This lesson is the systems engineer’s mental model of why fast matmul is hard, what changed on Hopper / Blackwell, and how to read a CUTLASS source file. After it, you should be able to look at any kernel’s tile shape and recognize the resource budget it encodes.
TL;DR
- GEMM = General Matrix Multiplication: C = α·A·B + β·C. ~80% of LLM training and inference flops live here.
- A naive matmul reads each input element O(N) times — fully memory-bound. Production kernels do cache blocking at multiple levels (registers, shared memory, L2) so each byte from HBM gets reused on the order of the tile size (~100×) instead of once.
- On Hopper / Blackwell, the kernel structure is TMA + WGMMA + warp specialization — async copies overlap with tensor-core matmul, and producer/consumer warps split the work.
- CUTLASS 4 (NVIDIA) and Triton 3.x are the two ways most teams write it in 2026. Hand-tuned CUTLASS still wins by 5–10% on edge shapes; Triton wins on developer time.
- For LLM inference at small batch (decode), GEMM becomes GEMV — different optimization regime, memory-bound, where INT4/FP8 weight quant rules.
Mental model — the multi-level tile
Three nested levels of tiling, three sets of memory:
- Block tile — each thread block (CTA) computes one chunk of C (e.g., a 128×128 output tile). Pulls A and B tiles into shared memory.
- Warp tile — each warp inside the CTA computes a sub-chunk; loads fragments into registers.
- Tensor-core instruction — the inner loop is a fixed-size mma (e.g., the m64n256k16 WGMMA on Hopper).
The whole kernel is a triple nested loop with async data movement between levels.
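To make the budget concrete, here is a small calculator for one hypothetical tile configuration (128×128 block tile, K-step 64, BF16, double-buffered). The tile shapes are illustrative assumptions, not any particular kernel’s config; 228 KiB is H100’s documented per-CTA shared-memory ceiling.

```python
# Sketch: resource budget of a hypothetical 128x128 BF16 block tile.
BM, BN, BK = 128, 128, 64      # block tile M x N and the K-step (assumed shapes)
BYTES = 2                      # BF16
STAGES = 2                     # double-buffered SMEM

a_tile = BM * BK * BYTES       # one A sub-tile staged in shared memory
b_tile = BK * BN * BYTES       # one B sub-tile staged in shared memory
smem = (a_tile + b_tile) * STAGES
print(f"SMEM per CTA: {smem // 1024} KiB of the 228 KiB H100 allows")   # 64 KiB

# Arithmetic intensity at the SMEM level: flops per byte staged per K-step.
flops = 2 * BM * BN * BK
print(f"AI of one K-step: {flops / (a_tile + b_tile):.0f} flops/byte")  # 64
```

Bigger tiles raise the reuse per byte but consume more shared memory and registers; that trade-off is exactly what a kernel’s tile shape encodes.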
The four eras
Era 1: Naive (the wrong baseline)
```c
for (int m = 0; m < M; m++)
    for (int n = 0; n < N; n++)
        for (int k = 0; k < K; k++)
            C[m][n] += A[m][k] * B[k][n];
```

For M = N = K = 4096, this reads each of B’s columns M times — every element of B is loaded 4096× from HBM. Arithmetic intensity (flops per byte loaded) is essentially O(1). Memory-bound, ~1% of peak.
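A back-of-envelope roofline makes the ~1% figure plausible. The H100 SXM numbers below (~3.35 TB/s HBM3, ~67 TFLOPS fp32 without tensor cores) are published spec-sheet figures; the zero-reuse assumption is the worst case for the naive loop.

```python
# Sketch: roofline estimate for the naive fp32 kernel on an H100 SXM.
HBM_BW = 3.35e12     # bytes/s, spec-sheet HBM3 bandwidth
FP32_PEAK = 67e12    # flops/s, spec-sheet fp32 (non-tensor-core) peak

# Inner iteration: 2 flops, and loads A[m][k] + B[k][n] = 8 bytes, with no reuse.
ai = 2 / 8
attainable = min(FP32_PEAK, HBM_BW * ai)   # roofline: bandwidth-limited here
print(f"AI = {ai} flops/byte -> {attainable / 1e12:.2f} TFLOPS, "
      f"about {100 * attainable / FP32_PEAK:.0f}% of fp32 peak")
```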
Era 2: Cache blocking on CPU (Goto, GotoBLAS)
Tile the loops so a sub-tile of B fits in L1 and gets reused M_TILE times instead of once. Loop ordering matters:
```c
for (int kk = 0; kk < K; kk += K_TILE)
    for (int mm = 0; mm < M; mm += M_TILE)
        for (int nn = 0; nn < N; nn += N_TILE)
            // ... inner block uses sub-tiles, with the K loop innermost so A is contiguous
```

This was state of the art for CPU GEMM through the 2000s. The concept survives; everything modern is cache blocking with extra steps.
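The same idea as a runnable toy in NumPy (tile size and shapes are arbitrary picks; NumPy slicing stands in for packing tiles into cache):

```python
import numpy as np

def blocked_matmul(A, B, tile=64):
    """Cache-blocked C = A @ B: compute tile-sized blocks so the working
    set (one A sub-tile + one B sub-tile) fits in cache."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for kk in range(0, K, tile):
        for mm in range(0, M, tile):
            for nn in range(0, N, tile):
                C[mm:mm+tile, nn:nn+tile] += (
                    A[mm:mm+tile, kk:kk+tile] @ B[kk:kk+tile, nn:nn+tile]
                )
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(blocked_matmul(A, B), A @ B, atol=1e-3)
```

The inner `@` on tile-sized operands is where a real kernel would run its packed micro-kernel.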
Era 3: Tensor cores on Volta–Ampere
Volta (2017) introduced tensor cores — a single instruction that computes a small fp16 matmul-accumulate (m8n8k4 in PTX). By Ampere (A100, 2020), the instruction had grown to m16n8k16, and its throughput was an order of magnitude over the fp32 ALUs.
Every modern GEMM kernel structure became:
```
for k_tile in K:
    cooperative load A_tile, B_tile into SMEM
    __syncthreads()
    for warp_k in K_tile:
        load A_frag, B_frag into registers via ldmatrix
        mma.sync(C_frag, A_frag, B_frag, C_frag)   // tensor-core instruction
    __syncthreads()
write C tile back to HBM
```

CUTLASS’s tiled-GEMM template hierarchy was built around this pattern; each tile size is tuned for the SM architecture and the dtype.
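For scale, here is a sketch of the instruction count this structure implies, assuming a 128×128 block tile with a K-step of 64, four warps each owning a 64×64 warp tile, and Ampere’s m16n8k16 shape (all of these are illustrative choices):

```python
# Sketch: tensor-core instruction count for one hypothetical block tile.
BM, BN, BK = 128, 128, 64   # block tile and K-step (assumed)
WM, WN = 64, 64             # warp tile; 4 warps cover the 128x128 block (assumed)
IM, IN, IK = 16, 8, 16      # Ampere mma.sync m16n8k16

mma_per_warp = (WM // IM) * (WN // IN) * (BK // IK)
warps = (BM // WM) * (BN // WN)
print(f"{mma_per_warp} mma.sync per warp per K-step, "
      f"{mma_per_warp * warps} per block tile")   # 128 per warp, 512 per tile
```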
Era 4: Hopper & Blackwell — async + warp specialization
Hopper (H100, 2022) added two big things: TMA (the Tensor Memory Accelerator) and WGMMA (warpgroup-level matmul), along with thread-block clusters (CGAs) for cluster-level cooperation. Blackwell (B200, 2025) added 5th-gen tensor cores and native NVFP4.
The new structural pattern is producer/consumer warps:
```
warp_group_0 (producer): TMA loads — cp.async.bulk from HBM into SMEM, signalling mbarriers
warp_group_1 (consumer): WGMMA — reads SMEM, accumulates into the register file
mbarrier flags each tile as ready; double-buffered SMEM keeps both groups busy
```

The kernel is essentially a software pipeline: tile k+1 is being TMA-loaded while tile k is being computed. Done right, this hides HBM latency almost completely.
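A toy model of why the overlap pays: with two SMEM buffers, the steady-state cost per tile is max(load, compute) rather than their sum. The timings here are made-up illustrative numbers, not H100 measurements:

```python
# Sketch: serial vs double-buffered pipeline over n_tiles K-steps.
def pipeline_time(n_tiles, t_load, t_compute):
    """Double-buffered: the first load is exposed, then each load hides
    behind compute (or vice versa), and the final compute drains the pipe."""
    return t_load + (n_tiles - 1) * max(t_load, t_compute) + t_compute

N_TILES, T_LOAD, T_COMPUTE = 64, 5, 4          # arbitrary time units
serial = N_TILES * (T_LOAD + T_COMPUTE)        # no overlap at all
print(f"serial: {serial}, pipelined: {pipeline_time(N_TILES, T_LOAD, T_COMPUTE)}")
# serial: 576, pipelined: 324 -- the slower of the two stages sets the pace
```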
CUTLASS 4 expresses this with the CuTe layout language. A modern Hopper GEMM kernel is ~300 lines of C++ that orchestrates this dance.
Real numbers — Llama-3.1 70B inference, batch 1, decode
Per-token GEMM time on a single H100, BF16:
| Implementation | tok/s (decode) | % of HBM-BW peak | Notes |
|---|---|---|---|
| Naive PyTorch | ~5 | 4% | Reference only |
| `torch.matmul` (cuBLAS) | ~38 | 30% | Default in eager PyTorch |
| `torch.compile` (Inductor → CUTLASS) | ~52 | 41% | + fusion of dequant + GEMM |
| vLLM + Marlin INT4 kernel | ~88 | 70% | Memory-bound regime; weight quant pays off |
| TensorRT-LLM FP8 | ~110 | — | Compute-bound past this point |
Decode is memory-bound. Each generated token re-reads the entire model from HBM. Faster GEMM doesn’t help past ~70% HBM-bw — what helps is reading less (quantization, KV-cache compression).
Prefill is compute-bound. Long prompts process many tokens in parallel; with all three GEMM dimensions large, the kernel hits 80%+ of fp16 peak. This is where tensor-core throughput really matters.
This split is the central reason production stacks treat prefill and decode as fundamentally different jobs (see Disaggregated Serving).
Run it in your browser — predict GEMM throughput
You’ll see arithmetic intensity ~1024 for big training shapes (compute-bound, GEMM thrives) and ~256 for inference attention (still compute-bound), but AI < 16 for batch-1 decode — pure memory-bound, no kernel can save you.
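A minimal version of that prediction, assuming each matrix is touched exactly once (ideal caching); the shapes below are representative picks, and exact AI values shift with shape and dtype:

```python
# Sketch: arithmetic intensity of C = A @ B, in flops per byte of HBM traffic.
def gemm_ai(M, N, K, bytes_per_el=2):                     # 2 bytes -> BF16
    flops = 2 * M * N * K
    bytes_moved = bytes_per_el * (M * K + K * N + M * N)  # each matrix once
    return flops / bytes_moved

# H100 SXM ridge point (spec-sheet assumption): ~989e12 / 3.35e12 ~= 295 flops/byte.
print(f"training shape M=N=K=8192   -> AI = {gemm_ai(8192, 8192, 8192):6.1f}")  # ~2731
print(f"decode GEMV   M=1, N=K=8192 -> AI = {gemm_ai(1, 8192, 8192):6.1f}")     # ~1.0
```

Anything far above the ridge point is compute-bound; batch-1 decode sits near 1 flop/byte, which no kernel schedule can rescue.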
Key takeaways
- GEMM is the workload. 80%+ of training and inference compute. Fast GEMM = cheaper AI.
- Three levels of tiling — block (SMEM), warp (registers), tensor-core instruction. Modern kernels are software pipelines across these.
- Hopper / Blackwell broke the old kernel template. TMA + WGMMA + warp specialization replaced the `__syncthreads()` + `mma.sync` pattern. CUTLASS 4 / CuTe is the way to write it.
- Prefill ≠ decode. Prefill is compute-bound, GEMM peak matters. Decode is memory-bound, weight quantization rules.
- Don’t write GEMM from scratch in production. Use cuBLAS, CUTLASS, or Triton autotune. Read CUTLASS source — it’s the best free systems-coding reference on Earth.
Go deeper
- Blog: How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance. Walks through 10 progressive kernel versions on Ampere. The single best one-blog GEMM tutorial.
- Paper: Outperforming cuBLAS on H100: A Worked Example with FP16. The Hopper-specific kernel that finally beat cuBLAS on a real shape. CUTLASS-based.
- Docs: NVIDIA/cutlass. CUTLASS 4 with CuTe. The reference. Read `examples/` and the `cute/` headers — the best systems C++ on the internet.
- Docs: Triton — Matrix Multiplication tutorial. ~150 lines of Python that hit >70% of cuBLAS perf. The gentlest path to a fast modern GEMM.
- Paper: ThunderKittens: Simple, Fast, and Adorable AI Kernels. A tile-DSL alternative to CUTLASS — terse, expressive, hits the same peak.
- Video: Tri Dao — Hardware Software Co-Design for AI. Why kernels look the way they do, from the FlashAttention author.
- Docs: PTX ISA — wgmma / mma instructions. The actual hardware instruction set. Skim it once; you’ll never look at a kernel the same way again.
- Paper: Marlin: Mixed-Precision GEMM for INT4 LLMs. The INT4-weight + FP16-activation kernel that powers vLLM’s quantized inference. Required reading for the memory-bound regime.