
GEMM (Hopper / Blackwell)

When you read a transformer’s source code, almost every layer that matters compiles down to one operation: a matrix multiplication. Every Linear, every QKV projection, every FFN matmul, the LM-head logit projection at the end — all GEMM. A 70B inference at long context is ~80% GEMM time; a frontier pretraining run is ~85%. Make GEMM faster and the entire model gets cheaper.

GEMM is also one of the most-studied algorithms in computing. The naive triple-nested-loop version is catastrophically slow on modern hardware — it re-reads the same data from HBM over and over and never touches the tensor cores. Production kernels (cuBLAS, CUTLASS, hipBLASLt, ROCm Composable Kernel, Triton autotuned matmul) layer optimization on optimization: cache blocking at three levels (registers, shared memory, L2), tensor-core MMA instructions for the inner loop, asynchronous TMA copies pipelined with the compute, and warp specialization to overlap producers and consumers. The result lands at >70% of theoretical peak FLOPs on H100/B200.

This lesson is the systems engineer’s mental model of why fast matmul is hard, what changed on Hopper / Blackwell, and how to read a CUTLASS source file. After this you should be able to look at any kernel’s tile shape and recognize the resource budget it’s encoding.

TL;DR

  • GEMM = General Matrix Multiplication. C = α·A·B + β·C. ~80% of LLM training and inference flops live here.
  • A naive matmul reads each input element O(N) times — fully memory-bound. Production kernels do cache blocking at multiple levels (registers, shared memory, L2) so each byte from HBM gets reused ~64–256×.
  • On Hopper / Blackwell, the kernel structure is TMA + WGMMA + warp specialization — async copies overlap with tensor-core matmul, and producer/consumer warps split the work.
  • CUTLASS 4 (NVIDIA) and Triton 3.x are the two ways most teams write it in 2026. Hand-tuned CUTLASS still wins by 5–10% on edge shapes; Triton wins on developer time.
  • For LLM inference at small batch (decode), GEMM becomes GEMV — different optimization regime, memory-bound, where INT4/FP8 weight quant rules.

Mental model — the multi-level tile

Three nested levels of tiling, three sets of memory:

  1. Block tile — each thread block (CTA) computes an M_tile × N_tile chunk of C. Pulls A and B tiles into shared memory.
  2. Warp tile — each warp inside the CTA computes a sub-chunk; loads fragments into registers.
  3. Tensor-core instruction — the inner loop is a fixed-size mma (e.g., 64×128×16 on Hopper WGMMA).

The whole kernel is a triple nested loop with async data movement between levels.
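
To make the hierarchy concrete, here is a NumPy sketch of the same three-level structure run on the CPU (tile names and sizes are illustrative, not CUDA terms: on the GPU the outer loops are the CTA grid, the middle loops are warps, and the innermost product maps onto one tensor-core instruction):

import numpy as np

# Illustrative tile sizes (made up; real kernels tune these per SM and dtype).
BLOCK_M, BLOCK_N, BLOCK_K = 128, 128, 64   # block tile        -> shared memory
WARP_M, WARP_N = 64, 64                    # warp tile         -> registers
MMA_M, MMA_N = 16, 16                      # instruction tile  -> one tensor-core mma

def tiled_matmul(A, B):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for m0 in range(0, M, BLOCK_M):                 # 1. block tile: one "CTA" owns a C chunk
        for n0 in range(0, N, BLOCK_N):
            for k0 in range(0, K, BLOCK_K):         #    stream K in slices ("SMEM" residency)
                A_s = A[m0:m0+BLOCK_M, k0:k0+BLOCK_K]
                B_s = B[k0:k0+BLOCK_K, n0:n0+BLOCK_N]
                for mw in range(0, BLOCK_M, WARP_M):        # 2. warp tile ("registers")
                    for nw in range(0, BLOCK_N, WARP_N):
                        for mi in range(0, WARP_M, MMA_M):  # 3. fixed-size inner product
                            for ni in range(0, WARP_N, MMA_N):
                                r, c = mw + mi, nw + ni
                                C[m0+r:m0+r+MMA_M, n0+c:n0+c+MMA_N] += (
                                    A_s[r:r+MMA_M, :] @ B_s[:, c:c+MMA_N])
    return C

A = np.random.rand(256, 512).astype(np.float32)
B = np.random.rand(512, 256).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)

Each A_s / B_s slice plays the role of a shared-memory tile: it is loaded once per (m0, n0, k0) step and then reused by every warp- and mma-level loop underneath it.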

The four eras

Era 1: Naive (the wrong baseline)

for (int m = 0; m < M; m++)
  for (int n = 0; n < N; n++)
    for (int k = 0; k < K; k++)
      C[m][n] += A[m][k] * B[k][n];

For M = N = K = 4096 this reads B’s columns M times — every element of B is loaded 4096× from HBM. Arithmetic intensity (flops per byte loaded) is essentially O(1). Memory-bound, ~1% of peak.
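
A back-of-envelope check of that claim, assuming fp16 operands (2 bytes each):

# Arithmetic intensity = flops / bytes moved to and from HBM.
M = N = K = 4096
flops = 2 * M * N * K                       # one multiply + one add per inner iteration

naive_bytes = 2 * (2 * M * N * K)           # A[m][k] and B[k][n] re-fetched every iteration (C ignored)
ideal_bytes = 2 * (M * K + K * N + M * N)   # every element of A, B, C touched exactly once

print(flops / naive_bytes)   # 0.5 flop/byte   -> hopelessly memory-bound
print(flops / ideal_bytes)   # ~1365 flop/byte -> plenty of headroom to feed tensor cores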

Era 2: Cache blocking on CPU (Goto, GotoBLAS)

Tile the loops so a sub-tile of B fits in L1, gets reused M_tile times. Loop ordering matters:

for (int kk = 0; kk < K; kk += K_TILE)
  for (int mm = 0; mm < M; mm += M_TILE)
    for (int nn = 0; nn < N; nn += N_TILE)
      // ... inner block works on the sub-tiles, with its k loop innermost so A streams contiguously

This was state of the art for CPU GEMM through the 2000s. Concept survives; everything modern is cache blocking with extra steps.

Era 3: Tensor cores on Volta–Ampere

Volta (2017) introduced tensor cores — a single instruction that computes a 4×4×4 fp16 matmul-accumulate. By Ampere (A100, 2020), the instruction was 16×16×16, and the throughput was an order of magnitude over fp32 ALUs.

Every modern GEMM kernel structure became:

for k_tile in K:
    cooperatively load A_tile, B_tile into SMEM
    __syncthreads()
    for warp_k in K_tile:
        load A_frag, B_frag into registers via ldmatrix
        mma.sync(C_frag, A_frag, B_frag, C_frag)   // tensor core instruction
    __syncthreads()
write C tile back to HBM

CUTLASS’s threadblock / warp / instruction tile hierarchy was built around this. Each tile size is tuned for the SM architecture and the dtype.
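
This is also roughly the shape of a hand-written Triton matmul. The sketch below follows the standard tutorial pattern under simplifying assumptions (fp16 inputs, M, N, K divisible by the block sizes, no autotuning or masking); the tl.dot in the inner loop is what the compiler lowers to tensor-core mma (and, on recent Triton releases targeting Hopper, WGMMA):

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # One program instance ("CTA") owns one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):     # march along K, one BLOCK_K slice at a time
        a = tl.load(a_ptrs)                     # tile of A (staged through SMEM by the compiler)
        b = tl.load(b_ptrs)                     # tile of B
        acc += tl.dot(a, b)                     # the part that becomes tensor-core mma
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))

def matmul(a, b):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c

Production versions add autotuned block sizes, grouped program ordering for L2 reuse, and boundary masks on top of this skeleton.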

Era 4: Hopper & Blackwell — async + warp specialization

Hopper (H100, 2022) added two big things: TMA (the Tensor Memory Accelerator) and WGMMA (warpgroup matmul), plus thread block clusters (CGAs) for cluster-level cooperation. Blackwell (B200, 2025) added 5th-gen tensor cores and native NVFP4.

The new structural pattern is producer/consumer warps:

warp_group_0 (producer): TMA loads — `cp.async.bulk` from HBM into SMEM, signalling mbarriers
warp_group_1 (consumer): WGMMA — reads SMEM, accumulates into the register file
mbarrier signals when each tile is ready; double-buffered SMEM keeps both groups busy

The kernel is essentially a software pipeline: tile k+1 is being TMA-loaded while tile k is being computed. This hides HBM latency completely.
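
A toy latency model shows why this matters (the per-tile times below are made up; the structure of the formula is the point):

# Era-3 style: load tile, sync, compute tile; the latencies add up serially.
# Era-4 style: double-buffered SMEM lets the TMA load of tile k+1 run under the
# WGMMA of tile k, so only the slower stage is paid per tile (plus one initial fill).
NUM_K_TILES = 64
LOAD_US, MMA_US = 3.0, 3.5            # hypothetical per-tile latencies

serial = NUM_K_TILES * (LOAD_US + MMA_US)
pipelined = LOAD_US + NUM_K_TILES * max(LOAD_US, MMA_US)

print(f"serial:    {serial:.0f} us")      # 416 us
print(f"pipelined: {pipelined:.0f} us")   # 227 us, the loads cost (almost) nothing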

CUTLASS 4 expresses this with the CuTe layout language. A modern Hopper GEMM kernel is ~300 lines of C++ that orchestrates this dance.

Real numbers — Llama-3.1 70B inference, batch 1, decode

Decode throughput on a single H100, BF16 weights unless noted:

Implementation | tok/s (decode) | % of HBM-BW peak | Notes
Naive PyTorch | ~5 | 4% | Reference only
torch.matmul (cuBLAS) | ~38 | 30% | Default in eager PyTorch
torch.compile (Inductor → CUTLASS) | ~52 | 41% | + fusion of dequant + GEMM
vLLM + Marlin INT4 kernel | ~88 | 70% | Memory-bound regime; weight quant pays off
TensorRT-LLM FP8 | ~110 | – | Compute-bound past this point

Decode is memory-bound. Each generated token re-reads the entire model from HBM. Faster GEMM doesn’t help past ~70% HBM-bw — what helps is reading less (quantization, KV-cache compression).

Prefill is compute-bound. Long prompts process many tokens in parallel, M and K both large, and the kernel hits 80%+ of fp16 peak. This is where tensor-core throughput really matters.

This split is the central reason production stacks treat prefill and decode as fundamentally different jobs (see Disaggregated Serving).

Run it in your browser — predict GEMM throughput

Roofline-style estimate: plug in shape + dtype + hardware, get an upper bound for tok/s.
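
A minimal standalone version of that estimate (peak numbers are approximate H100 SXM datasheet values; swap in your own hardware):

# Roofline upper bound for one GEMM: time >= max(flops / peak_flops, bytes / peak_bw).
def gemm_roofline(M, N, K, bytes_per_el=2.0, peak_tflops=990.0, peak_tbs=3.35):
    """Defaults are rough H100 SXM numbers: ~990 dense BF16 TFLOP/s, ~3.35 TB/s HBM3."""
    flops = 2 * M * N * K
    bytes_moved = bytes_per_el * (M * K + K * N + M * N)   # assumes ideal reuse of every tile
    ai = flops / bytes_moved                               # arithmetic intensity, flop/byte
    t_compute = flops / (peak_tflops * 1e12)
    t_memory = bytes_moved / (peak_tbs * 1e12)
    bound = "compute-bound" if t_compute > t_memory else "memory-bound"
    return ai, max(t_compute, t_memory), bound

print(gemm_roofline(M=3072, N=3072, K=3072))   # training-style shape: AI ~1024, compute-bound
print(gemm_roofline(M=1, N=8192, K=8192))      # batch-1 decode GEMV: AI ~1, memory-bound

# Crude decode tok/s ceiling: every generated token re-reads all the weights once.
weight_bytes = 70e9 * 0.5                      # e.g. a 70B model quantized to INT4 (~0.5 byte/param)
print(3.35e12 / weight_bytes, "tok/s upper bound from bandwidth alone")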

You’ll see arithmetic intensity ~1024 for big training shapes (compute-bound, GEMM thrives) and ~256 for inference attention (still compute-bound), but AI < 16 for batch-1 decode — pure memory-bound, no kernel can save you.

Quick check

Your LLM inference is at batch 1 generating 50 tok/s on an H100. The GEMM kernel achieves 70% of HBM bandwidth peak. What's the most likely path to 100 tok/s?

Key takeaways

  1. GEMM is the workload. 80%+ of training and inference compute. Fast GEMM = cheaper AI.
  2. Three levels of tiling — block (SMEM), warp (registers), tensor-core instruction. Modern kernels are software pipelines across these.
  3. Hopper / Blackwell broke the old kernel template. TMA + WGMMA + warp specialization replaced the __syncthreads() + mma.sync pattern. CUTLASS 4 / CuTe is the way to write it.
  4. Prefill ≠ decode. Prefill is compute-bound, GEMM peak matters. Decode is memory-bound, weight quantization rules.
  5. Don’t write GEMM from scratch in production. Use cuBLAS, CUTLASS, or Triton autotune. Read CUTLASS source — it’s the best free systems-coding reference on Earth.

Go deeper
