Roofline as a Predictive Tool
In a managed-runtime language, you write code and the JIT decides what runs on the CPU; you rarely have to reason about what’s fast and why. Writing GPU kernels flips that contract: every kernel sits somewhere on a two-line graph called the roofline, and where it sits is mostly decided before you write a single instruction — by the shape of the work, the dtype, and the hardware. The job of the roofline is not to make a pretty plot. The job is to let you predict, in your head, where a kernel will bind, before you ever open a profiler.
That predictive use is where most engineers stop and where senior inference engineers start. Anyone can quote “compute-bound vs memory-bound.” Few can look at a (M, N, K) GEMM and say “AI is 295 — I expect 78% of HBM peak, ridge point will saturate at batch 32, decode at batch 1 will hit 70% of the 3.35 TB/s ceiling.” That’s the skill. This lesson is how to build it.
TL;DR
- The roofline is a two-line graph: peak compute (FLOPs/s) and peak memory bandwidth (bytes/s). A kernel’s arithmetic intensity (AI = flops / bytes moved) places it on the x-axis. Where the kernel meets the lower of the two ceilings is its theoretical max.
- The ridge point is where the two lines cross: AI = peak_FLOPs / peak_bandwidth. On H100 fp16: AI ≈ 295 FLOPs/byte. Below it, you’re HBM-bound. Above it, you’re compute-bound.
- A useful roofline has a third regime: kernel-launch + dispatcher overhead. Below ~8 µs, the host-side cost dominates and the GPU is starved. Common in batch-1 decode and tiny attention.
- The discipline: predict regime + % of peak BEFORE profiling. Then NCU verifies. Where prediction is wrong, you learn something specific (tile size, occupancy floor, SMEM thrash, scheduler stall).
- Production kernels land at 65–85% of the predicted ceiling, not 100%. Anything below 50% means a bug; anything claiming above 90% should be re-measured.
The concept, in plain English
Hardware has two bottlenecks: how fast the chip can compute, and how fast it can move data from HBM. Every kernel is doing some ratio of compute to data movement. If the ratio is high (lots of arithmetic per byte loaded), compute is your ceiling. If the ratio is low (you’re shuffling lots of memory for little arithmetic), bandwidth is your ceiling. The roofline graph is just the two ceilings drawn on the same axes, with kernels plotted by their AI.
What makes the roofline a tool — not a plot — is that you can compute AI from the kernel’s mathematical definition, without running anything. A GEMM of shape (M, N, K) does 2·M·N·K flops and moves (M·K + K·N + M·N) · dtype_bytes of memory. The ratio is the AI, and the ratio tells you which ceiling matters. The whole game is then: did I land within striking distance of that ceiling, and if not, why?
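As a minimal sketch of that calculation, the AI of a GEMM can be computed straight from its shape. The function name `gemm_arithmetic_intensity` is illustrative (not from any library), and the traffic model assumes each operand is read from HBM exactly once and the output is written once:

```python
def gemm_arithmetic_intensity(M, N, K, dtype_bytes=2):
    """Return (flops, bytes_moved, AI) for an (M, N, K) GEMM.

    Assumes ideal traffic: A (M x K) and B (K x N) are each read once
    and C (M x N) is written once, all at dtype_bytes per element.
    """
    flops = 2 * M * N * K  # one multiply + one add per inner-product term
    bytes_moved = (M * K + K * N + M * N) * dtype_bytes
    return flops, bytes_moved, flops / bytes_moved


flops, bytes_moved, ai = gemm_arithmetic_intensity(4096, 4096, 128)
print(f"{flops:.3e} flops, {bytes_moved:.3e} bytes, AI = {ai:.1f} flops/byte")
```

Real kernels re-read tiles and may hit L2, so measured traffic can differ from this ideal count in either direction; the point of the estimate is to pick the ceiling, not to predict the exact byte count.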
Mental model — the predict-then-verify loop
The loop is: predict → measure → reconcile. Every kernel optimization session starts at the left and walks to the right. The interview-grade skill is doing the left half in your head.
H100 numbers you should memorize
These are the constants behind every prediction. Marketing peaks (the doubled-for-sparsity numbers) are not what you use — you use sustained, dense, achievable peaks.
| Resource | H100 SXM (achievable) | Where you read it from |
|---|---|---|
| HBM3 bandwidth | 3.35 TB/s | Whitepaper; NCU dram__bytes.sum.per_second |
| L2 cache | 50 MB | Whitepaper |
| SMEM per SM | 228 KB | Whitepaper |
| Registers per SM | 65,536 | Whitepaper |
| SMs per chip | 132 (SXM) | cudaGetDeviceProperties (multiProcessorCount) |
| Peak fp16/bf16 TC | 989 TFLOPs/s | Whitepaper (dense, no sparsity) |
| Peak fp8 TC | 1979 TFLOPs/s | Whitepaper (dense) |
| Peak fp32 (no TC) | 67 TFLOPs/s | Whitepaper |
| Ridge point fp16 | AI ≈ 295 flops/byte | 989e12 / 3.35e12 |
| Ridge point fp8 | AI ≈ 590 flops/byte | 1979e12 / 3.35e12 |
The ridge point is the punchline. An fp16 kernel needs at least 295 FLOPs of compute per byte of HBM traffic to be compute-bound. Below that, no kernel optimization on Earth will push you past the HBM ceiling — you have to read less (quantize, fuse, use prefix cache), not compute faster.
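The ridge-point arithmetic in the table can be sketched in a few lines. The dictionary and function names below are illustrative; the constants are the dense H100 SXM peaks listed above:

```python
# Dense peaks for H100 SXM, taken from the table above.
PEAK_TFLOPS = {"fp16": 989.0, "fp8": 1979.0}
HBM_TBPS = 3.35


def ridge_point(dtype):
    """AI (flops/byte) at which the compute and bandwidth ceilings cross."""
    return PEAK_TFLOPS[dtype] * 1e12 / (HBM_TBPS * 1e12)


def attainable_tflops(ai, dtype="fp16"):
    """Roofline ceiling at a given AI: the lower of the two lines."""
    bandwidth_limited = ai * HBM_TBPS  # (flops/byte) * (TB/s) = TFLOPs/s
    return min(PEAK_TFLOPS[dtype], bandwidth_limited)


for dtype in PEAK_TFLOPS:
    print(dtype, f"ridge ≈ {ridge_point(dtype):.1f} flops/byte")
print(attainable_tflops(1.0))     # decode-like AI: bandwidth ceiling applies
print(attainable_tflops(3000.0))  # training-like AI: compute ceiling applies
```

At AI = 1 the attainable rate is only a few TFLOPs/s regardless of the tensor-core peak, which is the quantitative version of "you have to read less, not compute faster."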
Worked predictions — three shapes, three regimes
Take three real shapes and predict each one before measuring. This is the actual exercise.
Shape 1: GEMM at training scale (compute-bound)
Llama FFN forward pass during pretraining: M = 8192, N = 28672, K = 8192, fp16 (2 bytes/element, fp32 accumulator).
flops = 2 × 8192 × 28672 × 8192 ≈ 3.85 × 10¹² flops
bytes = (M·K + K·N + M·N) × 2
= (67M + 235M + 235M) × 2 ≈ 1.07 × 10⁹ bytes
AI = flops / bytes ≈ 3,584 flops/byte
AI is about 12× the ridge point → comfortably compute-bound. Predicted: hit the fp16 TC peak. Production kernels (cuBLAS, CUTLASS at this shape) land at 80–85% of 989 TFLOPs/s, so expect ~800 TFLOPs/s sustained. Time = 3.85e12 / 8e14 ≈ 4.8 ms.
Shape 2: Attention QK at moderate context (HBM-bound, near the ridge)
Attention QKᵀ at 4K seq, head_dim=128, 32 heads, batch 1: M = 4096, N = 4096, K = 128, fp16.
flops = 2 × 4096 × 4096 × 128 ≈ 4.3 × 10⁹ flops
bytes = (4096·128 + 128·4096 + 4096·4096) × 2 ≈ 35.6 × 10⁶ bytes
AI ≈ 120 flops/byte
AI is about 40% of the ridge point → memory-bound, but close to the corner. Predicted: hit ~50–60% of the bandwidth ceiling, since the kernel is small and SMEM tiling can mostly hide HBM latency. Time = 35.6e6 / (3.35e12 × 0.55) ≈ 19 µs. FlashAttention’s whole pitch is fusing the softmax to remove one of these traffic legs.
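To see why removing a traffic leg matters, the sketch below recomputes the QKᵀ arithmetic intensity under the assumption that the score matrix never leaves the chip, as in a fused attention kernel. It illustrates the byte accounting only and is not a model of FlashAttention's actual tiling; `qk_ai` is an illustrative name:

```python
def qk_ai(seq_q, seq_k, head_dim, dtype_bytes=2, write_scores=True):
    """AI of one head's Q @ K^T, with or without writing scores to HBM."""
    flops = 2 * seq_q * seq_k * head_dim
    elems = seq_q * head_dim + seq_k * head_dim  # read Q and K once
    if write_scores:
        elems += seq_q * seq_k  # materialize the (seq_q x seq_k) scores
    return flops / (elems * dtype_bytes)


unfused = qk_ai(4096, 4096, 128, write_scores=True)
fused = qk_ai(4096, 4096, 128, write_scores=False)
print(f"scores written to HBM: AI = {unfused:.0f} flops/byte")
print(f"scores kept on-chip:   AI = {fused:.0f} flops/byte")
```

Under these assumptions the score matrix accounts for most of the bytes, so keeping it on-chip moves this step from below the fp16 ridge (~295) to well above it.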
Shape 3: Decode GEMV (HBM-bound, hard ceiling)
Llama 70B FFN during decode at batch 1: M = 1, N = 28672, K = 8192, fp16 weights.
flops = 2 × 1 × 28672 × 8192 ≈ 4.7 × 10⁸ flops
bytes ≈ K·N × 2 (weight matrix dominates) ≈ 470 × 10⁶ bytes
AI ≈ 1.0 flop/byte
AI is 0.3% of the ridge point → pure HBM-bound. The compute peak is irrelevant. Predicted time at the ceiling = 470e6 / 3.35e12 ≈ 140 µs at 100% of HBM peak; well-tuned decode kernels reach ~70% of HBM (the figure Marlin achieves with INT4 weights), so expect ~200 µs per call, or ~5,000 calls/s at the kernel level. The model runs several matmuls of this kind in every layer, for every token.
This is the regime where INT4 weight quantization is a 4× speedup almost mechanically: the flops are unchanged, so AI rises only to ~4, still far below the ridge, while bytes moved drop 4×. The roofline doesn’t care that you have a fancy kernel; it cares about the bytes.
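A rough sketch of that byte accounting for the decode GEMV above. It counts weight bytes only and ignores quantization scales, zero-points, and activation traffic, all of which add a small amount in practice; `gemv_time_us` is an illustrative name:

```python
HBM_BYTES_PER_S = 3.35e12  # H100 SXM HBM3 peak


def gemv_time_us(N, K, weight_bits, hbm_eff=1.0):
    """Lower-bound time for a batch-1 GEMV, counting weight bytes only."""
    weight_bytes = N * K * weight_bits / 8
    return weight_bytes / (HBM_BYTES_PER_S * hbm_eff) * 1e6


fp16_us = gemv_time_us(28672, 8192, weight_bits=16)
int4_us = gemv_time_us(28672, 8192, weight_bits=4)
print(f"fp16: {fp16_us:.1f} us, int4: {int4_us:.1f} us, "
      f"speedup {fp16_us / int4_us:.1f}x")
```

Passing `hbm_eff=0.7` gives the more realistic per-call times; the ratio between the two dtypes is unchanged because only the byte count differs.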
The third regime — overhead-bound
The textbook roofline has two regimes. Real GPU code has three. When the kernel is small enough that host-side launch + dispatcher cost exceeds the kernel’s actual execution time, the GPU is starved waiting for the next launch. The kernel runs well below either ceiling because it spends most of its wall-clock time idle.
Symptoms: achieved bandwidth and achieved TC% are both low. NCU’s sm__cycles_active is far below sm__cycles_elapsed. The fix is not kernel optimization — it’s CUDA Graphs, kernel fusion, larger batches, or persistent kernels. Spec decoding’s verifier kernel is the canonical case: a 32-token verify is too small to amortize launch.
Rough cutoff on H100: kernels under ~8 µs of GPU work spend more time being launched than being run. Plan around it.
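One simple way to reason about this cutoff is to treat the launch path as a pipeline stage. The sketch below assumes asynchronous launches issued back-to-back, so each step takes whichever is longer: issuing the launch or running the kernel. The 8 µs default is the rough cutoff quoted above, not a measured constant, and `gpu_busy_fraction` is an illustrative name:

```python
def gpu_busy_fraction(kernel_us, launch_us=8.0):
    """Fraction of wall-clock time the GPU is busy in a stream of
    back-to-back launches of the same kernel."""
    # With asynchronous launches, each step takes whichever is longer:
    # issuing the next launch on the host or running the kernel on the GPU.
    step_us = max(kernel_us, launch_us)
    return kernel_us / step_us


for kernel_us in (2.0, 8.0, 200.0):
    frac = gpu_busy_fraction(kernel_us)
    print(f"{kernel_us:6.1f} us kernel -> GPU busy {frac:.0%} of the time")
```

In this model, CUDA Graphs and fusion help by shrinking the effective `launch_us` per unit of work, while larger batches help by growing `kernel_us`.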
Concrete walkthrough — Triton autotuned matmul prediction
Take the Triton matmul tutorial and predict its perf at three shapes before running. This is the workflow Capstone 1 requires you to internalize.

    # H100 SXM, fp16 inputs, fp32 accumulator
    def predict_matmul(M, N, K, dtype_bytes=2,
                       peak_tflops=989, hbm_tbps=3.35,
                       realistic_eff=0.78):
        """Return (AI, bound, predicted time in µs) for an (M, N, K) GEMM."""
        flops = 2 * M * N * K
        # Ideal traffic: read A and B once, write C once.
        bytes_traffic = (M*K + K*N + M*N) * dtype_bytes
        ai = flops / bytes_traffic
        ridge = peak_tflops * 1e12 / (hbm_tbps * 1e12)  # ≈ 295
        bound = "compute" if ai > ridge else "HBM"
        if bound == "compute":
            sustained_tflops = peak_tflops * realistic_eff
            time_us = flops / (sustained_tflops * 1e12) * 1e6
        else:
            sustained_bw = hbm_tbps * 1e12 * realistic_eff
            time_us = bytes_traffic / sustained_bw * 1e6
        return ai, bound, round(time_us, 1)

    print(predict_matmul(8192, 28672, 8192))  # AI ≈ 3584, 'compute', ≈ 4989 µs
    print(predict_matmul(4096, 4096, 128))    # AI ≈ 120, 'HBM', ≈ 13.6 µs
    print(predict_matmul(1, 28672, 8192))     # AI ≈ 1.0, 'HBM', ≈ 180 µs

Then run the actual kernel and compare. If your kernel hits 78% of the predicted ceiling, you’re in the band where production kernels live and the win is pursuing the next bottleneck. If it hits 30%, the kernel is broken (wrong tile shape, wrong dtype path, missing TC, occupancy floor), and NCU will tell you which.
Run it in your browser — predict, then read the gap
What you should see: the training FFN has AI ≈ 3584, comfortably compute-bound. Attention QK is HBM-bound at 4K seq (FlashAttention exists for a reason). The decode GEMV sits roughly 300× below the ridge, so quantization is the only lever. The tiny softmax slice trips the overhead floor; fuse it into the kernel before, not after.
Key takeaways
- The roofline is a predictive tool, not a plot. The interview-grade skill is computing AI from a workload definition and predicting regime + % of peak before any code runs.
- Memorize H100 numbers. Ridge point fp16 ≈ 295 flops/byte. HBM3 ≈ 3.35 TB/s. Peak TC fp16 ≈ 989 TFLOPs/s. Without these, you can’t predict; with them, predictions take 30 seconds.
- Three regimes, not two. Compute-bound, HBM-bound, overhead-bound. Sub-8-µs kernels live in the third regime; the fix is fusion or graphs, not kernel optimization.
- Production lands at 65–85% of the predicted ceiling. Below 50% means a bug. Above 90% means you re-measure (or the kernel is hitting cache, not HBM).
- The discipline: seal predictions before profiling. Capstone 1 commits predictions.md with a timestamp before opening NCU. Every measured value gets reconciled against its prediction; every gap is a learnable fact.
Go deeper
- Blog: Making Deep Learning Go Brrrr From First Principles. The canonical essay on the three regimes (compute / memory / overhead). Re-read once a year.
- Paper: Roofline: An Insightful Visual Performance Model. The original roofline paper. Pre-dates GPU dominance but the math is identical.
- Docs: NCU, Roofline Charts. How NCU plots achieved AI and bandwidth on the roofline directly. The verification half of predict-then-verify.
- Paper: Hopper Architecture Whitepaper. The source of every H100 number you should memorize. Read sections on HBM3, TC peak, and SM resources.
- Blog: Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data. Real-world example of where the simple roofline misses: clock throttling under sustained load.
- Paper: Outperforming cuBLAS on H100, Worked Example with FP16. A real kernel that lands at 80%+ of TC peak with the gap explained metric-by-metric. The verification half done well.
- Video: Tri Dao, Hardware/Software Co-Design for AI. How FlashAttention's author thinks about the roofline when designing kernels.
Why this matters
Every kernel optimization decision starts with one question: which resource is this kernel actually waiting on? Compute, HBM, SMEM, registers, scheduler, host launch — pick wrong and the entire optimization session is wasted. The roofline answers that question before any code runs, by reducing the workload to a single number (arithmetic intensity) and comparing it to a hardware constant (ridge point).
The reason this matters specifically in 2026 is that LLM inference is dominated by HBM-bound regimes. Decode at batch 1 has AI near 1.0; attention at long context is HBM-bound until FlashAttention fuses the softmax leg. If you cannot derive that from a roofline argument in 30 seconds, every PR you propose to vLLM or SGLang will be reviewed by someone who can.
H100 numbers — the achievable peaks
| Resource | H100 SXM (achievable) | NCU metric |
|---|---|---|
| HBM3 bandwidth | 3.35 TB/s | dram__bytes.sum.per_second |
| L2 cache | 50 MB | lts__t_bytes.sum.per_second |
| SMEM per SM | 228 KB | smsp__inst_executed_pipe_lsu.avg.per_cycle |
| Registers per SM | 65,536 | launch__registers_per_thread |
| SMs per chip | 132 (SXM) | cudaGetDeviceProperties (multiProcessorCount) |
| Peak fp16/bf16 TC | 989 TFLOPs/s | sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_active |
| Peak fp8 TC | 1979 TFLOPs/s | sm__pipe_tensor_op_imma_cycles_active.avg.pct_of_peak_sustained_active |
| Ridge point fp16 | AI ≈ 295 flops/byte | derived |
| Ridge point fp8 | AI ≈ 590 flops/byte | derived |
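Marketing peaks (fp16: 1979 TFLOPs/s) include 2:4 sparsity. Production kernels are dense, so use 989. The PCIe variant’s compute peaks are roughly 25% lower, and its memory bandwidth is lower still (about 2 TB/s), so recompute the ridge point for it.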
Three predictive regimes
Compute-bound: AI ≫ ridge
Training-scale GEMM (M, N, K all large) lands here. For fp16, AI for an (M, N, K) GEMM is (M·N·K) / (M·K + K·N + M·N), which ranges from min(M, N, K) / 3 when all three dimensions are equal up to nearly min(M, N, K) when one dimension is much smaller than the other two. For Llama FFN at training: AI ≈ 3584. Predicted execution time ≈ flops / (peak × η) where η ∈ [0.65, 0.85] for production kernels.
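For fp16 operands (2 bytes per element), the size of this ratio follows directly from the definition. Writing m = min(M, N, K), and assuming the ideal traffic model of one read per operand and one write of the output:

```latex
\mathrm{AI}
  = \frac{2MNK}{2\,(MK + KN + MN)}
  = \frac{1}{\tfrac{1}{M} + \tfrac{1}{N} + \tfrac{1}{K}},
\qquad
\frac{m}{3} \;\le\; \mathrm{AI} \;<\; m .
```

The lower bound is reached when M = N = K (an 8192-cubed GEMM gives AI ≈ 2731), and AI approaches m when one dimension is much smaller than the other two (the 4096 × 4096 × 128 attention shape gives AI ≈ 120).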
HBM-bound: AI < ridge
Decode at batch 1 dominates here. The matrix-vector product reads the full weight matrix once per token; AI ≈ 1.0. Predicted execution time ≈ bytes / (HBM × η). The compute ceiling is irrelevant — quantization (INT4, FP8) cuts bytes proportionally.
Overhead-bound: kernel < ~8 µs
Below ~8 µs of GPU-side work, host launch + dispatcher dominate. Symptoms: NCU shows both dram__bytes.sum.per_second and the sm__pipe_tensor_op_* metrics low, with sm__cycles_active.avg.pct_of_peak_sustained_elapsed below 50%. Fix with CUDA Graphs, op fusion, larger batches, or persistent kernels.
Computing AI for the operations that matter
| Op | AI formula (large dims) | Typical regime |
|---|---|---|
| GEMM (M, N, K) | between min(M, N, K) / 3 and min(M, N, K) | compute when AI > ridge (≈ 295 for fp16), else HBM |
| GEMV (1, N, K) | ≈ 1 | HBM-bound (decode) |
| Elementwise unary | ≈ 0.25 (fp16: 1 flop per 4 bytes) | HBM-bound |
| Elementwise binary | ≈ 0.17 (fp16: 1 flop per 6 bytes) | HBM-bound |
| Softmax (per row) | ≈ 1 | HBM-bound (often fused) |
| RMSNorm (per row) | ≈ 1 | HBM-bound (fuse into next matmul) |
| Attention QK + softmax + V (unfused) | ≈ 4–8 | HBM-bound |
| FlashAttention (fused) | ≈ d_head / 2 | compute when d_head ≥ 128 |
The “fuse into next matmul” pattern is the workshop tool: an HBM-bound elementwise op fused into a downstream compute-bound matmul costs nothing on the matmul side and removes the elementwise’s HBM round-trip entirely. This is exactly what the Capstone 1 fused RMSNorm+Matmul does.
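The saving from this pattern can be estimated with simple byte counting. The sketch below is a back-of-envelope model, not the Capstone 1 implementation: it assumes ideal traffic (each tensor read or written once), ignores the small RMSNorm gain vector, and uses the illustrative name `rmsnorm_matmul_bytes`:

```python
def rmsnorm_matmul_bytes(tokens, hidden, out_features,
                         dtype_bytes=2, fused=False):
    """HBM bytes for RMSNorm followed by a matmul, under ideal traffic."""
    x = tokens * hidden        # activation elements
    w = hidden * out_features  # weight elements
    y = tokens * out_features  # output elements
    matmul_elems = x + w + y   # read activations, read W, write y
    # Unfused: RMSNorm reads x and writes the normalized copy back to HBM.
    norm_elems = 0 if fused else 2 * x
    return (matmul_elems + norm_elems) * dtype_bytes


unfused = rmsnorm_matmul_bytes(8192, 8192, 28672, fused=False)
fused = rmsnorm_matmul_bytes(8192, 8192, 28672, fused=True)
saved = unfused - fused
print(f"unfused: {unfused / 1e9:.2f} GB, fused: {fused / 1e9:.2f} GB, "
      f"saved: {saved / 1e6:.0f} MB")
```

The bytes saved are exactly the standalone norm's read-plus-write of the activations, which is time the HBM-bound norm kernel would otherwise spend at the bandwidth ceiling.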
Worked predictions — three shapes

    def predict(M, N, K, dtype_bytes=2,
                peak_tflops=989, hbm_tbps=3.35,
                realistic_eff=0.78):
        flops = 2 * M * N * K
        bytes_traffic = (M*K + K*N + M*N) * dtype_bytes
        ai = flops / bytes_traffic
        ridge = peak_tflops / hbm_tbps  # TFLOPs/s over TB/s, ≈ 295 flops/byte
        if ai > ridge:
            time_us = flops / (peak_tflops * realistic_eff * 1e12) * 1e6
            regime = "compute"
        else:
            time_us = bytes_traffic / (hbm_tbps * realistic_eff * 1e12) * 1e6
            regime = "HBM"
        # Kernels this short are dominated by launch cost, whatever their AI.
        if time_us < 8:
            regime = "overhead"
        return ai, regime, time_us

    predict(8192, 28672, 8192)  # AI ≈ 3584, compute, ≈ 4989 µs
    predict(4096, 4096, 128)    # AI ≈ 120, HBM, ≈ 13.6 µs
    predict(1, 28672, 8192)     # AI ≈ 1.0, HBM, ≈ 180 µs

Concrete walkthrough: verifying with NCU
After predicting, the verification step in NCU has a fixed shape:
- Launch with ncu --set full --kernel-name my_kernel python bench.py. The --set full option captures everything; trim later with --metrics for production sweeps.
- Read the regime confirmation first. For a predicted HBM-bound kernel, look at dram__bytes.sum.per_second and convert to a fraction of HBM peak. For compute-bound, look at sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_active.
- Compute achieved fraction-of-prediction. If predicted 78% of HBM and achieved 71%, the gap is 7 percentage points, explainable by SMEM bank conflicts, occupancy floor below 100%, or partial L2 reuse.
- For overhead-bound, look at sm__cycles_active.avg.pct_of_peak_sustained_elapsed. Below 50% means the GPU is idle waiting for launches; the next move is fusion or CUDA Graphs, not kernel work.
The output of this loop is not “the kernel is fast.” It’s “the kernel is fast because X, and the gap to peak is because Y.” Y is the next kernel session.
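The reconciliation step for an HBM-bound kernel can be sketched as below. The measured byte count and duration in the example are hypothetical placeholders for values read out of NCU, and the thresholds are the ones quoted in this lesson (below 50% suggests a bug, above 90% suggests re-measuring); `reconcile` is an illustrative name:

```python
def reconcile(measured_bytes, measured_us, predicted_frac=0.78, hbm_tbps=3.35):
    """Compare achieved HBM bandwidth against a sealed prediction."""
    achieved_bps = measured_bytes / (measured_us * 1e-6)
    achieved_frac = achieved_bps / (hbm_tbps * 1e12)
    gap_points = (predicted_frac - achieved_frac) * 100
    if achieved_frac < 0.50:
        verdict = "likely a bug"
    elif achieved_frac > 0.90:
        verdict = "re-measure (possibly hitting cache, not HBM)"
    else:
        verdict = "plausible; explain the remaining gap"
    return achieved_frac, gap_points, verdict


# Hypothetical measurement: the decode GEMV's bytes moved in 197.5 µs.
frac, gap, verdict = reconcile(measured_bytes=469_835_776, measured_us=197.5)
print(f"achieved {frac:.0%} of HBM peak, gap {gap:.1f} points: {verdict}")
```

The verdict string is only a prompt for the next step; the useful output is the gap in percentage points, which is what gets attributed to specific causes.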
Real numbers — production rooflines
Where production kernels land on H100 fp16:
| Kernel | AI | Regime | Achieved % of ceiling | Notes |
|---|---|---|---|---|
| cuBLAS GEMM (M=N=K=8192) | 2731 | compute | 78% TC | Default eager PyTorch |
| CUTLASS hand-tuned (same) | 2731 | compute | 84% TC | The 2024 Hopper paper |
| Triton autotuned matmul | 2731 | compute | 71% TC | Easier to write; smaller win |
| Marlin INT4 GEMV (decode) | 1.1 | HBM | 70% HBM | INT4 weights, fp16 acts |
| FlashAttention-3 (head=128) | 64 | mixed | 90% TC | Async pipeline + warp spec |
| Naive PyTorch attention | 4 | HBM | 22% HBM | Unfused softmax; the worst case |
The right reading: when an inference team interviews you, “I’d expect 78% of TC peak for cuBLAS-style GEMM and ~70% of HBM for INT4 decode” is a senior answer. “It depends” is not.
Key takeaways
- The roofline is a predictive tool, not a plot. Senior inference engineers compute AI in their head and predict regime + % of peak before profiling.
- H100 fp16 ridge point ≈ 295 flops/byte. Below: HBM-bound. Above: compute-bound. Below ~8 µs of work: overhead-bound.
- Production lands at 65–85% of the relevant ceiling. Use 78% as a default for “well-tuned” predictions; tighten with measurement.
- HBM-bound regimes don’t reward kernel work. They reward reading less — quantization, fusion, prefix cache.
- Predict-then-verify is the discipline. Seal predictions.md before opening NCU. Reconcile every gap; that reconciliation is the senior signal.