
Roofline & Profiling

The bedrock-meets-tooling layer. Every kernel optimization starts with the same two questions: which resource is this kernel bound by, and how do I verify that with the profiler? This module makes both answers reflexive: the roofline model is the predictive lens, NVIDIA Nsight Compute (NCU) is the verification surface, and Tensor Core shape constraints are the silent floor underneath both.

The discipline is predict, then verify. Given a (shape, dtype, hardware) triple, you should be able to predict two things before you ever open a profiler: the regime (HBM-bandwidth-bound, compute-bound on Tensor Cores, SMEM-thrashing, or kernel-launch-overhead-bound) and a rough percentage of peak. NCU then confirms or refutes the prediction, and every gap between the two is a lesson. Most engineers skip the prediction step; that's the gap between intermediate and senior inference roles.
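As a rough illustration of the prediction step, here is a minimal roofline sketch for a single GEMM. It covers only the two classic regimes (memory-bound vs compute-bound), so SMEM and launch-overhead effects are out of scope. The names `Hardware` and `predict_gemm` are hypothetical, the peak-throughput and bandwidth figures are round placeholders rather than the spec of any real GPU, and the byte count assumes each matrix crosses HBM exactly once, which is the optimistic case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Placeholder hardware description (assumed numbers, not a real GPU spec)."""
    peak_flops: float     # peak Tensor Core throughput, FLOP/s
    mem_bandwidth: float  # peak HBM bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) at which the two roofs meet.
        return self.peak_flops / self.mem_bandwidth


def predict_gemm(m: int, n: int, k: int, bytes_per_elem: int, hw: Hardware) -> dict:
    """Roofline prediction for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    # Optimistic traffic model: read A and B once, write C once.
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    intensity = flops / bytes_moved
    # Attainable throughput is capped by whichever roof is lower.
    attainable = min(hw.peak_flops, intensity * hw.mem_bandwidth)
    regime = "compute-bound" if intensity >= hw.ridge_point else "memory-bound"
    return {
        "intensity": intensity,
        "regime": regime,
        "attainable_flops": attainable,
        "pct_of_peak": 100.0 * attainable / hw.peak_flops,
    }


# Assumed placeholder hardware: 300 TFLOP/s peak, 2 TB/s HBM bandwidth.
hw = Hardware(peak_flops=300e12, mem_bandwidth=2e12)

decode = predict_gemm(m=1, n=4096, k=4096, bytes_per_elem=2, hw=hw)      # batch-1, 16-bit
prefill = predict_gemm(m=4096, n=4096, k=4096, bytes_per_elem=2, hw=hw)  # large square GEMM

for name, result in (("decode", decode), ("prefill", prefill)):
    print(f"{name}: {result['regime']}, "
          f"intensity={result['intensity']:.1f} FLOP/byte, "
          f"predicted {result['pct_of_peak']:.1f}% of peak")
```

The predicted percentage of peak is an upper bound under these assumptions. The profiler's measured throughput is what to compare it against, and a large shortfall is the signal to investigate.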