Profiling Tools
There’s a recurring failure mode in performance work. An engineer reads about cache misses and branch mispredicts and SIMD intrinsics and immediately starts rewriting the matmul they’ve convinced themselves is the bottleneck. A week later, the program is no faster. They profile — and discover the matmul was 8% of total time. The other 92% was a Python list comprehension three calls up the stack.
The lesson is older than computers: the bottleneck is never where you expect. Profiling is the discipline of replacing that surprise with measurement before you spend a week on the wrong fix.
This lesson is a tour of the tools — perf for CPU work, nsys for the whole-program GPU timeline, ncu for per-kernel deep dives, torch.profiler for the AI-shaped lens, and eBPF for production observability — and how to thread them together into a profile-fix-reprofile loop that actually shrinks wall-clock time.
TL;DR
- You can’t optimize what you can’t measure. Profiling is the discipline of turning “this is slow” into “this specific thing is using N% of cycles for reason X.”
- CPU profilers: `perf` (Linux), Instruments (macOS), VTune (Intel). Sample-based — interrupt periodically, record the call stack, aggregate.
- GPU profilers: NVIDIA Nsight Systems (`nsys`) for the whole-program timeline; Nsight Compute (`ncu`) for per-kernel deep dives. AMD has rocm-smi + rocprof.
- eBPF tools — bpftrace, BCC, Pixie — let you instrument the running kernel without modifying source. Modern Linux observability.
- PyTorch `torch.profiler` is the AI-specific tool: layer-level traces, kernel attribution, memory snapshots, integration with Chrome's Perfetto for visualization.
- Always profile before optimizing. "I thought it was the matmul; turned out it was a Python list comprehension" is the most common debugging story.
Why this matters
Performance work follows a simple rule: the bottleneck is never where you expect. Mosaic’s earlier lessons gave you mental models for cache misses, branch prediction, allocator overhead, comm patterns — but applying them blindly is rearranging deck chairs. The first 30% of any optimization session is profiling: capture a representative workload, sort hot paths, identify the costliest contributions, then apply the right knowledge. Engineers who skip the profile stage burn weeks on the wrong thing.
Mental model
Profile → diagnose → fix → re-profile. Skip the loop and you’ll optimize the wrong thing.
Concrete walkthrough
perf — the Linux CPU profiler
# Sample the running program for 10 seconds; capture call stacks
perf record -F 999 -g -p $(pidof my_program) -- sleep 10

# Show the hot functions
perf report

# Or generate a flame graph (FlameGraph project on GitHub)
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

perf interrupts the program ~1000 times per second, records the call stack at each interrupt, and reports which functions appear most. Free, low-overhead (~3% slowdown), the default Linux profiler.
The flame graph (Brendan Gregg’s invention) is the canonical visualization: x-axis = sample count (proportional to time), y-axis = call-stack depth. Wide stacks = hot. Read flame graphs once and you’ll never want a flat profile again.
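For reference, the folded format that `stackcollapse-perf.pl` emits and `flamegraph.pl` consumes is one line per unique stack, semicolon-joined, followed by a sample count. The function names and counts below are made up for illustration, not real perf output:

```python
# Hand-written folded stacks (hypothetical data, not real perf output).
# Format: frame;frame;frame <sample count>
folded = """\
main;load_batch;decode_jpeg 412
main;train_step;matmul 96
main;train_step;softmax 31
"""

# Width in the flame graph is proportional to the count: here decode_jpeg
# would dominate the picture even if matmul "feels" like the hot spot.
for line in folded.strip().splitlines():
    stack, count = line.rsplit(" ", 1)
    print(f"{int(count):4d}  {stack}")
```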
Hardware counters via perf
perf also exposes CPU performance counters:
# Cache misses, branch mispredicts, IPC
perf stat -e cache-misses,branch-misses,instructions,cycles ./my_program

# Output:
#    23,000,000  cache-misses    #  8.5% of all cache refs
#     1,200,000  branch-misses   #  1.5% of all branches
# 9,800,000,000  instructions    #  2.5 insn per cycle
# 3,900,000,000  cycles

Counters reveal why code is slow:
- High `cache-misses` rate → check data layout; see Cache Lines.
- High `branch-misses` rate → unpredictable conditionals; see Branch Prediction.
- IPC < 1.0 → likely memory-bound. IPC > 2.0 → compute-bound; work harder on FLOPs.
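If you want those ratios in a script rather than by eyeballing `perf stat` output, here is a hedged sketch. The CSV field layout of `perf stat -x,` (value, unit, event name, ...) can vary between perf versions, and `./my_program` is a placeholder for the binary under test:

```python
import subprocess

# Run perf stat in CSV mode (-x,) and derive IPC plus miss rates.
events = "cache-misses,cache-references,branch-misses,branches,instructions,cycles"
proc = subprocess.run(
    ["perf", "stat", "-x,", "-e", events, "./my_program"],
    capture_output=True, text=True,
)

counts = {}
for line in proc.stderr.splitlines():        # perf stat prints to stderr
    parts = line.split(",")
    # Field layout assumed: value, unit, event name, ...; skip "<not counted>" etc.
    if len(parts) >= 3 and parts[0].strip().isdigit():
        counts[parts[2]] = int(parts[0])

print(f"IPC:              {counts['instructions'] / counts['cycles']:.2f}")
print(f"cache-miss rate:  {counts['cache-misses'] / counts['cache-references']:.1%}")
print(f"branch-miss rate: {counts['branch-misses'] / counts['branches']:.1%}")
```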
Nsight Systems (nsys) — GPU timeline
nsys profile --trace=cuda,nvtx,osrt -o report.qdrep python train.py
nsys-ui report.qdrep   # GUI; or report.qdrep can be opened in a web viewer

nsys captures every CUDA API call, kernel launch, and CPU-side event into a timeline. Open it; see your training step laid out in time:
- Forward pass kernels in order, with names from cuBLAS / cuDNN / Triton.
- Backward pass, ditto.
- AllReduce / AllGather (NCCL) calls, with bandwidth.
- CPU-side Python overhead (dataloader, optimizer step Python code).
- Idle time between steps.
The single best way to see “where is time going” in a real training run. Production debugging always starts here.
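Since the capture above already traces nvtx, it is worth marking your training phases with NVTX ranges; they show up as labeled bars on the timeline. A minimal sketch using PyTorch's built-in NVTX bindings (`model`, `loss_fn`, `optimizer`, and `batch` are assumed names, standing in for whatever your script defines):

```python
import torch

def train_step(model, loss_fn, optimizer, batch):
    # Each push/pop pair becomes a named region on the nsys timeline.
    torch.cuda.nvtx.range_push("forward")
    loss = loss_fn(model(batch["x"]), batch["y"])
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()
```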
Nsight Compute (ncu) — per-kernel deep dive
When nsys shows kernel X is slow, ncu tells you why:
ncu --set full --kernel-name my_kernel python script.py

Output (per kernel invocation): tensor-core utilization, SMEM bank conflicts, occupancy, warp scheduler stalls, DRAM bandwidth, L1/L2 hit rates. Hundreds of metrics. The single tool every kernel author uses.
Key sections to read:
- `sm__pipe_tensor_op_cycles_active.avg.pct_of_peak_sustained_active` — tensor-core utilization. >70% = healthy.
- `l1tex__shared_st_bank_conflict.sum` and `_ld_bank_conflict.sum` — SMEM conflicts. >0 = fix.
- `smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct_of_peak_sustained_elapsed` — coalesced load efficiency.
- Occupancy breakdown — which resource (registers, SMEM, warp slots) is the limit.
torch.profiler — the AI-specific lens
from torch.profiler import profile, ProfilerActivity, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=2, warmup=2, active=10),
    on_trace_ready=lambda p: p.export_chrome_trace("trace.json"),
    record_shapes=True,
) as prof:
    for batch in loader:
        train_step(batch)
        prof.step()

The output trace.json opens in Chrome's chrome://tracing (or perfetto.dev) with PyTorch op names, shapes, kernel attribution. The right level of abstraction for ML perf work: aggregates above raw kernels, attributable to your model code.
Key views:
- Trace view: timeline of ops. Look for gaps (CPU bottleneck) or huge ops (kernel issue).
- Operator stats: aggregate time per torch op. The "self CUDA time" column is what to sort by (see the snippet after this list).
- Memory profile (`profile_memory=True`): allocations / frees over time. Spot fragmentation.
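If you want the operator stats without opening a trace viewer, the same `prof` object from the snippet above can print an aggregated table. A minimal sketch; note the sort key has been renamed across PyTorch versions, so treat `self_cuda_time_total` as the historical spelling:

```python
# Aggregate per-op stats from the profiling run above, sorted by the
# "self CUDA time" column (newer PyTorch may call it self_device_time_total).
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
```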
eBPF — observability without instrumentation
Modern Linux supports running BPF programs in the kernel that hook into syscalls, function entries, hardware counters. Tools:
- bpftrace: ad-hoc one-liners. `bpftrace -e 'tracepoint:syscalls:sys_enter_read { @[comm] = count(); }'` counts read() calls per process.
- BCC: Python wrapper for BPF. Many ready-to-use tools: `tcptracer`, `gpuperf`, etc. (a minimal example follows this list).
- Pixie / Parca / Polar Signals: production-grade always-on profilers built on eBPF.
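To make the BCC bullet concrete, here is a minimal sketch in the same spirit as the bpftrace one-liner: count read() calls per PID from Python. It assumes the bcc package is installed and root privileges; it is illustrative, not one of the shipped BCC tools:

```python
from bcc import BPF   # requires the bcc package and root privileges
import time

# Count sys_enter_read events per PID in a BPF hash map, inside the kernel.
prog = r"""
BPF_HASH(counts, u32, u64);

TRACEPOINT_PROBE(syscalls, sys_enter_read) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
time.sleep(5)   # let the probe collect data for a few seconds

# Read the map back in userspace and print the noisiest PIDs.
for pid, count in sorted(b["counts"].items(), key=lambda kv: -kv[1].value)[:10]:
    print(f"pid {pid.value:>7}  read() calls: {count.value}")
```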
For ML production, eBPF profilers (Parca, Polar Signals) let you see live CPU profiles of every process in a Kubernetes cluster with ~1% overhead. The 2024+ standard for “what is my fleet doing right now.”
A complete profiling session
For “training run is slow”:
- Wall-clock per step: add `start.record(); step(); end.record(); torch.cuda.synchronize(); print(start.elapsed_time(end))` (spelled out in the sketch below). Establish a baseline.
- `nsys profile`: capture 100 steps. Open the timeline.
- Identify gaps: CPU bottleneck? GPU idle? Skewed across nodes (DP imbalance)?
- If GPU compute is hot: identify the slowest kernel via `nsys`, then `ncu` it.
- If CPU is the bottleneck: `py-spy` on the Python process. Often the dataloader.
- If comm is the bottleneck: `NCCL_DEBUG=INFO`; check topology + transport.
- Fix the top thing, re-profile.
This loop is what every distributed-training engineer runs through monthly.
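Step 1 above, spelled out as a runnable sketch; `step()` is a placeholder for your forward/backward/optimizer code:

```python
import torch

def step():
    # placeholder for your actual forward/backward/optimizer step
    ...

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
step()
end.record()
torch.cuda.synchronize()   # wait for all queued GPU work to finish
print(f"step time: {start.elapsed_time(end):.1f} ms")
```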
Run it in your browser — toy profiler
This 30-line sampler is the conceptual core of perf. Real profilers add: kernel-side stack capture (no GIL), hardware counters, kernel symbols, multi-process aggregation. The algorithm is the same.
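A sketch in the same spirit (not the widget's exact code): a background thread grabs the main thread's Python stack about a thousand times per second and counts each unique stack, which is exactly the aggregation a flame graph draws. `helper` and `busy` are demo filler:

```python
import collections
import sys
import threading
import time

samples = collections.Counter()
stop = threading.Event()

def sampler(target_thread_id, interval=0.001):
    """Periodically grab the target thread's stack and count each unique one.

    Unlike perf, this runs inside the interpreter and needs the GIL to read
    frames; real profilers capture stacks from the kernel side.
    """
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            stack = []
            while frame:                       # walk leaf -> root
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            samples[";".join(reversed(stack))] += 1   # folded-stack key
        time.sleep(interval)

def helper(n):
    return sum(i * i for i in range(n))

def busy():
    total = 0
    for _ in range(200):
        total += helper(50_000)
    return total

main_id = threading.get_ident()
t = threading.Thread(target=sampler, args=(main_id,), daemon=True)
t.start()
busy()
stop.set()
t.join()

# Widest (most-sampled) stacks = hottest code paths, just like a flame graph.
for stack, count in samples.most_common(5):
    print(f"{count:5d}  {stack}")
```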
Key takeaways
- Profile before you optimize. It is always cheaper than guessing.
- `perf` for CPU, `nsys` for GPU timeline, `ncu` for per-kernel. Plus `torch.profiler` for ML-specific work.
- Hardware counters reveal cause: cache-misses, branch-misses, IPC, tensor-core utilization.
- eBPF is the modern Linux observability layer — tools like Parca and Polar Signals make always-on production profiling cheap.
- The loop is profile → fix top item → re-profile. Don’t skip the re-profile — fixes often surface a new bottleneck.
Go deeper
- Blog: Brendan Gregg — perf Examples. Canonical reference. Memorize the FlameGraph workflow.
- Docs: NVIDIA — Nsight Systems. The product docs; "Quick Start" and "Best Practices" are the right pages.
- Docs: NVIDIA — Nsight Compute. For per-kernel deep dives. The "Metrics Reference" is huge, but you only need ~20 metrics for daily work.
- Docs: PyTorch torch.profiler. Authoritative. The section on "Holistic Trace Analysis" covers the multi-step workflow.
- Blog: Flame Graphs — Brendan Gregg. The visualization technique that changed how every engineer reads profiler output.
- Docs: eBPF.io. Modern Linux observability. The "What is eBPF" intro is the right starting point.
- Repo: iovisor/bcc. Battle-tested BPF tools. The `tools/` directory is a treasure chest.