torch.compile (Inductor)
If you train or serve PyTorch models in 2026, your code is probably running through a compiler — even if you didn’t ask for one. torch.compile(model) is on by default in many training stacks, and almost every commercial PyTorch deployment in production today is going through Inductor whether the team realized it or not. It’s the most-deployed AI compiler in 2026 by orders of magnitude.
The pipeline has four stages, each addressing one of PyTorch’s eager-mode pain points. Dynamo hooks into CPython’s bytecode evaluation to capture graphs from arbitrary Python code. AOTAutograd lifts the dynamic backward pass into a static joint forward-backward graph. Inductor fuses what’s fusible and emits Triton kernels for GPU plus C++ for CPU. CUDA Graphs capture the whole step’s launches and replay them with one CPU API call. Together: 1.3–2× speedup on training, 1.5–4× on inference, mostly free.
This lesson covers what each stage actually does, what blocks the speedup, and how to debug when torch.compile doesn’t help as much as you expected.
TL;DR
- `torch.compile(model)` swaps PyTorch’s eager execution for graph capture (Dynamo) + autograd compilation (AOTAutograd) + kernel codegen (Inductor).
- Dynamo captures Python-level graphs by hooking into bytecode evaluation, falling back gracefully when it can’t trace.
- AOTAutograd lifts the Python autograd into a static forward+backward joint graph that the compiler can optimize together.
- Inductor lowers the graph to Triton kernels for GPU and C++ + OpenMP for CPU. Fuses pointwise ops aggressively.
- Real-world wins: 1.3–2× speedup on training, 1.5–4× on inference, mostly from kernel fusion and CUDA Graph capture. The default in PyTorch 2.5+.
Mental model
Four compilation stages, each addressing one of PyTorch’s eager-mode pain points: capture (Dynamo), autograd (AOT), codegen (Inductor), kernel launch overhead (CUDA Graphs).
Stage 1: Dynamo — graph capture from Python bytecode
The hardest part of compiling PyTorch is that user code is Python, with arbitrary control flow, dictionary lookups, and dynamic shapes. Dynamo solves this by hooking into CPython’s frame evaluation API and rewriting bytecode on the fly:
- For each `forward()` call, Dynamo runs the Python function and traces tensor ops into an FX graph.
- When it hits Python code it can’t trace (a `print`, a non-tensor branch), it inserts a graph break, runs the un-traceable bit in eager mode, then resumes tracing on the next tensor op.
- The captured graphs stay correct for any input: unlike tracing, conditionals on tensor data aren’t silently baked into one path; they produce graph breaks (and guards), and each branch gets its own graph.
```python
@torch.compile
def f(x):
    if x.sum() > 0:  # tensor predicate — graph break here, both branches traced
        return x * 2
    return x + 1
# Dynamo emits two graphs (one per branch) and dispatches on the actual tensor predicate at runtime
```

This works far better than the previous `torch.jit.trace` (which silently traced one path and missed conditionals) or `torch.jit.script` (which required rewriting code to a Python subset).
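To see a break in practice, here is a minimal sketch (toy function, illustrative shapes): the `print` can’t be traced, so Dynamo splits the function into two graphs and runs the print eagerly in between. Every graph break is also a fusion boundary, which is why hunting them down is usually the first optimization step.

```python
import torch

def step(x):
    y = x.relu() * 3          # traced into the first graph
    print("debug:", y.shape)  # untraceable Python side effect -> graph break
    return y.sum() + 1        # tracing resumes into a second graph

compiled = torch.compile(step)
compiled(torch.randn(32, 64))  # still correct: the print runs eagerly between the two graphs

# To see where (and why) Dynamo breaks graphs, run with:
#   TORCH_LOGS=graph_breaks python your_script.py
```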
Stage 2: AOTAutograd — joint forward+backward
Eager-mode autograd records the backward graph dynamically during the forward pass — fine for flexibility, terrible for compilers that want to optimize forward and backward together (so they can fuse a forward op with the activation it would have saved for backward).
AOTAutograd traces the forward and produces the functionalized joint graph (forward + backward together) ahead of time. Inductor then optimizes the whole thing.
This is why torch.compile speeds up training and not just inference — most of the wins live in fusing forward and backward together.
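A minimal training-step sketch (toy MLP, CPU-sized shapes) showing that compiling the module is enough to get a compiled backward: AOTAutograd builds the joint graph when the forward is first traced, so `loss.backward()` runs generated kernels rather than eager autograd.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 10)
)
model = torch.compile(model)          # forward AND backward get compiled
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(64, 512)
y = torch.randint(0, 10, (64,))

for _ in range(3):
    opt.zero_grad(set_to_none=True)
    loss = F.cross_entropy(model(x), y)  # compiled forward
    loss.backward()                      # compiled backward via AOTAutograd
    opt.step()
```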
Stage 3: Inductor — fusion and codegen
Inductor’s job: take the graph, fuse what’s fusible, emit kernels.
Input graph:
```python
y = (x + bias).relu()
z = y * weight
```

Naive eager: 3 kernels (add, relu, mul), with the intermediates written to and re-read from HBM between each.
Inductor: 1 fused kernel — one read each of `x`, `bias`, and `weight`, one write of `z`; the intermediates never leave registers.

Pointwise ops are fused aggressively. Reductions and matmuls are kept separate (those have different optimization regimes). On GPU, Inductor emits Triton for fused pointwise/reduction kernels and falls back to cuBLAS/CUTLASS for GEMM. On CPU, it emits C++ with OpenMP.
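For intuition, here is a hand-written Triton kernel roughly in the shape of what Inductor emits for this pattern. It is not Inductor’s actual output (dump that as shown below), `bias` and `weight` are simplified to same-shaped tensors rather than broadcast, and it assumes a CUDA GPU with Triton installed.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_mul(x_ptr, bias_ptr, w_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    b = tl.load(bias_ptr + offs, mask=mask)
    w = tl.load(w_ptr + offs, mask=mask)
    y = tl.maximum(x + b, 0.0)                   # add + relu, stays in registers
    tl.store(out_ptr + offs, y * w, mask=mask)   # mul, single write back to HBM

x = torch.randn(1 << 20, device="cuda")
bias, weight = torch.randn_like(x), torch.randn_like(x)
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
fused_add_relu_mul[grid](x, bias, weight, out, x.numel(), BLOCK=1024)
```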
You can dump the generated code:
```bash
TORCH_LOGS=output_code python train.py
```

This dumps every Triton kernel Inductor generated. Best way to learn Inductor: read its output.
Stage 4: CUDA Graph capture
Even after fusion, launching small kernels has overhead — each cudaLaunchKernel is ~5–10 µs. For a model with 100 small kernels per step, that’s 1 ms of pure launch overhead per step.
torch.compile(mode="reduce-overhead") or mode="max-autotune" enables CUDA graph capture: the whole step’s launches are recorded once into a graph, then replayed with one CPU API call. Eliminates launch overhead at the cost of static input shapes.
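What `reduce-overhead` automates can be done by hand with PyTorch’s CUDA Graphs API. A minimal capture-then-replay sketch (toy Linear layer, fixed input shape, CUDA required); new inputs are fed by copying into the captured buffer:

```python
import torch

model = torch.nn.Linear(1024, 1024, device="cuda").eval()
static_in = torch.randn(8, 1024, device="cuda")   # shapes must stay fixed

with torch.no_grad():
    # Warm up on a side stream so lazy init (cuBLAS handles etc.) happens outside capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Record every kernel launch of one forward pass into a graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)

# New data goes into the captured input buffer; replay() is one CPU call, zero per-kernel launches.
static_in.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
print(static_out.sum().item())
```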
Real numbers — Llama-3.1 8B inference, RTX 4090
| Mode | tok/s (decode) | Speedup |
|---|---|---|
| Eager PyTorch | 32 | 1.0× |
| `torch.compile` (default) | 51 | 1.6× |
| `torch.compile(mode="reduce-overhead")` | 68 | 2.1× |
| `torch.compile(mode="max-autotune")` | 75 | 2.3× |
max-autotune runs Inductor’s autotuner over GEMM tile shapes — adds compile time (a minute or two) but finds better kernels.
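To reproduce this kind of comparison on your own model, a simple timing harness is enough. A sketch below, using a toy MLP as a stand-in (swap in your own model; the warm-up iterations absorb the one-off compile and autotune cost, and a CUDA GPU is assumed):

```python
import time
import torch

def bench(fn, *args, iters=50, warmup=5):
    for _ in range(warmup):            # warm-up also triggers compilation/autotuning
        fn(*args)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()
x = torch.randn(1, 4096, device="cuda")

variants = {
    "eager": model,
    "default": torch.compile(model),
    "reduce-overhead": torch.compile(model, mode="reduce-overhead"),
    "max-autotune": torch.compile(model, mode="max-autotune"),
}
for name, fn in variants.items():
    print(f"{name:16s} {bench(fn, x) * 1e3:.3f} ms/iter")
```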
What torch.compile won’t do
- Won’t fuse across function-call boundaries that involve non-tensor Python (graph breaks).
- Won’t fix bad data layouts — if your model’s tensors are non-contiguous, Inductor generates kernels that cope with the layout, but they’re still slower than contiguous ones.
- Won’t replace specialized kernels — FlashAttention-3, PagedAttention, and custom Triton blow past Inductor’s defaults on attention. Inductor will use them if they’re available; otherwise it generates baseline kernels.
- Won’t accelerate dynamic-shape-heavy workloads — extreme shape variance triggers re-compilation. Use `dynamic=True` or pad to fixed shapes (see the sketch after this list).
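A sketch of both options for shape-varying inputs (toy Linear model): `dynamic=True` asks Dynamo to trace with symbolic sizes up front, while `torch._dynamo.mark_dynamic` marks a single dimension as variable so only that dimension is generalized.

```python
import torch

model = torch.nn.Linear(256, 256)

# Option 1: trace with symbolic shapes from the start.
compiled = torch.compile(model, dynamic=True)
compiled(torch.randn(8, 256))
compiled(torch.randn(64, 256))    # no recompilation for the new batch size

# Option 2: mark just the batch dimension as dynamic before the first call.
x = torch.randn(8, 256)
torch._dynamo.mark_dynamic(x, 0)  # dim 0 may vary without triggering a re-trace
compiled2 = torch.compile(model)
compiled2(x)
compiled2(torch.randn(32, 256))   # reuses the shape-generic graph
```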
Run it in your browser — see fusion shrink kernel count
Inductor in production does much more sophisticated fusion (reductions, layout transformations, etc.), but the pattern — group pointwise runs into single fused kernels — is the heart of the win.
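The same idea as a standalone toy (hypothetical op names, nothing like real Inductor internals): a greedy pass that collapses consecutive pointwise ops into one fused kernel and counts launches before and after.

```python
# Toy model of the fusion pass: consecutive pointwise ops collapse into one "kernel",
# while matmuls and reductions each stay on their own.
POINTWISE = {"add", "relu", "mul", "sigmoid"}

def count_kernels(ops):
    kernels, in_pointwise_run = 0, False
    for op in ops:
        if op in POINTWISE:
            if not in_pointwise_run:   # start a new fused pointwise kernel
                kernels += 1
                in_pointwise_run = True
        else:                          # matmul / reduction: its own kernel
            kernels += 1
            in_pointwise_run = False
    return kernels

graph = ["matmul", "add", "relu", "mul", "sum", "add", "sigmoid"]
print("eager kernels :", len(graph))            # 7 — one launch per op
print("fused kernels :", count_kernels(graph))  # 4 — pointwise runs collapsed
```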
Key takeaways
- `torch.compile` is the daily-driver ML compiler in 2026. Default in PyTorch 2.5+; production-validated.
- Dynamo + AOTAutograd + Inductor + CUDA Graphs — four stages, each addressing a different eager-mode pain point.
- The wins are real but uneven: 1.3–2× on training, 1.5–4× on inference, depending on how much eager mode was leaving on the table.
- `mode="reduce-overhead"` and `mode="max-autotune"` exist — try them; modest extra compile time, often noticeable extra speedup.
- Read Inductor’s output with `TORCH_LOGS=output_code` — it’s the best way to learn how the compiler thinks.
Go deeper
- Paper: “PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation” — the PyTorch 2 paper. Authoritative reference for Dynamo + AOT + Inductor.
- Docs: PyTorch — torch.compiler docs. Operational reference: modes, options, troubleshooting.
- Blog: Horace He — “What’s up with PyTorch 2.0”. The PyTorch core dev who built much of this, writing about how it works.
- Video: Edward Yang — “Inside Inductor”. Authoritative talk on what Inductor actually does.
- Docs: PyTorch — torch.compile troubleshooting. When it doesn’t speed up, this page is the first stop.
- Repo: `torch/_inductor` source. The actual codegen. Read `lowering.py` and the Triton template files.
- Blog: PyTorch Blog — “torch.compile + FSDP2 + Float8”. How torch.compile composes with the rest of the modern training stack.
- Docs: Triton documentation. The DSL Inductor codegens to. Worth understanding once.