
torch.compile (Inductor)

If you train or serve PyTorch models in 2026, your code is probably running through a compiler — even if you didn’t ask for one. torch.compile(model) is on by default in many training stacks, and almost every commercial PyTorch deployment in production today is running through Inductor, whether the team realized it or not. It’s the most-deployed AI compiler in 2026 by orders of magnitude.

The pipeline has four stages, each addressing one of PyTorch’s eager-mode pain points. Dynamo hooks into CPython’s bytecode evaluation to capture graphs from arbitrary Python code. AOTAutograd lifts the dynamic backward pass into a static joint forward-backward graph. Inductor does the codegen, emitting Triton kernels for GPU and C++ for CPU. CUDA Graphs capture the whole step’s kernel launches and replay them with one CPU API call. Together: 1.3–2× speedup on training, 1.5–4× on inference, mostly free.

This lesson covers what each stage actually does, what blocks the speedup, and how to debug when torch.compile doesn’t help as much as you expected.

TL;DR

  • torch.compile(model) swaps PyTorch’s eager execution for graph capture (Dynamo) + autograd compilation (AOTAutograd) + kernel codegen (Inductor).
  • Dynamo captures Python-level graphs by hooking into bytecode evaluation, falling back gracefully when it can’t trace.
  • AOTAutograd lifts the Python autograd into a static forward+backward joint graph that the compiler can optimize together.
  • Inductor lowers the graph to Triton kernels for GPU and C++ + OpenMP for CPU. Fuses pointwise ops aggressively.
  • Real-world wins: 1.3–2× speedup on training, 1.5–4× on inference, mostly from kernel fusion and CUDA Graph capture. The default in PyTorch 2.5+.

Mental model

Four compilation stages, each addressing one of PyTorch’s eager-mode pain points: capture (Dynamo), autograd (AOT), codegen (Inductor), kernel launch overhead (CUDA Graphs).
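
Before walking through the stages, here is the smallest possible usage. A minimal sketch: the toy model and shapes are placeholders, and it assumes a CUDA GPU (drop the .cuda() and device arguments for CPU). The point is that the first call pays the compilation cost and later calls run the compiled artifact.

import torch
import torch.nn as nn

# Toy model; any nn.Module works. Assumes a CUDA GPU is available.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
compiled = torch.compile(model)              # Dynamo + AOTAutograd + Inductor, default mode

x = torch.randn(8, 1024, device="cuda")
out = compiled(x)                            # first call: capture + compile (seconds)
out = compiled(x)                            # later calls: reuse the compiled kernels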

Stage 1: Dynamo — graph capture from Python bytecode

The hardest part of compiling PyTorch is that user code is Python, with arbitrary control flow, dictionary lookups, and dynamic shapes. Dynamo solves this by hooking into CPython’s frame evaluation API and rewriting bytecode on the fly:

  • For each forward() call, Dynamo runs the Python function and traces tensor ops into an FX graph.
  • When it hits Python code it can’t trace (a print, a non-tensor branch), it inserts a graph break, runs the un-traceable bit in eager mode, then resumes tracing on the next tensor op.
  • Each captured graph is cached with guards recording the assumptions it was traced under (input shapes, dtypes, relevant Python values); later calls reuse the cached graph while the guards hold and trigger a recompile when they don’t.
@torch.compile
def f(x):
    if x.sum() > 0:   # tensor predicate — graph break here, both branches traced
        return x * 2
    return x + 1

# Dynamo emits two graphs (one per branch) and dispatches based on the
# actual tensor predicate at runtime.

This works far better than the previous torch.jit.trace (which silently traced one path and missed conditionals) or torch.jit.script (which required rewriting code to a Python subset).
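
To find out where Dynamo is breaking the graph on your own code, torch._dynamo.explain reports graph and break counts. A minimal sketch; the exact fields on its output have shifted a bit between releases, so check your version:

import torch

def f(x):
    if x.sum() > 0:                       # data-dependent branch: forces a graph break
        return x * 2
    return x + 1

explanation = torch._dynamo.explain(f)(torch.randn(8))
print(explanation.graph_count)            # how many graphs Dynamo captured
print(explanation.graph_break_count)      # how many times it fell back to eager
print(explanation.break_reasons)          # why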

Stage 2: AOTAutograd — joint forward+backward

Eager-mode autograd records the backward graph dynamically during the forward pass — fine for flexibility, terrible for compilers that want to optimize forward and backward together (for example, deciding whether to save an activation for backward or recompute it inside a fused backward kernel).

AOTAutograd traces the forward and produces the functionalized joint graph (forward + backward together) ahead of time. Inductor then optimizes the whole thing.

This is why torch.compile speeds up training and not just inference — most of the wins live in fusing forward and backward together.
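
You can watch this happen by compiling a module, running forward and backward, and asking the logging system for the graphs AOTAutograd produced. A sketch; the aot_graphs logging knob exists in recent releases, but the dump format varies:

# Run as: TORCH_LOGS=aot_graphs python aot_demo.py
import torch
import torch.nn as nn

model = torch.compile(nn.Linear(512, 512))
x = torch.randn(16, 512, requires_grad=True)

loss = model(x).sum()
loss.backward()        # gradients flow through the backward graph AOTAutograd traced ahead of time
print(x.grad.shape)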

Stage 3: Inductor — fusion and codegen

Inductor’s job: take the graph, fuse what’s fusible, emit kernels.

Input graph:

    y = (x + bias).relu()
    z = y * weight

Naive eager: 3 kernels (add, relu, mul) — 3 reads + 3 writes from HBM.
Inductor: 1 fused kernel — 1 read each of x, bias, and weight, 1 write of z.

Pointwise ops are fused aggressively. Reductions and matmuls are kept separate (those have different optimization regimes). On GPU, Inductor emits Triton for fused pointwise/reduction kernels and falls back to cuBLAS/CUTLASS for GEMM. On CPU, it emits C++ with OpenMP.

You can dump the generated code:

TORCH_LOGS=output_code python train.py

This dumps every Triton kernel Inductor generated. Best way to learn Inductor: read its output.
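
The output is ordinary Triton. For orientation, a hand-written kernel for the add+relu+mul example above would look roughly like the sketch below; real Inductor output adds autotuning configs, size hints, and generated names, so treat this as illustrative only, not actual Inductor output.

import triton
import triton.language as tl

@triton.jit
def fused_add_relu_mul(x_ptr, bias_ptr, w_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    b = tl.load(bias_ptr + offs, mask=mask)
    w = tl.load(w_ptr + offs, mask=mask)
    y = tl.maximum(x + b, 0.0)                   # add + relu stay in registers
    tl.store(out_ptr + offs, y * w, mask=mask)   # mul, then the single write to HBM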

Stage 4: CUDA Graph capture

Even after fusion, launching small kernels has overhead — each cudaLaunchKernel is ~5–10 µs. For a model with 100 small kernels per step, that’s 1 ms of pure launch overhead per step.

torch.compile(mode="reduce-overhead") or mode="max-autotune" enables CUDA graph capture: the whole step’s launches are recorded once into a graph, then replayed with one CPU API call. Eliminates launch overhead at the cost of static input shapes.
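
Turning it on is one argument; the catch is that the captured graph bakes in the input shapes, so keep them fixed. A sketch, assuming a CUDA GPU and a placeholder model:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
compiled = torch.compile(model, mode="reduce-overhead")   # adds CUDA Graph capture

x = torch.randn(8, 1024, device="cuda")
for _ in range(3):
    out = compiled(x)      # warm-up iterations record and then capture the CUDA graph
out = compiled(x)          # steady state: the whole step replays with one CPU call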

Real numbers — Llama-3.1 8B inference, RTX 4090

Mode                                   | tok/s (decode) | Speedup
Eager PyTorch                          | 32             | 1.0×
torch.compile (default)                | 51             | 1.6×
torch.compile(mode="reduce-overhead")  | 68             | 2.1×
torch.compile(mode="max-autotune")     | 75             | 2.3×

max-autotune runs Inductor’s autotuner over GEMM tile shapes — adds compile time (a minute or two) but finds better kernels.

What torch.compile won’t do

  • Won’t fuse across function-call boundaries that involve non-tensor Python (graph breaks).
  • Won’t fix bad data layouts — if your model’s tensors are non-contiguous, Inductor generates kernels that handle the non-contiguous layout, but they’re still slower than the contiguous equivalents.
  • Won’t replace specialized kernels — FlashAttention-3, Paged-Attention, custom Triton blow past Inductor’s defaults on attention. Inductor will use them if they’re available; otherwise it generates baseline kernels.
  • Won’t accelerate dynamic-shape-heavy workloads — extreme shape variance triggers re-compilation. Use dynamic=True or pad to fixed shapes (see the sketch after this list).
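
If your shapes genuinely vary (sequence lengths, ragged batches), you can opt into symbolic shapes instead of eating a recompile per new shape. A sketch; torch._dynamo.mark_dynamic is a semi-private API whose details have shifted between releases:

import torch
import torch.nn as nn

model = nn.Linear(512, 512)

# Option 1: treat shapes symbolically from the first compile
compiled = torch.compile(model, dynamic=True)

# Option 2: mark only the dimension that varies
x = torch.randn(7, 512)
torch._dynamo.mark_dynamic(x, 0)     # dim 0 (batch) may change without recompiling
out = compiled(x)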

Run it in your browser — see fusion shrink kernel count

Build a tiny op graph, simulate Inductor’s pointwise fusion, and count the kernels before and after.
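
A minimal, runnable sketch of the same idea; the op list and the pointwise classification below are made up for illustration and are far simpler than Inductor’s real scheduler.

# Toy "graph": an ordered list of op names. Consecutive pointwise ops fuse into one kernel.
POINTWISE = {"add", "relu", "mul", "sigmoid"}

def count_kernels(ops, fuse=True):
    if not fuse:
        return len(ops)                 # eager: one kernel launch per op
    kernels, in_run = 0, False
    for op in ops:
        if op in POINTWISE:
            if not in_run:
                kernels += 1            # start a new fused pointwise kernel
                in_run = True
        else:
            kernels += 1                # matmuls/reductions get their own kernel
            in_run = False
    return kernels

graph = ["add", "relu", "matmul", "mul", "add", "sigmoid", "sum"]
print("eager kernels:", count_kernels(graph, fuse=False))   # 7
print("fused kernels:", count_kernels(graph, fuse=True))    # 4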

Inductor in production does much more sophisticated fusion (reductions, layout transformations, etc.), but the pattern — group pointwise runs into single fused kernels — is the heart of the win.

Quick check

You add `torch.compile(model)` to a Llama-class model and see only ~10% speedup, far less than expected. What's the most likely cause?

Key takeaways

  1. torch.compile is the daily-driver ML compiler in 2026. Default in PyTorch 2.5+; production-validated.
  2. Dynamo + AOTAutograd + Inductor + CUDA Graphs — four stages, each addressing a different eager-mode pain point.
  3. The wins are real but uneven. 1.3–2× on training, 1.5–4× on inference, depending on how much eager mode was leaving on the table.
  4. mode="reduce-overhead" and mode="max-autotune" exist — try them; modest extra compile time, often noticeable extra speedup.
  5. Read Inductor’s output with TORCH_LOGS=output_code — it’s the best way to learn how the compiler thinks.

Go deeper
