IREE & ExecuTorch
Prereqs: MLIR Overview, Dialects & Lowering. These are MLIR-based compilers in production.
TL;DR
- IREE (Intermediate Representation Execution Environment) is Google’s open-source compiler + runtime. Compiles models (any frontend that emits StableHLO/MLIR) to a portable artifact that runs on CPU, GPU, NPU, edge silicon. Strongest production story for AMD, Apple, embedded.
- ExecuTorch is PyTorch’s mobile runtime. Takes a `torch.export()`-ed model graph, lowers through an MLIR-based path, produces a `.pte` (PyTorch ExecuTorch) artifact you ship in an app bundle.
- Both compile ahead of time. There’s no Python at runtime; the deployable artifact is a tiny binary.
- Both are MLIR-native — they compose with the dialects you saw in MLIR Overview, and accept custom dialects from hardware vendors. This is how Apple, Qualcomm, and AMD ship their NPU support.
- For 2026 deployment: ExecuTorch is the path for PyTorch → mobile (Meta, OEM partners). IREE is the path for cross-framework, cross-hardware deployment, especially edge.
Why this matters
Once your model leaves the GPU cluster and starts living on a phone, an AR headset, a robot, or a $30 microcontroller, you’re not running Python. You’re running a binary the deployment compiler produced. IREE and ExecuTorch are how that binary gets made. Both are MLIR-native ahead-of-time compilers: they take a model graph, lower it through a stack of dialects, partition the work across the device’s available hardware (CPU, GPU, NPU), and emit a small artifact you bundle in the app. They’re also the bridge between the AI research stack (PyTorch, JAX) and the device-vendor ecosystem (Apple Neural Engine, Qualcomm Hexagon, ARM Ethos, etc.) — vendors plug their NPU support in as MLIR dialects on top of these compilers.
If you’re going to do edge AI work, knowing one of these two is the gate.
Mental model
Both stacks: model → MLIR → device-specific backend → AOT artifact. IREE is hardware-broader; ExecuTorch is PyTorch-native.
Concrete walkthrough
ExecuTorch — the PyTorch story
The mobile-friendly successor to PyTorch Mobile. Workflow:
```python
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 128)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = MyModel().eval()
example_input = (torch.randn(1, 128),)

# Step 1: torch.export — captures a static FX graph.
exported = torch.export.export(model, example_input)

# Step 2: Lower to EXIR (Edge IR). Quantize if desired.
edge = to_edge(exported)

# Step 3: Pick backends. XNNPACK for CPU; Vulkan for Android GPU; Core ML for iOS NPU.
edge = edge.to_backend(XnnpackPartitioner())

# Step 4: Compile + serialize. Output is a .pte file.
program = edge.to_executorch()
with open("model.pte", "wb") as f:
    f.write(program.buffer)
```

The artifact is the `.pte` file — typically MB-scale, no Python, no dynamic-shape tracing at runtime. You load it via the C++ runtime (or one of the language bindings — Swift for iOS, Kotlin/Java for Android).
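If your build includes the Python bindings, you can also smoke-test the artifact on the host before shipping it. A minimal sketch, assuming the portable_lib pybindings are available (the import path has moved between ExecuTorch releases):

```python
# Host-side smoke test of model.pte via ExecuTorch's Python bindings.
# ASSUMPTION: portable_lib pybindings are built in your install; the module
# path and the forward() calling convention vary across releases.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("model.pte")
outputs = module.forward((torch.randn(1, 128),))
print(outputs[0].shape)  # expect torch.Size([1, 128])
```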
The partitioner decides which subgraph runs on which backend. For example: matmuls on Apple Neural Engine, control flow on CPU, custom op on a vendor NPU. Partitioning is done at compile time; the runtime just dispatches per-subgraph.
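You can watch the partitioner's decisions in the graph itself: subgraphs a backend claims are swapped out for call-delegate nodes, and whatever is left runs as portable ops. A rough sketch (the exact node targets are an assumption and vary by release):

```python
# After edge.to_backend(...), delegated subgraphs show up as call_delegate
# nodes in the exported graph. ASSUMPTION: target naming differs across
# ExecuTorch versions.
for node in edge.exported_program().graph_module.graph.nodes:
    if "call_delegate" in str(node.target):
        print("delegated to backend:", node.name)
    elif node.op == "call_function":
        print("left on the portable runtime:", node.target)
```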
IREE — the cross-framework story
IREE accepts StableHLO (from JAX, TF, ONNX, or PyTorch via torch-mlir), lowers through MLIR dialects, and emits a self-contained “VM module” (.vmfb) plus a hardware-specific binary blob.
Workflow:
```bash
# 1. Get model into StableHLO. From JAX:
python -c "import jax; ... export.export(my_jax_function).export_to_stablehlo()"

# 2. Compile with IREE for a chosen backend.
iree-compile --iree-hal-target-backends=cuda --iree-hal-cuda-llvm-target-arch=sm_90a my_model.mlirbc -o my_model.vmfb

# 3. Run via the IREE runtime.
iree-run-module --module=my_model.vmfb --device=cuda --function=forward --input=1x128xf32=...
```

What `iree-compile` runs internally:
- Parse StableHLO.
- Convert to IREE’s input dialects.
- Run high-level optimizations (op fusion, layout selection).
- Lower to dispatch dialects (one dispatch per fused kernel).
- Lower per-dispatch to the target backend (LLVM-CPU, NVVM, ROCm, Vulkan/SPIR-V, custom).
- Serialize as VMFB.
Same MLIR pipeline as in Dialects & Lowering, just with IREE-specific high-level dialects on top.
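The same round trip is scriptable from Python, which is handy in tests. A minimal sketch, assuming the iree-compiler and iree-runtime pip packages; the helper names here are assumptions that have shifted between releases:

```python
# Compile a tiny StableHLO function and run it on the portable CPU backend.
# ASSUMPTIONS: compile_str / load_vm_flatbuffer exist with these signatures
# in your installed release; check the IREE Python API docs if not.
import numpy as np
import iree.compiler as ireec
import iree.runtime as ireert

MLIR = """
func.func @forward(%arg0: tensor<1x128xf32>) -> tensor<1x128xf32> {
  %0 = stablehlo.add %arg0, %arg0 : tensor<1x128xf32>
  return %0 : tensor<1x128xf32>
}
"""

vmfb = ireec.compile_str(MLIR, target_backends=["llvm-cpu"], input_type="stablehlo")
module = ireert.load_vm_flatbuffer(vmfb, driver="local-task")
out = module.forward(np.ones((1, 128), dtype=np.float32))
print(out.to_host()[0, :4])  # expect [2. 2. 2. 2.]
```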
Why two compilers exist
ExecuTorch was created (2023–2024) because PyTorch needed a mobile-first answer with tight integration to torch.export. IREE was created (2020) as a generally-portable AI runtime. They overlap but serve different audiences:
| | ExecuTorch | IREE |
|---|---|---|
| Frontend | PyTorch only | StableHLO (JAX/TF/ONNX/torch-mlir) |
| Mobile story | First-class | Good; less polished than ExecuTorch |
| Vendor NPU support | Fast (Meta partnerships) | Broader (any NPU vendor can write an MLIR backend) |
| Quantization | Built-in via torch.ao.quantization | Per-backend; less unified |
| Best fit | PyTorch + iOS / Android | Cross-framework + custom hardware |
In 2026: ExecuTorch dominates PyTorch-on-phone deployments (Meta uses it, OEM partners ship with it). IREE is the answer when you need a JAX model on a custom chip, or when you want one compiler to target many heterogeneous devices.
Quantization — where the deployable wins
Both compilers support int8 / int4 quantization integrated with the lowering pipeline:
- ExecuTorch + PT2E quantization: a prepare/calibrate/convert pass ahead of `to_edge()` produces a quantized graph with quantize/dequantize ops at boundaries; the partitioner decides which backend handles the quantized math (see the sketch after this list).
- IREE + StableHLO/MHLO quantization: similar story; integer ops travel down the lowering chain to the target.
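A minimal PT2E sketch against the MyModel from the walkthrough above. The quantizer import path is an assumption; it has lived in both the torch.ao and executorch trees across releases:

```python
# Quantize MyModel with PT2E, then lower exactly as before.
# ASSUMPTION: import paths vary by release (older PyTorch used
# capture_pre_autograd_graph instead of export_for_training).
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

graph = torch.export.export_for_training(model, example_input).module()
prepared = prepare_pt2e(graph, quantizer)
prepared(*example_input)             # calibration with representative data
quantized = convert_pt2e(prepared)   # q/dq ops now sit at op boundaries

edge = to_edge(torch.export.export(quantized, example_input))
edge = edge.to_backend(XnnpackPartitioner())  # backend claims the int8 math
```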
A 4-bit Llama-3.2-3B compiled via ExecuTorch is ~2 GB (3B parameters at 4 bits is ≈1.5 GB, plus embeddings and runtime overhead) and runs at ~15 tokens/sec on a recent iPhone Pro using the Core ML backend. The same model compiled via IREE for a Qualcomm Hexagon NPU lands at similar throughput on Snapdragon. The compiler is what makes this size and speed possible — the model file is just weights; the kernels that run them on each chip come from these MLIR pipelines.
What “deploying via IREE” looks like in production
The model gets compiled once, ahead of time, on a developer machine. The .vmfb ships with the app. The runtime is small (a few hundred KB statically linked), with no Python, no PyTorch. The runtime exposes a C API; you wrap it in Swift / Kotlin / Rust / whatever your app uses. This is roughly the same shape as the Serve & Ship capstone — but instead of llama.cpp doing the GGUF interpretation, IREE or ExecuTorch is doing the generated-kernel dispatch.
Run it in your browser — toy partitioning simulator
The shape of the problem (one backend per op, with a transfer cost paid whenever execution switches backends) is the same one ExecuTorch’s Partitioner and IREE’s dispatch-formation pass solve. Real partitioners use much richer cost models (fused-kernel memory traffic, batch effects, hardware utilization), but the algorithm is structurally the same.
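Here is that toy problem as a small dynamic program over a linear chain of ops; every cost below is a made-up number for illustration:

```python
# Toy partitioner: assign each op in a linear graph to one backend,
# minimizing compute cost plus a transfer penalty paid whenever two
# consecutive ops land on different backends. All costs are invented.
OPS = ["matmul", "relu", "matmul", "softmax"]
BACKENDS = ["cpu", "npu"]
COMPUTE = {
    "matmul":  {"cpu": 10.0, "npu": 2.0},
    "relu":    {"cpu": 1.0,  "npu": 1.5},
    "softmax": {"cpu": 2.0,  "npu": 8.0},  # pretend the NPU is bad at this
}
TRANSFER = 3.0  # penalty for moving activations between backends

def partition(ops):
    # best[b] = (cost, assignment) for the prefix ending on backend b
    best = {b: (COMPUTE[ops[0]][b], [b]) for b in BACKENDS}
    for op in ops[1:]:
        best = {
            b: min(
                (cost + COMPUTE[op][b] + (TRANSFER if prev != b else 0.0),
                 path + [b])
                for prev, (cost, path) in best.items()
            )
            for b in BACKENDS
        }
    return min(best.values())

cost, assignment = partition(OPS)
print(cost, list(zip(OPS, assignment)))
# -> 10.5: matmuls and relu stay on npu, softmax falls back to cpu
```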
Key takeaways
- IREE and ExecuTorch are the two MLIR-based deployment compilers. They take a research-framework model and produce a small, fast, no-Python deployable.
- ExecuTorch is PyTorch-native with the strongest mobile story. IREE is cross-framework with the broadest hardware backend support.
- Both AOT-compile. The model + kernels are baked into a single artifact; runtime is tiny and Python-free.
- The partitioner is the magic. It decides which subgraph runs on which backend (CPU, GPU, NPU, custom) given vendor constraints.
- Vendor NPU support arrives via MLIR backends. Apple, Qualcomm, AMD, and the long tail of edge silicon all plug in here.
Go deeper
- Docs: ExecuTorch Documentation. Authoritative. The "Concepts" + "Quick Start" pages walk you from `torch.export` to a deployed `.pte` in 30 minutes.
- Docs: IREE Documentation. Up-to-date. The "Getting Started" + "Compiler" sections cover the StableHLO → VMFB pipeline.
- Paper: ExecuTorch: From PyTorch to On-Device AI. The system paper. Section 3 has the partitioner cost model; section 5 has real deployment numbers across iOS, Android, and embedded.
- Blog: PyTorch Blog — ExecuTorch Alpha. Original announcement; useful for the design motivation.
- Video: IREE — Cross-Platform AI Compilation. Talk by core IREE devs. Best motivation for "why a separate compiler from XLA".
- Repo: pytorch/executorch. `backends/` for Apple/Vulkan/Hexagon/XNNPACK; `examples/` for end-to-end recipes.
- Repo: iree-org/iree. `compiler/src/iree/compiler/Codegen/` is the codegen pipeline; `samples/` has working PyTorch + JAX + ONNX examples.