Apple Neural Engine
In a managed-runtime cloud world, “the GPU runs my model” is a complete enough sentence. PyTorch dispatches to CUDA, CUDA dispatches to a tensor core, and you mostly just need to know that the tensor core exists. The runtime hides the silicon’s idiosyncrasies; the docs tell you everything you need.
The ANE is the opposite case. It’s the most-shipped NPU in human history — it lives in every iPhone since 2017 and every Apple Silicon Mac, around two billion devices in total — and Apple has never publicly documented its op set. What runs on the ANE is decided empirically: the Core ML compiler partitions your graph at first inference, and the only ground truth is what shows up in the Instruments profiler. Python (coremltools) is the SDK; the C++ Core ML compiler running on the user’s device is the thing that decides where each op lands.
So this lesson is the field guide. Not “what does the ANE do” in the abstract, but “what does the ANE actually accept, what makes it bail, and how do you read the trace that tells you which it did?”
TL;DR
- The Apple Neural Engine (ANE) is a fixed-function NPU on every iPhone since A11 (2017) and every M-series Mac. 16-core in A17/M3 at ~35 TOPS (INT8); M4 bumps the same 16-core ANE to ~38 TOPS. Lives next to the CPU and GPU on the same SoC die. (When you see “100+ TOPS” in Apple marketing, that’s the full SoC aggregate across CPU + GPU + ANE; the ANE silicon itself is ~38 TOPS.)
- Apple does not document the ANE op set publicly. What runs on ANE is determined empirically — Core ML’s compiler decides, you observe via Instruments.
- ANE prefers: convs, matmuls, FP16 attention, INT8 quantized weight-only, fixed-shape inputs, and tensors with fewer than 16K elements per dimension. It rejects: dynamic shapes, exotic activations (gelu-fast variants), some attention layouts, large embedding tables.
- Models compiled with `compute_units=ALL` automatically partition between ANE / GPU / CPU. Use `compute_units=CPU_AND_NE` to force “ANE or fail” — the right setting for benchmarking.
- The ANE profiler (Xcode → Instruments → Core ML template) is the only ground truth. It shows per-op compute-unit assignment and timing. Anything else is guessing.
Why this matters
The ANE is the most-shipped NPU in human history (~2 billion devices). Every iOS Intelligence feature, every on-device Core ML model, every iPhone camera “magic” runs on the ANE when it can. If you ship anything on iOS, your model lands on the ANE or it doesn’t — and the difference is 5–10× in performance per watt.
But the ANE is also the worst-documented NPU. Apple’s official guidance is “use Core ML and trust the compiler.” For 80% of models that works. The remaining 20% — your model — needs you to know the actual rules.
Mental model
Three things to internalize:
- The compiler decides at first inference, not at convert time. A model that looks ANE-ready at conversion can still bail to GPU at runtime.
- Partitioning is per-op. A 32-layer transformer might run 30 layers on ANE and 2 (with the layer-norm Apple doesn’t like) on GPU. Total throughput = throughput of slowest-class section + transfer overhead.
- The ANE has its own SRAM (~16 MB on M3-class). Tensors that don’t fit get spilled to system memory through the ANE’s narrow port — drastically slowing things down. This is the “16K element cap” gotcha.
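The SRAM spill logic is easy to sanity-check with arithmetic. A back-of-envelope sketch, treating the ~16 MB figure above as an assumed budget (an estimate, not a published Apple spec):

```python
# Back-of-envelope: does an FP16 tensor fit in the ANE's on-chip SRAM?
# The ~16 MB budget is the estimate quoted above, not a published Apple spec.
ANE_SRAM_BYTES = 16 * 1024 * 1024
FP16_BYTES = 2

def fits_in_sram(shape):
    """Return (tensor_bytes, fits) for an FP16 tensor of the given shape."""
    elems = 1
    for d in shape:
        elems *= d
    tensor_bytes = elems * FP16_BYTES
    return tensor_bytes, tensor_bytes <= ANE_SRAM_BYTES

# A (1, 512, 768) FP16 activation: 786,432 bytes -> fits comfortably.
print(fits_in_sram((1, 512, 768)))
# A 4096 x 4096 FP16 attention score matrix: 33,554,432 bytes -> spills.
print(fits_in_sram((1, 4096, 4096)))
```

The second case is exactly the long-context failure mode: the score matrix alone is twice the assumed SRAM budget before any weights are loaded.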
What the ANE actually is
The ANE is an array of fixed-function matrix-multiply-add tiles. Each tile does an INT8 or FP16 multiply-accumulate (MAC) in one cycle. The hardware supports a small set of ops natively:
- Convolution (conv2d, depthwise conv, transposed conv)
- Matmul / dense
- Element-wise: add, multiply, ReLU, sigmoid, tanh
- Layer norm (sometimes — depends on dims)
- Softmax (sometimes)
- Attention (sometimes — see below)
Anything else: bail to GPU/CPU. This is why running “MobileNet” on ANE is trivial (all conv) but running “Llama” is finicky (the attention block has shape patterns that fall in and out of the supported set).
The 16K-element cap
Each tensor dimension above ~16K elements forces ANE to split the op into multiple passes (re-tiling through SRAM). For LLM context that means:
- Sequence length 2048 with hidden dim 4096 = 8.4M elements per tensor. The ANE handles this fine via internal tiling.
- Sequence length 16384 with hidden dim 4096 = the seq dim crosses 16K — the attention’s `seq×seq` score matrix becomes 16K×16K. The ANE refuses or runs at a small fraction of peak.
For this reason, Llama-style long-context (over 4K) typically falls back to GPU on iPhone, even when the model “fits”. This is expected behavior — don’t chase it.
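The cap arithmetic is worth making concrete. A toy checker, assuming the ~16K-per-dimension threshold described above (the exact number is empirical, not documented by Apple):

```python
# Toy checker for the ~16K-elements-per-dimension cap. The threshold is
# empirical (observed behavior), not a documented Apple constant.
DIM_CAP = 16_384

def attention_score_shape(seq_len):
    """Shape of the seq x seq attention score matrix."""
    return (seq_len, seq_len)

def crosses_cap(shape):
    """True if any dimension reaches the cap, forcing re-tiling or a bail."""
    return any(d >= DIM_CAP for d in shape)

for seq in (2048, 4096, 16384):
    shape = attention_score_shape(seq)
    verdict = "re-tiles / may bail to GPU" if crosses_cap(shape) else "ok on ANE"
    print(f"seq={seq:>6}: scores {shape[0]}x{shape[1]} -> {verdict}")
```

Note that the dangerous tensor is the score matrix, which grows quadratically in sequence length; the hidden dim (4096) never comes close to the cap.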
Quantization for ANE
ANE prefers FP16 throughout. Apple’s compiler can lower FP16 to ANE; FP32 won’t go to ANE.
For weight-only quantization, the supported scheme runs offline, in Python:

```python
import coremltools.optimize.coreml as cto

quantized_model = cto.linear_quantize_weights(
    mlmodel,
    config=cto.OptimizationConfig(
        global_config=cto.OpLinearQuantizerConfig(
            mode="linear_symmetric", weight_threshold=512
        )
    ),
)
```

Symmetric, per-tensor INT8 weight-only quantization is the ANE-friendly default. The ANE dequantizes weights to FP16 on the fly. This gets you ~50% memory savings without losing ANE residency.
What does not work: per-channel quantization (ANE supports per-tensor only on most paths), INT4 weight quantization (unsupported as of Core ML 7), mixed-precision activation quantization (FP16 activations is the rule).
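The symmetric per-tensor scheme itself is a few lines of arithmetic. A numpy sketch of what the quantizer computes (illustrative only; coremltools does this for you):

```python
import numpy as np

def quantize_symmetric_per_tensor(w):
    """Symmetric per-tensor INT8: one scale for the whole tensor, zero point 0."""
    scale = np.abs(w).max() / 127.0            # map max |w| onto the INT8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_fp16(q, scale):
    """Roughly what the ANE does on the fly: INT8 weights back to FP16."""
    return q.astype(np.float16) * np.float16(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)
q, scale = quantize_symmetric_per_tensor(w)
w_hat = dequantize_fp16(q, scale)

print("fp16 bytes:", w.astype(np.float16).nbytes)   # 1179648
print("int8 bytes:", q.nbytes)                      # 589824 -> the ~50% saving
print("max abs error:", float(np.abs(w - w_hat.astype(np.float32)).max()))
```

One scale for the whole tensor is what "per-tensor" means; per-channel would give each output channel its own scale, which is why it falls outside most ANE paths.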
Attention on ANE
The transformer attention block is the make-or-break for LLMs on ANE.
What lands on ANE:
- QKV projections (matmul → ANE).
- Scaled-dot-product with fixed sequence length — typically chunked to 256 or 512 token windows.
- The output projection (matmul → ANE).
What bails to GPU:
- Dynamic-length attention (variable sequence per call).
- Long-context attention (seq over 1024 tokens at d_model over 4096 → exceeds tile budget).
- Custom attention masks (some patterns).
The MLX team’s iOS LLM examples typically chunk the KV cache into fixed-size windows specifically to keep attention on ANE. This is the LLM-specific ANE optimization.
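The fixed-window idea can be sketched in numpy. This shows only the shape argument (each tile stays at most window × window no matter the total context length); it is not MLX’s actual kernel, and real implementations also chunk the softmax, flash-attention style:

```python
import numpy as np

def chunked_scores(q, k, window=512):
    """Compute the seq x seq attention scores as window x window tiles,
    so no single matmul output has a dimension near the ~16K cap.
    Illustrative sketch only, not a production kernel."""
    seq, d = q.shape
    out = np.zeros((seq, seq), dtype=np.float32)
    for i in range(0, seq, window):
        for j in range(0, seq, window):
            # each tile is at most window x window -- ANE-friendly
            out[i:i + window, j:j + window] = q[i:i + window] @ k[j:j + window].T
    return out / np.sqrt(d)

q = np.random.default_rng(0).standard_normal((2048, 64)).astype(np.float32)
k = np.random.default_rng(1).standard_normal((2048, 64)).astype(np.float32)
scores = chunked_scores(q, k)
print(scores.shape)  # same values (up to float rounding) as one big matmul
```

The reduction dimension (`d`) is untouched, so the tiled result matches the monolithic matmul; only the shape of each intermediate tensor changes.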
Profiling — the only source of truth
Open Xcode → Product → Profile (⌘I) → choose the “Core ML” template. Run your app on a real device. The trace shows:
- A timeline per compute unit (ANE / GPU / CPU).
- For each Core ML inference call, a per-op breakdown showing which unit each ran on.
- Power per unit.
This is the only way to know what actually ran on ANE. Documentation can’t tell you because Apple’s compiler decisions change between iOS versions.
Reading the trace
Common patterns in Instruments:
| Pattern | Meaning | What to do |
|---|---|---|
| 100% ANE for the whole model | Ideal | Ship it |
| 95% ANE, 5% GPU spike at the layer norm | Layer norm fell back; usually fine | If model is fast enough, leave it |
| 70% ANE, 30% GPU alternating | Attention or some structural op bailing | Try chunking attention; check seq length |
| 0% ANE despite `compute_units=ALL` | Whole model rejected by ANE compiler | Re-check FP16 conversion; check for dynamic shapes |
| ANE active but throughput is low | Likely 16K-cap tiling overhead | Reduce hidden dim or sequence length |
A typical conversion + verification flow
The conversion is offline Python (your dev machine). The verification is on-device, in Instruments.
```python
import torch
import coremltools as ct

# 1. Trace the model with sample inputs
model = MyModel()
example = torch.randn(1, 512, 768)
traced = torch.jit.trace(model, example)

# 2. Convert with FP16 (ANE precision)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 512, 768))],
    compute_precision=ct.precision.FLOAT16,
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # force ANE or fail
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS17,
)

# 3. (Optional) INT8 weight-only quantization
import coremltools.optimize.coreml as cto

mlmodel = cto.linear_quantize_weights(
    mlmodel,
    config=cto.OptimizationConfig(
        global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
    ),
)

mlmodel.save("MyModel.mlpackage")
```

Then in your iOS app, run the model once, open Instruments, and confirm ANE residency. No model is “shipped to ANE” until you’ve seen it in Instruments.
When to use ANE vs GPU vs CPU
| Workload | Best target |
|---|---|
| MobileNet / EfficientNet / image classification | ANE — easy 100% residency |
| Whisper.cpp speech | GPU (Metal) — Whisper’s attention falls outside ANE’s friendly zone |
| Llama-3.2-1B / 3B chat | GPU (Metal via llama.cpp) — same reason |
| Stable Diffusion image gen | Mixed — UNet on ANE, decoder on GPU |
| Object detection (YOLO) | ANE for the backbone, CPU for NMS |
| Custom CNN | ANE almost always |
The general heuristic: classic CNN-shaped workloads → ANE. Modern LLM-shaped workloads → GPU. ANE is best on the workloads that defined “mobile ML” pre-2023; LLMs broke many of its assumptions.
Run it in your browser
A useful demo: simulate the per-op partitioning decision the ANE compiler makes. Lets you see why some models hit 100% ANE and others hit 60%.
The pattern is the answer to “why does my LLM’s ANE residency drop when I bump max_context?” — attention crosses the cap.
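A toy version of that partitioning simulation, using the simplified empirical rules from this lesson rather than the real, undocumented Core ML compiler logic:

```python
# Toy per-op partitioner using the simplified rules from this lesson.
# The real Core ML compiler is undocumented; this only illustrates why
# residency drops when an op or a dimension falls outside the friendly zone.
ANE_OPS = {"conv2d", "matmul", "add", "mul", "relu", "sigmoid", "tanh", "softmax"}
DIM_CAP = 16_384

def assign(op, shape, fp16=True):
    if not fp16:
        return "GPU"                      # FP32 never lands on ANE
    if op not in ANE_OPS:
        return "GPU"                      # unsupported op -> bail
    if any(d >= DIM_CAP for d in shape):
        return "GPU"                      # crosses the ~16K-per-dim cap
    return "ANE"

def residency(graph):
    units = [assign(op, shape) for op, shape in graph]
    return 100.0 * units.count("ANE") / len(units), units

cnn = [("conv2d", (1, 64, 112, 112))] * 10
llm_4k = [("matmul", (4096, 4096)), ("softmax", (4096, 4096)),
          ("gelu_fast", (4096, 4096))]
llm_32k = [("matmul", (32768, 4096)), ("softmax", (32768, 32768))]

print(residency(cnn)[0])      # 100.0: classic CNN, full residency
print(residency(llm_4k)[0])   # two of three ops land; gelu_fast bails
print(residency(llm_32k)[0])  # 0.0: long context crosses the cap
```

Bumping `max_context` past the cap moves every attention score op into the third case at once, which is why residency falls off a cliff rather than degrading gradually.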
Key takeaways
- The ANE is the most-shipped NPU on Earth (~2 B devices) but its op set is undocumented; everything is empirical.
- `compute_units=CPU_AND_NE` for benchmarking; `compute_units=ALL` for production. Always verify residency in Instruments.
- The 16K-element cap breaks long-context attention; expect LLMs over 4K context to fall back to GPU. This is normal.
- FP16 throughout, INT8 weight-only via `linear_quantize_weights`, symmetric per-tensor — the ANE-friendly quantization recipe.
- CNN-shaped workloads → ANE. LLM-shaped workloads → GPU. The original 2017 ANE design predates the transformer era.
- One unsupported op can sink whole-model ANE residency — be conservative with custom activations.
Go deeper
- Docs: Core ML Documentation — official guidance; useful but deliberately abstract about ANE specifics. Read the "Performance" section.
- Docs: Core ML Tools Optimization Guide — quantization recipes specifically for ANE residency.
- Blog: Deploying Transformers on the Apple Neural Engine — Apple's own guide to making transformers ANE-friendly. Cited in every iOS LLM project.
- Blog: apple/ml-ane-transformers (reference impl) — the reference attention rewrite that maximizes ANE residency. Read the README + the attention diff.
- Paper: iLLM: Efficient On-Device LLM Inference on Apple Neural Engine — Apple's recent paper on quantization and chunking to keep LLMs on ANE. The frontier reference for 2026.
- Video: WWDC24 — Bring Your ML Model to Apple Silicon — the 2024 WWDC session: what's new in coremltools 7, the optimization workflow.
- Docs: Profiling ML Models with Instruments — how to read the Core ML trace in Instruments; the only source of ground truth.