Core ML & ANE
Prereqs: llama.cpp Internals, ExecuTorch. This lesson is the Apple-specific deep dive.
In a cloud stack, “what hardware are we running on?” gets answered once at provisioning time — H100s, A100s, TPUs — and the Python serving code never thinks about it again. The runtime hides the chip.
Apple does the opposite. An iPhone, an iPad, and a Mac all share roughly the same software stack, but each has its own mix of CPU cores, GPU cores, and a dedicated Apple Neural Engine (ANE) — Apple’s NPU. The same .mlpackage you ship has to run efficiently on all of them. So Apple split the deployment story in two: an offline conversion step (Python coremltools, runs on your dev machine) and an on-device runtime (C++/Swift, decides per-op which silicon to use). The conversion ships a graph; the device picks the engine.
The cloud reflex of “just write Python and trust the runtime” is exactly the wrong instinct here. The Python is the SDK; what runs on the device is the Core ML runtime, and the ANE has opinions. This lesson is about those opinions — what runs on the ANE, what falls back, and what the iOS Intelligence stack actually does.
TL;DR
- Core ML is Apple’s on-device ML framework. Models are .mlpackage (the modern format, replacing .mlmodel) and run on CPU, GPU (via Metal), or the Apple Neural Engine (ANE) — Apple’s NPU. The framework dispatches per-op based on what the chosen compute unit can do.
- The Apple Neural Engine is the silicon that Apple’s iOS Intelligence runs on: ~35 TOPS at INT8 on A17 Pro, ~38 TOPS on M4. (The “100+ TOPS” you see in Apple marketing is the full-SoC aggregate across CPU + GPU + ANE, not the ANE alone.) Privacy-preserving (on-device, never the cloud). Heavily optimized for transformer inference in the M3/M4 generation.
- coremltools is the conversion toolkit: PyTorch / TensorFlow / ONNX → .mlpackage. The 2024+ versions support torch.export-shaped graphs and palettized (LUT-quantized) weights.
- MLX is Apple’s research-grade framework — like PyTorch but Apple-native, lazy-evaluated, running on Apple silicon’s unified memory. Used for training and prototyping; deploy via Core ML for production.
- For 2026 iOS apps: Core ML is the path when you need ANE; ExecuTorch (with Core ML delegate) is the path when authoring fluency wins; llama.cpp is the path when binary size and ANE-independence matter.
Why this matters
Apple ships AI to >2 billion devices. iOS Intelligence (rolled out 2024–2025) runs entirely on Core ML + ANE for the on-device portion. Every iOS app that wants to run a model at the lowest power and the least visible thermal cost goes through this stack. Knowing Core ML is non-optional for iOS mobile-AI work, and the ANE programming model — the constraints, the tooling, what it accelerates — is not something experience with the rest of the industry’s stacks prepares you for.
Mental model
The conversion is offline; the dispatch is per-op at runtime; the developer’s lever is the compute unit selection (.cpuOnly, .cpuAndGPU, .all).
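To make that lever concrete: the same selection exists on the Python side when loading a model for testing on a Mac. A minimal sketch (the file name is a placeholder; ct.ComputeUnit is the coremltools enum):

import coremltools as ct

# The same .mlpackage under three dispatch policies. compute_units here
# controls this Python session's dispatch; in the shipped app, the Swift
# MLModelConfiguration makes the same choice at load time.
for unit in (ct.ComputeUnit.CPU_ONLY,     # debugging: deterministic, slow
             ct.ComputeUnit.CPU_AND_GPU,  # skip the ANE entirely
             ct.ComputeUnit.ALL):         # let Core ML pick per-op
    model = ct.models.MLModel("my_model.mlpackage", compute_units=unit)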
Convert a PyTorch model
The conversion is offline Python — author surface, runs on your dev machine, never on the device.
import torch
import coremltools as ct
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.linear = torch.nn.Linear(768, 768)
def forward(self, x):
return torch.relu(self.linear(x))
m = M().eval()
example_input = torch.randn(1, 768)
traced = torch.jit.trace(m, example_input) # or torch.export for newer flow
mlmodel = ct.convert(
traced,
inputs=[ct.TensorType(name="x", shape=example_input.shape)],
convert_to="mlprogram", # .mlpackage format
compute_precision=ct.precision.FLOAT16, # ANE prefers FP16
minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("my_model.mlpackage")

Things to know:
- convert_to="mlprogram" produces the modern format that supports the ANE properly. The older "neuralnetwork" format is legacy.
- FP16 is what the ANE wants. Native FP32 mostly falls back to the GPU. INT8 / palettized weights work, with caveats.
- minimum_deployment_target matters because newer iOS versions add ops; older models can run on newer targets, but not vice versa.
mlmodelc — the device-side compile
.mlpackage is the source-of-truth artifact. Before runtime it gets compiled into .mlmodelc (Core ML compiled) — a directory of binary blobs optimized for the specific chip the user’s device has. Xcode does this for you when you build the app:
- For an iPhone 15 Pro target: .mlmodelc includes ANE-optimized weight layouts.
- For a Mac M3 target: it includes an M3-tuned compute graph.
- For an older iPhone SE: it falls back to a CPU-friendly layout.
A single .mlpackage produces multiple .mlmodelc variants — one per supported architecture — bundled into the app. iOS picks the right one at install time.
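On a Mac you can trigger that compile step yourself, either with xcrun coremlcompiler compile Model.mlpackage OutDir or from Python. A hedged sketch, assuming coremltools 7+ (get_compiled_model_path and CompiledMLModel are the 7.x APIs; macOS only):

import coremltools as ct

mlmodel = ct.models.MLModel("my_model.mlpackage")

# Compiling once and reloading the .mlmodelc skips the implicit recompile
# that loading the .mlpackage performs. The returned path is temporary;
# copy it somewhere persistent before reusing it across runs.
compiled_path = mlmodel.get_compiled_model_path()
compiled = ct.models.CompiledMLModel(compiled_path,
                                     compute_units=ct.ComputeUnit.ALL)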
Loading and running
The Swift API is the user surface — what your iOS app actually calls.
import CoreML
let model = try! MyModel(configuration: MLModelConfiguration())
let input = MyModelInput(x: try! MLMultiArray(shape: [1, 768], dataType: .float16))
let output = try! model.prediction(input: input)
// Or with manual compute-unit selection:
let config = MLModelConfiguration()
config.computeUnits = .all // try ANE first, fall back to GPU/CPU
let model2 = try! MyModel(configuration: config)

computeUnits = .all means “use whatever is fastest for this graph.” .cpuOnly is for testing; .cpuAndNeuralEngine forces ANE-or-CPU if you want to skip the GPU.
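One habit to adopt early: verify that the ANE path agrees numerically with the CPU path before shipping. A sketch of that parity check from the Python side on a Mac (the input name "x" comes from the conversion example above):

import numpy as np
import coremltools as ct

x = {"x": np.random.rand(1, 768).astype(np.float32)}

cpu = ct.models.MLModel("my_model.mlpackage",
                        compute_units=ct.ComputeUnit.CPU_ONLY)
fast = ct.models.MLModel("my_model.mlpackage",
                         compute_units=ct.ComputeUnit.ALL)

# Output names depend on the converted model, so grab the first value.
ref = list(cpu.predict(x).values())[0]
out = list(fast.predict(x).values())[0]
# FP16 on the ANE vs an FP32 CPU reference: expect ~1e-3 level agreement,
# not bit-exactness.
print(np.abs(ref - out).max())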
What runs on ANE (and what doesn’t)
The ANE is fast but picky. As of M4 / A18 generation:
Yes:
- Conv (1×1, 3×3, depthwise) at FP16
- Matmul at FP16 (and now also at INT4 with palettized weights)
- Pointwise (relu, gelu, sigmoid, layernorm at common shapes)
- Common transformer patterns (Q/K/V projections, attention with static cache)
No / falls back:
- Dynamic shapes (the ANE wants compile-time shapes)
- FP32 (mostly falls to GPU)
- Operations on unaligned memory layouts
- Some attention variants with unusual masking
- Certain reductions / softmax in specific shapes
Apple publishes a list of ANE-friendly op patterns; coremltools.optimize includes passes that rewrite eligible subgraphs to the ANE-friendly form.
The practical recipe: convert with compute_precision=ct.precision.FLOAT16, run with computeUnits = .all, then check Instruments’ Core ML profiler to see which ops actually landed on the ANE. If too many fall back, rewrite the model’s attention / softmax to use Core ML’s native KV-cache pattern (stateful models, added at WWDC 2024).
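When the fallback culprit is dynamic shapes, coremltools offers a middle ground: enumerate a fixed menu of shapes rather than declaring a dynamic dimension. Enumerated shapes can stay ANE-eligible where fully dynamic (RangeDim) inputs tend not to. A sketch, reusing the traced module from the conversion example:

import coremltools as ct

# A fixed menu of batch sizes. Each variant compiles with static shapes,
# which is what the ANE wants.
shapes = ct.EnumeratedShapes(shapes=[[1, 768], [4, 768], [8, 768]],
                             default=[1, 768])
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=shapes)],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,
    minimum_deployment_target=ct.target.iOS17,
)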
Palettization — Apple’s quantization
Core ML supports palettization: instead of storing weights as INT4/INT8, store them as lookup-table indices. Each tensor has a small palette (e.g., 16 unique FP16 values), and weights are 4-bit indices into the palette.
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

# 4-bit k-means palettization applied to every weight tensor
op_config = OpPalettizerConfig(mode="kmeans", nbits=4)
config = OptimizationConfig(global_config=op_config)
palettized = palettize_weights(mlmodel, config)

Why palettization vs INT4? The ANE has hardware support for LUT lookups during convolution. Palettized weights run natively on the ANE; INT4 weights must dequantize to FP16 in software. So for ANE-targeted models: palettize, don’t quantize.
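The size arithmetic, for intuition. For the toy 768×768 linear layer from the conversion example, assuming a single 16-entry FP16 palette for the whole tensor (real configs often use one LUT per channel or per block, which shrinks the win slightly):

n = 768 * 768                   # weights in one linear layer
fp16_bytes = n * 2              # 1,179,648 bytes as plain FP16
pal_bytes = n // 2 + 16 * 2     # 294,944 bytes: 4-bit indices + the LUT
print(fp16_bytes / pal_bytes)   # ~4.0x smaller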
MLX — the research framework
MLX is Apple’s NumPy-flavored framework, optimized for Apple Silicon. Runs on unified memory (CPU and GPU share the same RAM, no copies). Lazy evaluation, similar to JAX. Used for research-grade work and as a faster prototyping path than PyTorch on Apple hardware.
import mlx.core as mx
import mlx.nn as nn
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.linear = nn.Linear(768, 768)
def __call__(self, x):
return mx.maximum(self.linear(x), 0)
model = MLP()
x = mx.random.normal((1, 768))
y = model(x) # lazy
mx.eval(y)  # forces evaluation

For deployment to iOS production, you typically train in MLX (or PyTorch), convert via coremltools, and ship as .mlpackage. MLX is the Apple-research-stack equivalent of JAX, not a deployment runtime.
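There is no direct MLX → Core ML converter. One workable bridge (a sketch, not an official path) is to dump the MLX weights to safetensors, load them into the matching PyTorch module, and convert as before; it works here because the MLP above uses the same layer names and weight layouts in both frameworks:

from mlx.utils import tree_flatten
import mlx.core as mx

# Flatten MLX params to {"linear.weight": ..., "linear.bias": ...}
mx.save_safetensors("mlp.safetensors", dict(tree_flatten(model.parameters())))

from safetensors.torch import load_file

pt_model = M().eval()                          # the PyTorch module from earlier
pt_model.load_state_dict(load_file("mlp.safetensors"))
# ...then torch.jit.trace + ct.convert exactly as in the conversion example.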
iOS Intelligence — what’s running
Apple’s on-device Intelligence (Apple Intelligence, 2024+) runs:
- Generalized rewriting / summarization: a ~3B distilled model on ANE.
- Image generation (Genmoji, Image Playground): on-device diffusion models.
- Notification summaries: small LM on ANE.
- Larger models route through Private Cloud Compute (Apple’s cloud side) when on-device isn’t enough.
The on-device portion uses Core ML + ANE end-to-end. Reading the WWDC sessions on this is unusually instructive for any production edge-AI work.
Run it in your browser — model-size simulator
The shape — INT4-palettized 3B as the iOS-app-bundleable sweet spot — matches Apple’s own intelligence recipe for the ~3B models that ship in the OS.
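In lieu of the interactive widget, the back-of-envelope version is one line of arithmetic per bit width (weights only; real artifacts add LUTs, embedding tables, and metadata):

def weight_gb(params_billions: float, bits: int) -> float:
    # params * bits/8 bytes, reported in GB
    return params_billions * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"3B at {bits}-bit: {weight_gb(3, bits):.1f} GB")  # 6.0 / 3.0 / 1.5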
Key takeaways
- Core ML = Apple’s on-device ML framework. .mlpackage source, .mlmodelc device-compiled; runs on CPU/GPU/ANE.
- The ANE is Apple’s NPU — fast, picky. Wants FP16, static shapes, ANE-friendly patterns.
- Palettization, not generic INT4, for ANE-targeted models. Hardware LUT lookups.
- MLX is the research framework, Core ML is the deployment path. Same as JAX vs IREE in spirit.
- iOS Intelligence runs on this stack. Reading the WWDC sessions is the highest-signal preparation.
Go deeper
- Docs: Apple Developer — Core ML. Authoritative API reference. The "Run a model" and "Optimize a Core ML model" guides are the right starting point.
- Docs: coremltools Documentation. The conversion toolkit. The 2024+ docs cover torch.export-shaped graphs and palettization.
- Docs: Apple — Core ML Hub. Sample apps, guides, the "deploy a transformer" tutorial.
- Video: WWDC 2024 — On-Device ML Sessions. Apple's own talks on iOS Intelligence, ANE optimization, and Core ML 2024 features. The single most useful video corpus for this lesson.
- Docs: MLX Documentation. Apple's research framework. The README + examples are enough to start prototyping.
- Blog: Apple — Introducing the Apple Foundation Models. Apple's system paper on the on-device + Private Cloud Compute architecture. Read for the production design.
- Repo: apple/ml-stable-diffusion. Apple's own SD reference. Best worked example of Core ML + ANE for a non-LLM model.