On-Device Runtimes
Three runtimes ship 95% of on-device LLM workloads in 2026: llama.cpp (everywhere, GGUF), ExecuTorch (PyTorch-native mobile), Core ML (Apple Neural Engine). Each has a different sweet spot. This module walks all three end to end.
4 lessons · ~58 min total
Module capstone — build it
The same 3B model on phone via three different runtimes
Ship Llama-3.2-3B on iOS three ways — llama.cpp Metal, ExecuTorch + Core ML, native Core ML — and benchmark the gap.
Advanced · One focused weekend (~14 h, mostly Xcode setup) · Runs on your phone
A repo with three working iOS apps (or three branches), each running Llama-3.2-3B. Measure tokens/sec, memory footprint, install size, and thermal-throttle curve for each. The artifact is the side-by-side benchmark report — recruiters and product folks both reach for it.
Build it — step by step
01 Set up Xcode + a real device 60 min
Free Apple Developer account, target an iPhone you own. Get a hello-world SwiftUI app running on hardware (not simulator — the throttle curve only shows on real silicon).
checkpoint Your hello-world app is on your phone via Xcode build-and-run.
02 Path 1 — llama.cpp + Metal 180 min
Build llama.cpp's Metal-enabled XCFramework. Bundle the Q4_K_M Llama-3.2-3B GGUF. Wire prompt → tokens via the C API; measure tok/s on the real device. (See [llama.cpp Internals](./llama-cpp-internals); a desktop sanity-check sketch follows this step.)
checkpoint ~10–20 tok/s steady-state on iPhone 15 Pro for 3B Q4_K_M. **Verify Metal is actually running:** check the llama.cpp log for `ggml_metal_init: GPU name: Apple A17 Pro` (or your chip). If you see `using CPU backend`, Metal failed silently.
watch out Two pitfalls. (1) Metal init must happen on main thread before any inference call — check `ggml_metal_init()` returns non-null. (2) The Metal shader library has to be bundled as a resource; Xcode "Copy Bundle Resources" misses `.metallib` files unless explicitly added. If tok/s is suspiciously low (~2–3), Metal silently fell back to CPU.
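Before wiring the C API in Swift, it can save an Xcode round-trip to sanity-check the exact GGUF file on a Mac. A minimal sketch, assuming the `llama-cpp-python` bindings built with Metal support; the model path is hypothetical:

```python
# Sanity-check the Q4_K_M GGUF on a Mac before bundling it into the app.
# Assumes: pip-installed llama-cpp-python built with Metal support.
import time

from llama_cpp import Llama

llm = Llama(
    model_path="models/Llama-3.2-3B-Instruct-Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # offload every layer to Metal
    n_ctx=2048,
    verbose=True,     # prints the ggml_metal_init lines; confirm the GPU shows up
)

start = time.perf_counter()
out = llm("Explain KV caching in one paragraph.", max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```

Mac tok/s won't predict the phone's, but a clean run proves the file and quant level are sound before you commit ~2 GB to the app bundle.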
03 Path 2 — ExecuTorch + Core ML delegate 180 min
torch.export → to_edge → CoreML partitioner → .pte. Bundle the .pte; load via the ExecuTorch Swift API. The Core ML delegate runs ANE-eligible subgraphs on the Neural Engine. (See [ExecuTorch](./executorch); an export sketch follows this step.)
checkpoint Streams tokens; tok/s comparable to or better than llama.cpp. **Confirm ANE is in the mix:** open Instruments → Core ML template, look for "Neural Engine" in the compute-unit pie. If 100% goes to GPU/CPU, partitioning rejected the ANE.
watch out Three traps. (1) Quantization in ExecuTorch is via PT2E — register the quantizer *before* `to_edge`, not after. (2) ANE rejects ops with dynamic shapes; if your KV cache uses dynamic seq_len, the whole subgraph falls back to GPU. Use a static-cache export configuration. (3) The Core ML delegate is opt-in — if you forget `CoreMLPartitioner()` in `to_edge_transform_and_lower`, you get a generic CPU/Metal .pte that runs but never touches ANE.
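To make the pipeline concrete, a minimal sketch of the export sequence. `TinyBlock` and the output file name are stand-ins (the real Llama 3.2 export goes through ExecuTorch's llama example with a static KV cache), and the PT2E quantizer registration from trap (1) is omitted for brevity:

```python
# Pipeline shape only: torch.export -> partition for Core ML -> lower -> .pte.
import torch
from torch.export import export

from executorch.backends.apple.coreml.partition import CoreMLPartitioner
from executorch.exir import to_edge_transform_and_lower


class TinyBlock(torch.nn.Module):  # stand-in for the transformer
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.nn.functional.silu(self.proj(x))


example_inputs = (torch.randn(1, 64),)  # static shapes keep the subgraph ANE-eligible
exported = export(TinyBlock().eval(), example_inputs)

# Omitting the partitioner here is trap (3): the .pte runs, but never touches ANE.
edge = to_edge_transform_and_lower(exported, partitioner=[CoreMLPartitioner()])
et_program = edge.to_executorch()

with open("tiny_block_coreml.pte", "wb") as f:
    f.write(et_program.buffer)
```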
04 Path 3 — Native Core ML 180 min
Convert Llama-3.2-3B via coremltools to a `.mlpackage`; Xcode (15+) then compiles it to an ANE-aware `.mlmodelc` at build time. Embed in app, run via the Core ML Swift API. (See [Core ML & ANE](./coreml); a conversion sketch follows this step.)
checkpoint Native Core ML path produces tokens; ANE shows up in Instruments → Core ML compute-unit timeline.
watch out Four common Core ML conversion quirks: (1) `compute_precision=ct.precision.FLOAT16` is ANE-friendly; INT8 weight-only via `ct.optimize.coreml` is too — but mixed-precision attention often breaks. (2) Attention conversion is finicky; expect to manually wire a `static_kv_cache` shape into `inputs=[ct.TensorType(shape=...)]`. (3) ANE has a 16K-element-per-tensor soft cap; long-context configs silently fall back to GPU even when conversion succeeds. (4) On older devices (pre-A17), the ANE may genuinely be unavailable for your op set — Instruments will show 0% ANE and that is *expected*. Document this fallback rather than chasing it.
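A minimal conversion sketch with a tiny stand-in module so it runs end to end. Names, shapes, and the iOS 17 deployment target are illustrative assumptions; a real Llama conversion also needs the attention and KV-cache rewrites from quirk (2):

```python
# Trace a (stand-in) model and convert it to an FP16 .mlpackage.
import coremltools as ct
import numpy as np
import torch


class TinyLM(torch.nn.Module):  # stand-in so the snippet runs end to end
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(32000, 64)
        self.head = torch.nn.Linear(64, 32000)

    def forward(self, input_ids):
        return self.head(self.emb(input_ids))


example = torch.zeros(1, 128, dtype=torch.int32)  # static seq_len; no dynamic shapes
traced = torch.jit.trace(TinyLM().eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=example.shape, dtype=np.int32)],
    compute_precision=ct.precision.FLOAT16,  # quirk (1): the ANE-friendly default
    compute_units=ct.ComputeUnit.ALL,        # let Core ML schedule ANE/GPU/CPU
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("TinyLM.mlpackage")  # Xcode compiles this to .mlmodelc at build time
```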
05 Benchmark all three 90 min
Same prompt, same output length (256 tokens), 10 runs each. Plot tok/s, peak memory, install size, time-to-first-token, and ANE/GPU/CPU compute split (via Instruments). (A plotting sketch follows the checkpoint.)
checkpoint Side-by-side bar chart with all three paths labeled. Numbers look plausible (typically within 30% of one another).
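One way to produce the chart, assuming each app logs one line per run to a shared `results.csv` in a hypothetical `runtime,run,tok_per_s,ttft_ms,peak_mem_mb` format:

```python
# Aggregate 10 runs per runtime and plot mean tok/s with error bars.
import csv
from collections import defaultdict
from statistics import mean, stdev

import matplotlib.pyplot as plt

runs = defaultdict(list)
with open("results.csv") as f:
    for row in csv.DictReader(f):
        runs[row["runtime"]].append(float(row["tok_per_s"]))

labels = list(runs)                      # e.g. llama.cpp / ExecuTorch / Core ML
means = [mean(runs[k]) for k in labels]
errs = [stdev(runs[k]) for k in labels]  # n=10 per path, so show the spread

plt.bar(labels, means, yerr=errs, capsize=4)
plt.ylabel("tokens/sec (256-token decode, n=10)")
plt.title("Llama-3.2-3B on iPhone: runtime comparison")
plt.savefig("benchmark_tok_s.png", dpi=200)
```

Peak memory and time-to-first-token follow the same pattern with the other CSV columns; install size is one number per app, read off the archived build.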
06 README + push 60 min
Repo with three branches (or three subdirs); README explains the trade-offs of each runtime, embeds the benchmark plot, and recommends "use X if Y."
checkpoint A reader can pick a runtime for their constraint in 5 minutes by reading the README.
You walk away with
Three working LLM-on-phone implementations — the canonical edge AI engineer's portfolio piece
Felt knowledge of which runtime to reach for in which situation
Fluency with iOS profiling tools (Instruments power trace, ANE counters)
A benchmark methodology you can re-run on every new model and runtime version
Tools you'll use
llama.cpp + Metal
ExecuTorch + Core ML delegate
Core ML (coremltools conversion)
iPhone 15 Pro or 16 Pro for testing
Xcode Instruments for profiling