On-Device Inference
When you type `./llama-cli -m llama-3.2-3b-Q4_K_M.gguf -ngl 99 -p "Why is the sky blue?"` on a MacBook Air and a sane streaming answer comes back at 35–50 tokens per second, the surprising thing is what isn’t happening: no datacenter call, no API key, no privacy lawyer involved, no per-token bill. The same Llama-3 weights that vLLM serves on an H100 in the cloud are running on an M3 chip with 8 GB of unified memory, on battery, on coffee-shop Wi-Fi. The runtime is llama.cpp. The file is a quantized GGUF. The `-ngl 99` flag offloads every layer to Apple Metal. The full open-source on-device stack is small enough that one GitHub repo is the whole thing.
What makes this work is the convergence of three things: 4-bit quantization that barely costs accuracy (a Q4_K_M 8B fits in ~5 GB and loses well under one MMLU point), K-quants (mixed precision per row, no calibration data needed), and Apple/ARM kernels that fuse dequant directly into the matmul. Decode is bandwidth-bound on phone hardware: the GPU is plenty fast; the bottleneck is reading 5 GB of weights from RAM at every token, which is exactly why halving the bytes per weight roughly doubles the tokens per second. By 2026 the same recipe runs everywhere: llama.cpp for cross-platform, ExecuTorch for the PyTorch-native iOS/Android path, MLX on Apple Silicon, NPUs on the new flagships. This lesson is the picker.
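That bandwidth claim reduces to one line of arithmetic: decode speed is capped at memory bandwidth divided by bytes read per token, and for dense decoding the bytes per token are roughly the whole weight file. A minimal sketch; the 100 GB/s figure is an illustrative assumption, not any particular device's spec:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed for a dense model: every generated
    token must stream the full weight file through the memory bus."""
    return bandwidth_gb_s / model_size_gb

# Hypothetical 100 GB/s device (assumption for illustration):
fp16_8b = decode_ceiling_tok_s(100, 16.0)  # 8B @ 2 bytes/weight -> 16 GB
q4_8b = decode_ceiling_tok_s(100, 5.0)     # 8B @ Q4_K_M -> ~5 GB

print(f"FP16 ceiling: {fp16_8b:.1f} tok/s")  # 6.2 tok/s
print(f"Q4 ceiling:   {q4_8b:.1f} tok/s")    # 20.0 tok/s
```

Halving the bytes per weight doubles the ceiling, which is the entire decode-side case for quantization.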
TL;DR
- llama.cpp is the universal runtime: CPU, Apple Metal, CUDA, Vulkan. GGUF format. If you don’t know what to run, run llama.cpp.
- MLX (Apple) uses unified memory on M-series chips — fastest path on a Mac. PyTorch-shaped Python API.
- ExecuTorch is PyTorch’s mobile/edge runtime; it produces `.pte` files for Android (NNAPI / Vulkan) and iOS (Core ML / MPS).
- A 4-bit Q4_K_M quantized 8B model fits in ~5 GB and runs at 5–15 tokens/sec on a modern phone. Genuinely usable.
- The unlock is K-quants (mixed precision per row) and Apple/ARM kernels that fuse dequant + matmul.
Mental model
The convergence point is always quantize, then ship a single binary file. The runtime decides how to use it.
Concrete walkthrough — running Llama-3.2-3B-Instruct on a phone
Step 1: get the model. (Skip if you’ve downloaded a GGUF directly from a community model repo.)
```shell
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make -j  # or `cmake -B build -DGGML_METAL=on && cmake --build build` on Mac

# Convert HF safetensors → GGUF (BF16)
python convert_hf_to_gguf.py /path/to/Llama-3.2-3B-Instruct \
    --outfile llama-3.2-3b-bf16.gguf

# Quantize: BF16 → Q4_K_M (~2.0 GB, sweet spot for quality/size)
./llama-quantize llama-3.2-3b-bf16.gguf llama-3.2-3b-Q4_K_M.gguf Q4_K_M
```

Step 2: run it.

```shell
# CPU / generic
./llama-cli -m llama-3.2-3b-Q4_K_M.gguf -p "Why is the sky blue?" -n 128

# Apple Metal (M-series)
./llama-cli -m llama-3.2-3b-Q4_K_M.gguf -ngl 99 -p "Why is the sky blue?"
```

Step 3: ship to a phone. Two paths:

- Android: build llama.cpp for ARM64, ship the binary in your app, and drop the GGUF in app storage. There are pre-built JNI wrappers (e.g., `llama-jni`, the `llama.cpp-android` examples in the repo’s `examples/llama.android/`).
- iOS: same, but build with `-DGGML_METAL=on`. The `examples/llama.swiftui/` example ships a working SwiftUI chat app you can compile in Xcode.
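Before wiring the quantized file into an app, it can be worth sanity-checking that conversion actually produced a GGUF. Per the published GGUF spec, the file opens with the four ASCII magic bytes `GGUF` followed by a little-endian `uint32` format version. This stdlib-only checker is a minimal sketch, not a full parser:

```python
import struct

def check_gguf_header(path: str) -> int:
    """Return the GGUF format version, or raise if the file isn't GGUF.
    The format starts with the 4 magic bytes b"GGUF" and a little-endian
    uint32 version; everything after that is metadata KVs and tensors."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version

# version = check_gguf_header("llama-3.2-3b-Q4_K_M.gguf")
```

A truncated download or a conversion that silently wrote safetensors will fail here immediately, before you waste a device-deploy cycle.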
Real numbers
On consumer hardware (Q4_K_M, 8B model, prompt eval + decode):
| Device | Decode tok/s | Notes |
|---|---|---|
| iPhone 15 Pro (A17 Pro) | 12–18 | Metal backend; warm thermals matter |
| Pixel 8 Pro (Tensor G3) | 8–12 | CPU only; GPU compute uneven |
| MacBook Air M3 | 35–50 | Metal; 8GB unified memory tight |
| Raspberry Pi 5 | 2–4 | NEON CPU only; usable for tiny models |
| RTX 4090 (desktop CUDA) | 130–200 | reference |
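A quick way to sanity-check benchmark numbers like these: sustained decode tok/s times the weight-file size gives the effective memory bandwidth the run achieved, and that should sit at or below the SoC's spec-sheet figure. A sketch; the 12 tok/s and 5 GB inputs are illustrative, not measurements:

```python
def implied_bandwidth_gb_s(tok_s: float, model_size_gb: float) -> float:
    """Effective memory bandwidth implied by a decode benchmark:
    each generated token streams roughly the whole weight file."""
    return tok_s * model_size_gb

# 12 tok/s on a ~5 GB Q4_K_M 8B file implies ~60 GB/s sustained:
print(implied_bandwidth_gb_s(12, 5.0))  # 60.0
```

If the implied figure lands above the device's rated bandwidth, distrust the benchmark: wrong model size, a partially cached run, or peak rather than sustained numbers.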
Run it in your browser — what fits on your phone
You can’t run a 3B model in the browser yet, but you can run a tiny one. What you can work out right now is how big a model your specific phone can handle.
Rule of thumb: you need ~params_B × 0.6 GB for a Q4_K_M model plus ~1 GB headroom. An 8 GB phone can run up to ~7B; 12 GB can run 13B; 70B is desktop-only without aggressive quantization.
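The rule of thumb above is mechanical enough to compute. A sketch; the 0.6 GB-per-billion-parameters and ~1 GB headroom constants come straight from the rule, not from measurement:

```python
def q4_footprint_gb(params_b: float) -> float:
    """Approximate RAM for a Q4_K_M model: ~0.6 GB per billion
    parameters for weights, plus ~1 GB of headroom for the KV cache
    and runtime overhead (per the rule of thumb)."""
    return params_b * 0.6 + 1.0

def fits(params_b: float, device_ram_gb: float) -> bool:
    """True if a Q4_K_M model of this size fits in the given RAM."""
    return q4_footprint_gb(params_b) <= device_ram_gb

for p in (3, 7, 13, 70):
    print(f"{p:>2}B -> needs ~{q4_footprint_gb(p):.1f} GB")
```

On a real phone, subtract a few GB for the OS before comparing, which is why an 8 GB phone tops out around 7B even though the raw footprint is only ~5 GB.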
Key takeaways
- Quantize aggressively. Q4_K_M is the standard sweet spot; Q5_K_M if quality matters more than size; Q3_K_S for the absolute smallest viable run.
- Pick a runtime by your platform. Apple → MLX or llama.cpp+Metal. Android → llama.cpp+Vulkan or ExecuTorch+NNAPI. Cross-platform → llama.cpp.
- Memory bandwidth, not compute, is the bottleneck. On phone hardware, decode speed is set by how fast you can read the weights — which is exactly why quantization works.
- Test on the real device. Phone thermals throttle hard after 30 seconds — sustained tok/s is often half of peak.
Go deeper
- Repo: `ggerganov/llama.cpp`. The reference. Read the README and the `examples/` folder.
- Repo: `ml-explore/mlx-examples`. Apple's MLX with worked examples for Llama, Mistral, Whisper.
- Docs: ExecuTorch documentation. PyTorch's mobile/edge runtime. The path of least resistance for production Android/iOS.
- Blog: GGUF format on Hugging Face. What's actually in a GGUF file. Useful when debugging conversion issues.
- Video: Karpathy, "Let's build GPT". For why the math reduces to weight-streaming and why quantization wins.
Why this matters
Cloud inference has hard limits: latency, privacy, cost, and no offline use. On-device flips them all. A 2026 phone can run an 8B model that, 18 months earlier, only ran in a datacenter. The open-source stack is good enough that anyone can ship a private offline AI assistant — if they know which runtime to pick and how to quantize.