On-Device Inference

When you type ./llama-cli -m llama-3.2-3b-Q4_K_M.gguf -ngl 99 -p "Why is the sky blue?" on a MacBook Air and a sane streaming answer comes back at 35–50 tokens per second, the surprising thing is what isn’t happening: no datacenter call, no API key, no privacy lawyer involved, no per-token bill. The same Llama-3 weights that vLLM serves on an H100 in the cloud are running on an M3 chip with 8 GB of unified memory, on battery, on coffee-shop Wi-Fi. The runtime is llama.cpp. The file is a quantized GGUF. The -ngl 99 flag offloads every layer to Apple Metal. The full open-source on-device stack is small enough that one GitHub repo is the whole thing.

What makes this work is the convergence of three things: 4-bit quantization that barely costs quality (a Q4_K_M 8B fits in ~5 GB and loses well under one MMLU point), K-quants (mixed precision per row, no calibration data needed), and Apple/ARM kernels that fuse dequant directly into the matmul. Decode is bandwidth-bound on phone hardware — the GPU is plenty fast; the bottleneck is reading 5 GB of weights from RAM at every token — which is exactly why halving the bytes per weight roughly doubles the tokens per second. By 2026 the same recipe runs everywhere: llama.cpp for cross-platform, ExecuTorch for the PyTorch-native iOS/Android path, MLX on Apple Silicon, NPUs on the new flagships. This lesson is the picker.
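That bandwidth argument is easy to sanity-check with arithmetic. A back-of-the-envelope sketch in Python; the ~50 GB/s phone-class bandwidth figure is an assumption, not a measurement:

# Decode speed for a memory-bandwidth-bound model: every generated token
# streams all weight bytes through memory, so tok/s ≈ bandwidth / model bytes.

def decode_tok_s(params_billions: float, bytes_per_weight: float,
                 bandwidth_gb_per_s: float) -> float:
    model_gb = params_billions * bytes_per_weight   # bytes read per token, in GB
    return bandwidth_gb_per_s / model_gb

# 8B model on a phone-class SoC with ~50 GB/s of usable bandwidth (assumed):
print(decode_tok_s(8, 2.0, 50))   # FP16, ~2 bytes/weight      -> ~3 tok/s
print(decode_tok_s(8, 1.1, 50))   # Q8_0, ~1.1 bytes/weight    -> ~6 tok/s
print(decode_tok_s(8, 0.6, 50))   # Q4_K_M, ~0.6 bytes/weight  -> ~10 tok/s

Halve the bytes per weight and the estimated tokens per second roughly doubles, which is the whole case for aggressive quantization on phones.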

TL;DR

  • llama.cpp is the universal runtime: CPU, Apple Metal, CUDA, Vulkan. GGUF format. If you don’t know what to run, run llama.cpp.
  • MLX (Apple) uses unified memory on M-series chips — fastest path on a Mac. PyTorch-shaped Python API.
  • ExecuTorch is PyTorch’s mobile/edge runtime; produces .pte files for Android (NNAPI / Vulkan) and iOS (Core ML / MPS).
  • A 4-bit Q4_K_M quantized 8B model fits in ~5 GB and runs at 5–15 tokens/sec on a modern phone. Genuinely usable.
  • The unlock is K-quants (mixed precision per row) and Apple/ARM kernels that fuse dequant + matmul.

Why this matters

Cloud inference has hard limits: latency, privacy, cost, and no offline mode. On-device flips all of them. A 2026 phone can run an 8B model that, 18 months earlier, only ran in a datacenter. The open-source stack is good enough that anyone can ship a private offline AI assistant — if they know which runtime to pick and how to quantize.

Mental model

The convergence point is always quantize, then ship a single binary file. The runtime decides how to use it.

Concrete walkthrough — running Llama-3.2-3B-Instruct on a phone

Step 1: get the model. (Skip if you’ve downloaded a GGUF directly from a community model repo.)

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make -j   # or `cmake -B build -DGGML_METAL=on && cmake --build build` on Mac

# Convert HF safetensors → GGUF (BF16)
python convert_hf_to_gguf.py /path/to/Llama-3.2-3B-Instruct \
    --outfile llama-3.2-3b-bf16.gguf

# Quantize: BF16 → Q4_K_M (~2.0 GB, sweet spot for quality/size)
./llama-quantize llama-3.2-3b-bf16.gguf llama-3.2-3b-Q4_K_M.gguf Q4_K_M
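If you want to double-check what the quantizer produced before shipping it, the gguf Python package bundled with llama.cpp (gguf-py) can read the file's tensor metadata. A sketch; exact attribute names may vary by version:

# Sanity-check the quantized file: size on disk and per-tensor quant types.
import os
from gguf import GGUFReader

path = "llama-3.2-3b-Q4_K_M.gguf"
print(f"file size: {os.path.getsize(path) / 1e9:.2f} GB")  # expect roughly 2 GB

reader = GGUFReader(path)
for t in reader.tensors[:5]:
    # Most weight matrices should report Q4_K / Q6_K; embeddings and
    # norms are usually kept at higher precision.
    print(t.name, t.tensor_type.name, list(t.shape))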

Step 2: run it.

# CPU / generic
./llama-cli -m llama-3.2-3b-Q4_K_M.gguf -p "Why is the sky blue?" -n 128

# Apple Metal (M-series)
./llama-cli -m llama-3.2-3b-Q4_K_M.gguf -ngl 99 -p "Why is the sky blue?"
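If you'd rather drive the same runtime from Python than from the CLI, the llama-cpp-python bindings wrap it directly. A minimal sketch; the model path and prompt are placeholders:

# pip install llama-cpp-python
# n_gpu_layers=-1 offloads every layer, the Python equivalent of -ngl 99.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.2-3b-Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload everything to Metal/CUDA if available
    n_ctx=2048,        # context window; larger values cost more RAM for the KV cache
)

out = llm("Q: Why is the sky blue?\nA:", max_tokens=128)
print(out["choices"][0]["text"])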

Step 3: ship to a phone. Two paths:

  • Android: build llama.cpp for ARM64, ship the binary in your app, and drop the GGUF in app storage. There are pre-built JNI wrappers (e.g., llama-jni) and a working Android example in the repo’s examples/llama.android/.
  • iOS: same, but build with -DGGML_METAL=on. The examples/llama.swiftui/ example ships a working SwiftUI chat app you can compile in Xcode.

Real numbers

On consumer hardware (Q4_K_M, 8B model, prompt eval + decode):

Device                    | Decode tok/s | Notes
iPhone 15 Pro (A17 Pro)   | 12–18        | Metal backend; warm thermals matter
Pixel 8 Pro (Tensor G3)   | 8–12         | CPU only; GPU compute uneven
MacBook Air M3            | 35–50        | Metal; 8 GB unified memory is tight
Raspberry Pi 5            | 2–4          | NEON CPU only; usable for tiny models
RTX 4090 (desktop, CUDA)  | 130–200      | reference

Run it in your browser — what fits on your phone

You can’t run a 3B model in the browser yet — you can run a tiny one. This snippet computes how big a model your specific phone can handle.

Python (editable): plug in your device's RAM and see what fits. Ctrl+Enter to run.

Rule of thumb: you need ~params_B × 0.6 GB for a Q4_K_M model plus ~1 GB headroom. An 8 GB phone can run up to ~7B; 12 GB can run 13B; 70B is desktop-only without aggressive quantization.
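The interactive cell isn't reproduced here; a minimal offline sketch of the same calculation, using the rule of thumb above (the 3 GB OS reserve is an assumed figure, since phones keep several GB for the system):

# Largest model that fits at Q4_K_M: ~0.6 GB per billion params,
# plus ~1 GB headroom for the KV cache and runtime.

def largest_model_b(device_ram_gb: float, os_reserve_gb: float = 3.0,
                    gb_per_b_params: float = 0.6, headroom_gb: float = 1.0) -> float:
    usable = device_ram_gb - os_reserve_gb - headroom_gb
    return max(usable / gb_per_b_params, 0.0)

for ram in (6, 8, 12, 16):
    print(f"{ram:>2} GB device -> up to ~{largest_model_b(ram):.0f}B params at Q4_K_M")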

Quick check

You want to ship a private offline chat app on iPhones with 8 GB of RAM. Which is the most realistic plan for 2026?

Key takeaways

  1. Quantize aggressively. Q4_K_M is the standard sweet spot; Q5_K_M if quality matters more than size; Q3_K_S for the absolute smallest viable run.
  2. Pick a runtime by your platform. Apple → MLX or llama.cpp+Metal. Android → llama.cpp+Vulkan or ExecuTorch+NNAPI. Cross-platform → llama.cpp.
  3. Memory bandwidth, not compute, is the bottleneck. On phone hardware, decode speed is set by how fast you can read the weights — which is exactly why quantization works.
  4. Test on the real device. Phone thermals throttle hard after 30 seconds — sustained tok/s is often half of peak.

Go deeper
