On-Device Inference
When you type `./llama-cli -m llama-3.2-3b-Q4_K_M.gguf -ngl 99 -p "Why is the sky blue?"` on a MacBook Air and a sane streaming answer comes back at 35–50 tokens per second, the surprising thing is what isn’t happening: no datacenter call, no API key, no privacy lawyer involved, no per-token bill. The same Llama-3 weights that vLLM serves on an H100 in the cloud are running on an M3 chip with 8 GB of unified memory, on battery, on coffee-shop Wi-Fi. The runtime is llama.cpp. The file is a quantized GGUF. The `-ngl 99` flag offloads every layer to Apple Metal. The full open-source on-device stack is small enough that one GitHub repo is the whole thing.
What makes this work is the convergence of three things: 4-bit quantization that barely costs accuracy (a Q4_K_M 8B fits in ~5 GB and loses well under one MMLU point), K-quants (mixed precision per row, no calibration data needed), and Apple/ARM kernels that fuse dequant directly into the matmul. Decode is bandwidth-bound on phone hardware: the GPU is plenty fast; the bottleneck is reading 5 GB of weights from RAM at every token, which is exactly why halving the bytes per weight roughly doubles the tokens per second. By 2026 the same recipe runs everywhere: llama.cpp for cross-platform, ExecuTorch for the PyTorch-native iOS/Android path, MLX on Apple Silicon, NPUs on the new flagships. This lesson is the picker.
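That bandwidth claim reduces to one line of arithmetic: decode speed is capped at memory bandwidth divided by bytes read per token, and for dense decoding the bytes per token are roughly the whole weight file. A minimal sketch; the 100 GB/s figure is an illustrative assumption, not any particular device's spec:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed for a dense model: every generated
    token must stream the full weight file through the memory bus."""
    return bandwidth_gb_s / model_size_gb

# Hypothetical 100 GB/s device (assumption for illustration):
fp16_8b = decode_ceiling_tok_s(100, 16.0)  # 8B @ 2 bytes/weight -> 16 GB
q4_8b = decode_ceiling_tok_s(100, 5.0)     # 8B @ Q4_K_M -> ~5 GB

print(f"FP16 ceiling: {fp16_8b:.1f} tok/s")  # 6.2 tok/s
print(f"Q4 ceiling:   {q4_8b:.1f} tok/s")    # 20.0 tok/s
```

Halving the bytes per weight doubles the ceiling, which is the entire decode-side case for quantization.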
TL;DR
- llama.cpp is the universal runtime: CPU, Apple Metal, CUDA, Vulkan. GGUF format. If you don’t know what to run, run llama.cpp.
- MLX (Apple) uses unified memory on M-series chips — fastest path on a Mac. PyTorch-shaped Python API.
- ExecuTorch is PyTorch’s mobile/edge runtime; it produces `.pte` files for Android (NNAPI / Vulkan) and iOS (Core ML / MPS).
- A 4-bit Q4_K_M quantized 8B model fits in ~5 GB and runs at 5–15 tokens/sec on a modern phone. Genuinely usable.
- The unlock is K-quants (mixed precision per row) and Apple/ARM kernels that fuse dequant + matmul.
Mental model
The convergence point is always quantize, then ship a single binary file. The runtime decides how to use it.
Concrete walkthrough — running Llama-3.2-3B-Instruct on a phone
Step 1: get the model. (Skip if you’ve downloaded a GGUF directly from a community model repo.)
```shell
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make -j  # or `cmake -B build -DGGML_METAL=on && cmake --build build` on Mac

# Convert HF safetensors → GGUF (BF16)
python convert_hf_to_gguf.py /path/to/Llama-3.2-3B-Instruct \
    --outfile llama-3.2-3b-bf16.gguf

# Quantize: BF16 → Q4_K_M (~2.0 GB, sweet spot for quality/size)
./llama-quantize llama-3.2-3b-bf16.gguf llama-3.2-3b-Q4_K_M.gguf Q4_K_M
```

Step 2: run it.

```shell
# CPU / generic
./llama-cli -m llama-3.2-3b-Q4_K_M.gguf -p "Why is the sky blue?" -n 128

# Apple Metal (M-series)
./llama-cli -m llama-3.2-3b-Q4_K_M.gguf -ngl 99 -p "Why is the sky blue?"
```

Step 3: ship to a phone. Two paths:

- Android: build llama.cpp for ARM64, ship the binary in your app, and drop the GGUF in app storage. There are pre-built JNI wrappers (e.g., `llama-jni`, the `llama.cpp-android` examples in the repo’s `examples/llama.android/`).
- iOS: same, but build with `-DGGML_METAL=on`. The `examples/llama.swiftui/` example ships a working SwiftUI chat app you can compile in Xcode.
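Before wiring the quantized file into an app, it can be worth sanity-checking that conversion actually produced a GGUF. Per the published GGUF spec, the file opens with the four ASCII magic bytes `GGUF` followed by a little-endian `uint32` format version. This stdlib-only checker is a minimal sketch, not a full parser:

```python
import struct

def check_gguf_header(path: str) -> int:
    """Return the GGUF format version, or raise if the file isn't GGUF.
    The format starts with the 4 magic bytes b"GGUF" and a little-endian
    uint32 version; everything after that is metadata KVs and tensors."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version

# version = check_gguf_header("llama-3.2-3b-Q4_K_M.gguf")
```

A truncated download or a conversion that silently wrote safetensors will fail here immediately, before you waste a device-deploy cycle.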
Real numbers
On consumer hardware (Q4_K_M, 8B model, prompt eval + decode):
| Device | Decode tok/s | Notes |
|---|---|---|
| iPhone 15 Pro (A17 Pro) | 12–18 | Metal backend; warm thermals matter |
| Pixel 8 Pro (Tensor G3) | 8–12 | CPU only; GPU compute uneven |
| MacBook Air M3 | 35–50 | Metal; 8GB unified memory tight |
| Raspberry Pi 5 | 2–4 | NEON CPU only; usable for tiny models |
| RTX 4090 (desktop CUDA) | 130–200 | reference |
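A quick way to sanity-check benchmark numbers like these: sustained decode tok/s times the weight-file size gives the effective memory bandwidth the run achieved, and that should sit at or below the SoC's spec-sheet figure. A sketch; the 12 tok/s and 5 GB inputs are illustrative, not measurements:

```python
def implied_bandwidth_gb_s(tok_s: float, model_size_gb: float) -> float:
    """Effective memory bandwidth implied by a decode benchmark:
    each generated token streams roughly the whole weight file."""
    return tok_s * model_size_gb

# 12 tok/s on a ~5 GB Q4_K_M 8B file implies ~60 GB/s sustained:
print(implied_bandwidth_gb_s(12, 5.0))  # 60.0
```

If the implied figure lands above the device's rated bandwidth, distrust the benchmark: wrong model size, a partially cached run, or peak rather than sustained numbers.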
Run it in your browser — what fits on your phone
You can’t run a 3B model in the browser yet, but you can run a tiny one. What you can work out right now is how big a model your specific phone can handle.
Rule of thumb: you need ~params_B × 0.6 GB for a Q4_K_M model plus ~1 GB headroom. An 8 GB phone can run up to ~7B; 12 GB can run 13B; 70B is desktop-only without aggressive quantization.
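The rule of thumb above is mechanical enough to compute. A sketch; the 0.6 GB-per-billion-parameters and ~1 GB headroom constants come straight from the rule, not from measurement:

```python
def q4_footprint_gb(params_b: float) -> float:
    """Approximate RAM for a Q4_K_M model: ~0.6 GB per billion
    parameters for weights, plus ~1 GB of headroom for the KV cache
    and runtime overhead (per the rule of thumb)."""
    return params_b * 0.6 + 1.0

def fits(params_b: float, device_ram_gb: float) -> bool:
    """True if a Q4_K_M model of this size fits in the given RAM."""
    return q4_footprint_gb(params_b) <= device_ram_gb

for p in (3, 7, 13, 70):
    print(f"{p:>2}B -> needs ~{q4_footprint_gb(p):.1f} GB")
```

On a real phone, subtract a few GB for the OS before comparing, which is why an 8 GB phone tops out around 7B even though the raw footprint is only ~5 GB.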
Key takeaways
- Quantize aggressively. Q4_K_M is the standard sweet spot; Q5_K_M if quality matters more than size; Q3_K_S for the absolute smallest viable run.
- Pick a runtime by your platform. Apple → MLX or llama.cpp+Metal. Android → llama.cpp+Vulkan or ExecuTorch+NNAPI. Cross-platform → llama.cpp.
- Memory bandwidth, not compute, is the bottleneck. On phone hardware, decode speed is set by how fast you can read the weights — which is exactly why quantization works.
- Test on the real device. Phone thermals throttle hard after 30 seconds — sustained tok/s is often half of peak.
Go deeper
- Repo: `ggerganov/llama.cpp`. The reference. Read the README and the `examples/` folder.
- Repo: `ml-explore/mlx-examples`. Apple's MLX with worked examples for Llama, Mistral, Whisper.
- Docs: ExecuTorch documentation. PyTorch's mobile/edge runtime. The path of least resistance for production Android/iOS.
- Blog: GGUF format on Hugging Face. What's actually in a GGUF file. Useful when debugging conversion issues.
- Video: Karpathy, "Let's build GPT". For why the math reduces to weight-streaming and why quantization wins.
Why this matters
Cloud inference has hard limits: latency, privacy, cost, and no offline use. On-device flips them all. A 2026 phone can run an 8B model that, 18 months earlier, only ran in a datacenter. The open-source stack is good enough that anyone can ship a private offline AI assistant — if they know which runtime to pick and how to quantize.