Serve & Ship
A model file isn’t a product. This module covers the full last mile: which inference server to pick, how to run a 4-bit model on your phone, the 4 levers that get you 5–10× cheaper inference, and the observability stack that tells you what users are actually doing.
4 lessons · ~59 min total
Module capstone — build it
A real chat app, running fully on your phone, offline
A native iOS or Android app with a 4-bit LLM inside, generating tokens at airplane-mode speed. No cloud.
Advanced · One focused weekend (~12 h) · Runs on your phone
A SwiftUI (iOS) or Kotlin Compose (Android) chat app embedding llama.cpp with a Q4_K_M Llama-3.2-3B GGUF in the app bundle. The artifact is the app + a benchmark report (tok/s under sustained load + thermal-throttling curve).
Build it — step by step
01 Build llama.cpp for your device 90 min
Clone llama.cpp. iOS: build the Metal-enabled XCFramework with the project's CMake setup (Metal is on by default for Apple targets; the `examples/llama.swiftui` sample shows the wiring). Android: NDK CMake build, optionally with the Vulkan backend (`-DGGML_VULKAN=ON`). Verify with the included `llama-cli` binary (formerly `main`) on a desktop first to confirm your build chain works.
checkpoint `./llama-cli -m model.gguf -p "hello"` produces output on the device or simulator.
watch out Vulkan is opt-in, and Metal can be silently dropped by a misconfigured toolchain; without a GPU backend you fall back to CPU and lose 5–10× perf. Check the CMake flags (`GGML_METAL`, `GGML_VULKAN`) carefully.
02 Quantize and bundle a 3B model 60 min
Download Llama-3.2-3B-Instruct, convert with llama.cpp's `convert_hf_to_gguf.py`, then quantize: `llama-quantize model-f16.gguf model-q4.gguf Q4_K_M`. The result is ~2 GB. Bundle it inside your app, or download it on first launch (see the path-resolution sketch below).
checkpoint You have a ~2 GB `model-q4.gguf`. Loading it in `llama-cli` on a Mac produces sensible chat output.
watch out The App Store allows IPAs up to 4 GB with no per-asset cap. Google Play caps the base package at 200 MB; use Play Asset Delivery if you bundle the model, or download it on first launch.
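Bundling versus first-launch download is mostly a path-resolution problem. Below is a minimal Swift sketch; the resource name matches the step above, but `ModelLocator` and the download URL parameter are illustrative, not anything llama.cpp provides.

```swift
import Foundation

// Resolve the GGUF path: prefer a copy already in Documents (downloaded on
// first launch), else fall back to a bundled resource.
enum ModelLocator {
    static let fileName = "model-q4.gguf"

    static func localModelURL() -> URL? {
        let docs = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
        let downloaded = docs.appendingPathComponent(fileName)
        if FileManager.default.fileExists(atPath: downloaded.path) {
            return downloaded
        }
        // Works on iOS if the ~2 GB file is bundled; on Android, resolve the
        // Play Asset Delivery pack path instead.
        return Bundle.main.url(forResource: "model-q4", withExtension: "gguf")
    }

    // First-launch download; `urlString` is a placeholder for your own host.
    // URLSession's download task streams to disk, so the file never sits in memory.
    static func downloadModel(from urlString: String) async throws -> URL {
        guard let url = URL(string: urlString) else { throw URLError(.badURL) }
        let (tmp, _) = try await URLSession.shared.download(from: url)
        let docs = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
        let dest = docs.appendingPathComponent(fileName)
        try? FileManager.default.removeItem(at: dest)
        try FileManager.default.moveItem(at: tmp, to: dest)
        return dest
    }
}
```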
03 Wire llama.cpp into the native UI 120 min
iOS: write a small Swift wrapper calling the C API (`llama_model_load_from_file`, `llama_decode`, `llama_sampler_sample`; exact names vary by llama.cpp version, so check the `llama.h` you built against). Android: same via JNI. Expose `streamTokens(prompt) -> AsyncSequence<Token>`; a sketch follows after this step.
checkpoint A unit test in your app project: feed a prompt, get a generator that yields tokens one at a time.
watch out On iOS, the Metal backend must be initialized on the main thread or it fails silently. Check `ggml_metal_init()` return.
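Here is a sketch of the iOS side of that wrapper, built on AsyncStream. It assumes the XCFramework is exposed to Swift as a module named `llama` and uses C API names from an early-2025 `llama.h` (`llama_model_load_from_file`, `llama_init_from_model`, the `llama_sampler_*` chain); the API has been renamed more than once, so verify each call against your checkout. The Kotlin side mirrors this through JNI.

```swift
import Foundation
import llama  // assumption: the XCFramework exposes llama.h as a module named `llama`

// Minimal streaming generator. Sketch only: single sequence, no chat template,
// no KV-cache reuse across turns, no cancellation, cleanup elided on some
// failure paths.
func streamTokens(prompt: String, modelPath: String, maxNew: Int32 = 256) -> AsyncStream<String> {
    AsyncStream { continuation in
        Task.detached {
            llama_backend_init()
            guard let model = llama_model_load_from_file(modelPath, llama_model_default_params())
            else { continuation.finish(); return }
            var cparams = llama_context_default_params()
            cparams.n_ctx = 2048
            guard let ctx = llama_init_from_model(model, cparams)
            else { llama_model_free(model); continuation.finish(); return }
            let vocab = llama_model_get_vocab(model)

            // Tokenize the prompt (add BOS, parse special tokens).
            var tokens = [llama_token](repeating: 0, count: 2048)
            let n = llama_tokenize(vocab, prompt, Int32(prompt.utf8.count),
                                   &tokens, Int32(tokens.count), true, true)
            guard n > 0 else { continuation.finish(); return }
            tokens.removeSubrange(Int(n)...)

            // Temperature sampling; swap in llama_sampler_init_greedy() when debugging.
            let smpl = llama_sampler_chain_init(llama_sampler_chain_default_params())
            llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.7))
            llama_sampler_chain_add(smpl, llama_sampler_init_dist(0xFFFFFFFF))

            var pending = tokens            // the prompt first, then one token at a time
            for _ in 0..<maxNew {
                let rc = pending.withUnsafeMutableBufferPointer { buf in
                    llama_decode(ctx, llama_batch_get_one(buf.baseAddress, Int32(buf.count)))
                }
                guard rc == 0 else { break }
                let tok = llama_sampler_sample(smpl, ctx, -1)
                if llama_vocab_is_eog(vocab, tok) { break }
                var piece = [CChar](repeating: 0, count: 128)
                let len = llama_token_to_piece(vocab, tok, &piece, Int32(piece.count), 0, false)
                if len > 0 {
                    // Caveat: a piece can end mid-UTF-8 sequence; real code buffers bytes.
                    let bytes = piece[..<Int(len)].map { UInt8(bitPattern: $0) }
                    continuation.yield(String(decoding: bytes, as: UTF8.self))
                }
                pending = [tok]
            }
            llama_sampler_free(smpl)
            llama_free(ctx)
            llama_model_free(model)
            continuation.finish()
        }
    }
}
```

Note the deliberate omissions: no chat template, the model reloads on every call, and multi-byte characters can split across token pieces. Step 05's harness assumes you fix the first two.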
04 Ship a chat UI 120 min
iOS: SwiftUI ChatView with a TextField plus a List of messages. Android: the Compose equivalent. Wire the streamTokens generator to update the UI per token. Add a system message and enforce user/assistant turn alternation via the model's chat template; a minimal SwiftUI sketch follows this step.
checkpoint Type "hello", see tokens stream into the UI in real time, no cloud.
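A minimal SwiftUI sketch of the wiring, assuming the `streamTokens` wrapper from step 03. Prompt templating is left out; run the user text through the Llama 3 chat template where the comment indicates.

```swift
import SwiftUI

struct Message: Identifiable {
    let id = UUID()
    let role: String   // "user" or "assistant"
    var text: String
}

struct ChatView: View {
    @State private var messages: [Message] = []
    @State private var input = ""
    let modelPath: String

    var body: some View {
        VStack {
            List(messages) { msg in
                Text(msg.text)
                    .frame(maxWidth: .infinity,
                           alignment: msg.role == "user" ? .trailing : .leading)
            }
            HStack {
                TextField("Message", text: $input)
                Button("Send") { send() }
            }
            .padding()
        }
    }

    private func send() {
        let prompt = input   // real code: apply the chat template (system + turns) here
        messages.append(Message(role: "user", text: prompt))
        messages.append(Message(role: "assistant", text: ""))
        input = ""
        Task { @MainActor in
            // Append each piece to the last (assistant) message as it streams in.
            for await piece in streamTokens(prompt: prompt, modelPath: modelPath) {
                messages[messages.count - 1].text += piece
            }
        }
    }
}
```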
05 Benchmark under sustained load 60 min
Generate 500 tokens, 10 times back to back (a harness sketch follows this step). Plot tokens/sec per run on an in-app chart. The phone will heat up; tokens/s typically drops 30–60% after the first 30 seconds. Note the steady-state rate.
checkpoint You have a chart showing the thermal-throttle curve. Steady-state on a recent iPhone Pro: ~10–20 tok/s for a 3B Q4_K_M.
watch out Benchmarks vary wildly between simulator (slow) and device (fast); always benchmark on hardware.
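A sketch of the harness, assuming you've refactored the step 03 wrapper into a session object that loads the model once (the `LlamaSession` protocol here is hypothetical); paying the 2 GB model-load cost inside each run would swamp the decode timings.

```swift
import Foundation

// Hypothetical session type: the step 03 wrapper refactored to load the model
// once and reuse it across generations.
protocol LlamaSession {
    func stream(prompt: String, maxNew: Int32) -> AsyncStream<String>
}

// 10 back-to-back runs of 500 tokens each; returns tok/s per run so the UI
// can chart the thermal-throttle curve (e.g. with Swift Charts).
func thermalBenchmark(session: LlamaSession) async -> [Double] {
    var tokensPerSecond: [Double] = []
    for run in 1...10 {
        var tokens = 0
        let start = Date()   // wall-clock is fine at multi-second granularity
        for await _ in session.stream(prompt: "Write a long story about a lighthouse.",
                                      maxNew: 500) {
            tokens += 1
        }
        let tps = Double(tokens) / Date().timeIntervalSince(start)
        tokensPerSecond.append(tps)
        print("run \(run): \(String(format: "%.1f", tps)) tok/s")
    }
    return tokensPerSecond
}
```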
06 Polish + ship 90 min
Add an airplane-mode test (the app must keep working offline), error handling (missing model file, out-of-memory), and settings (temperature, max tokens); a sketch of the error and settings types follows. Record a 30-second video of the app in airplane mode for the README.
checkpoint Side-loaded build runs end-to-end on your phone. Repo has an APK or .ipa download link, the source, the benchmark plot, and the demo video.
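A small sketch of those types, with hypothetical names; the failure cases mirror the ones listed above.

```swift
import Foundation

// Failure modes worth surfacing in the UI rather than crashing on.
enum ChatAppError: LocalizedError {
    case modelMissing          // bundle stripped, or first-launch download incomplete
    case modelLoadFailed       // corrupt file, or not enough free memory for a ~2 GB model
    case generationFailed(code: Int32)

    var errorDescription: String? {
        switch self {
        case .modelMissing:    return "Model file not found. Re-download it in Settings."
        case .modelLoadFailed: return "Couldn't load the model. Free up memory and retry."
        case .generationFailed(let code): return "Generation failed (llama_decode rc=\(code))."
        }
    }
}

// User-tunable knobs, easy to persist and thread into the sampler chain.
struct GenerationSettings: Codable {
    var temperature: Float = 0.7
    var maxTokens: Int32 = 256
}
```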
You walk away with
A working LLM chat app on your phone, fully offline, fully local
Fluency with llama.cpp's C API, GGUF quantization, and native mobile build pipelines
A benchmark methodology for thermal throttling on mobile silicon
A demo video that's genuinely fun to show people
Tools you'll use
llama.cpp (Metal / Vulkan / NEON) · GGUF Q4_K_M · SwiftUI / Kotlin Compose · Llama-3.2-3B-Instruct · streaming token decode