
Track 07 · Edge AI · On-Device

AI off the cloud.

Phones, browsers, NPUs, swarms, $50 boards. The runtimes, formats, recipes, and inference tricks that move LLMs out of data centers and onto every device — including a LAN of phones running a 70B model together.

Modules in this track

  • On-Device Runtimes — llama.cpp, ExecuTorch, Core ML, TFLite/LiteRT. The four production runtimes behind the vast majority of shipped mobile inference.
  • Browser (WebGPU) — WebLLM, transformers.js, WebNN. Every device with a browser is now an inference target.
  • Multimodal Edge — whisper.cpp for speech, mobile VLMs (LLaVA, Phi-3.5-Vision, MiniCPM-V) for vision. The killer phone features.
  • Distributed Edge — EXO and swarm inference. Run a 70B model across 4 devices on a Wi-Fi network with no cloud.
  • NPU Stacks — Qualcomm Hexagon and Apple Neural Engine programming models, op limits, profilers.
  • Edge Quantization — GGUF, K-quants, importance-matrix calibration. The 4-bit world.
  • Distillation & Inference — Small LLM training recipes plus speculative decoding for 2× tok/s on phone.

What you’ll be able to do after

  • Ship a 4-bit LLM running on iOS, Android, and the browser — fully offline, via three different runtimes.
  • Read llama.cpp’s source and understand the GGUF format end-to-end (a header-parsing sketch follows this list); apply the same reading to whisper.cpp.
  • Pick between Core ML, ExecuTorch, TFLite, llama.cpp, WebLLM, and a swarm cluster for any product constraint.
  • Quantize a 7B model for edge deployment while losing no more than 1–2 points on your eval.
  • Cluster 4 phones or laptops on a LAN to run a 70B model with no cloud — pipeline-parallel inference at the edge (a layer-partitioning sketch follows this list).
  • Reason about the next 18 months of edge silicon: Hexagon, ANE, AMD XDNA, ARM Ethos, MediaTek APU, and the WebGPU/WebNN browser stack.
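
To make the GGUF outcome above concrete: per the GGUF spec in the ggml repo, every file starts with a small fixed header (4-byte magic "GGUF", a uint32 version, then two uint64 counts, all little-endian), followed by variable-length metadata key/values and tensor infos. A minimal header reader, stopping before the variable-length sections:

```python
# Minimal GGUF header reader; field layout per the GGUF spec in the ggml repo.
# Stops before the metadata key/value and tensor-info sections.
import struct
import sys

def read_gguf_header(path):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        (version,) = struct.unpack("<I", f.read(4))                 # uint32
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))   # two uint64s
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}

if __name__ == "__main__":
    print(read_gguf_header(sys.argv[1]))  # point it at any .gguf file
```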
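
And for the swarm outcome: pipeline parallelism at the edge is largely a partitioning problem, deciding which contiguous block of transformer layers each device hosts. Below is a sketch of memory-weighted partitioning, in the spirit of what EXO does; the device names and memory figures are invented for illustration.

```python
# Sketch: split a model's decoder layers across LAN devices in proportion to
# each device's free memory. Toy logic only; a real scheduler would also
# weigh compute throughput and link speed.

def partition_layers(n_layers, device_mem_gb):
    """Assign contiguous half-open layer ranges proportional to memory."""
    total = sum(device_mem_gb.values())
    ranges, start = {}, 0
    devices = list(device_mem_gb.items())
    for i, (dev, mem) in enumerate(devices):
        # Last device absorbs rounding so every layer is assigned exactly once.
        count = n_layers - start if i == len(devices) - 1 else round(n_layers * mem / total)
        ranges[dev] = (start, start + count)
        start += count
    return ranges

# Llama-scale 70B models have 80 decoder layers; 4 hypothetical LAN devices.
print(partition_layers(80, {"mac-mini": 16, "iphone": 6, "pixel": 8, "laptop": 32}))
# -> {'mac-mini': (0, 21), 'iphone': (21, 29), 'pixel': (29, 39), 'laptop': (39, 80)}
```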