Track 07 · Edge AI · On-Device
AI off the cloud.
Phones, browsers, NPUs, swarms, $50 boards. The runtimes, formats, recipes, and inference tricks that move LLMs out of data centers and onto every device — including a LAN of phones running 70B together.
- llama.cpp, ExecuTorch, Core ML, TFLite/LiteRT. The four production runtimes that ship 99% of mobile inference.
- WebLLM, transformers.js, WebNN. Every device with a browser is now an inference target.
- Whisper.cpp for speech, mobile VLMs (LLaVA, Phi-3.5-Vision, MiniCPM-V) for vision. The killer phone features.
- EXO and swarm inference. Run a 70B across four devices on a Wi-Fi network with no cloud.
- Qualcomm Hexagon and Apple Neural Engine programming models, op limits, profilers.
- GGUF, K-quants, importance-matrix calibration. The 4-bit world.
- Small-LLM training recipes, plus speculative decoding for 2× tok/s on a phone.
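The GGUF container mentioned above begins with a small fixed-size preamble before the metadata key/value section. A minimal sketch of parsing that preamble, with the field layout taken from the public GGUF spec (the synthetic `blob` below is fabricated purely for illustration):

```python
import struct

GGUF_MAGIC = b"GGUF"

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF preamble (all fields little-endian).

    Per the GGUF spec: 4-byte magic, uint32 version,
    uint64 tensor_count, uint64 metadata_kv_count.
    """
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: {data[:4]!r}")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header for demonstration: version 3, 2 tensors, 5 metadata keys.
blob = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
hdr = parse_gguf_header(blob)
```

The metadata key/value pairs and tensor descriptors that follow this preamble are typed and variable-length; a real reader walks them sequentially using the same little-endian conventions.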
- Ship a 4-bit LLM running on iOS, Android, and the browser — fully offline, three different runtimes.
- Read llama.cpp’s source and understand the GGUF format end-to-end; apply the same source-level fluency to Whisper.cpp.
- Pick between Core ML, ExecuTorch, TFLite, llama.cpp, WebLLM, and a swarm cluster for any product constraint.
- Quantize a 7B for edge deployment without losing more than 1–2 pts on your eval.
- Cluster four phones or laptops on a LAN to run a 70B model with no cloud: pipeline-parallel inference at the edge.
- Reason about the next 18 months of edge silicon: Hexagon, ANE, AMD XDNA, ARM Ethos, MediaTek APU, and the WebGPU/WebNN browser stack.
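To make the "4-bit world" concrete: the outcome on quantizing a 7B rests on per-block quantization, where each block of weights shares one scale and every weight collapses to a small integer code. A toy round-trip, assuming a symmetric int4 scheme with codes in [-7, 7] (real K-quants add per-block mins, sub-block scales, and bit-packing, which this sketch omits):

```python
import random

def quantize_q4(block):
    """Toy symmetric 4-bit quantization of one block of weights."""
    amax = max(abs(x) for x in block)
    scale = amax / 7.0  # map the largest magnitude to code +/-7
    if scale == 0.0:
        return [0] * len(block), 0.0
    q = [max(-7, min(7, round(x / scale))) for x in block]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from codes and the block scale."""
    return [c * scale for c in q]

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(32)]   # one 32-weight block
q, s = quantize_q4(w)
w_hat = dequantize(q, s)
# Rounding error per weight is bounded by half the scale.
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Importance-matrix calibration refines exactly this step: instead of minimizing raw rounding error, the scales are chosen to minimize error weighted by how much each weight actually matters on calibration data.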