
Track 07 · Edge AI · On-Device

AI off the cloud.

Phones, browsers, NPUs, swarms, $50 boards. The runtimes, formats, recipes, and inference tricks that move LLMs out of data centers and onto every device — including a LAN of phones running a 70B model together.

Modules in this track

  • On-Device Runtimes — llama.cpp, ExecuTorch, Core ML, TFLite/LiteRT. The four production runtimes behind the vast majority of shipped mobile inference.
  • Browser (WebGPU) — WebLLM, transformers.js, WebNN. Every device with a browser is now an inference target.
  • Multimodal Edge — whisper.cpp for speech, mobile VLMs (LLaVA, Phi-3.5-Vision, MiniCPM-V) for vision. The killer phone features.
  • Distributed Edge — EXO and swarm inference. Run a 70B model across 4 devices on a Wi-Fi network with no cloud.
  • NPU Stacks — Qualcomm Hexagon and Apple Neural Engine programming models, op limits, profilers.
  • Edge Quantization — GGUF, K-quants, importance-matrix calibration. The 4-bit world.
  • Distillation & Inference — Small LLM training recipes plus speculative decoding for 2× tok/s on phone.

What you’ll be able to do after

  • Ship a 4-bit LLM running on iOS, Android, and the browser — fully offline, via three different runtimes.
  • Read llama.cpp’s source and understand the GGUF format end-to-end (a header-parsing sketch follows this list); apply the same reading to whisper.cpp.
  • Pick between Core ML, ExecuTorch, TFLite, llama.cpp, WebLLM, and a swarm cluster for any product constraint.
  • Quantize a 7B model for edge deployment while losing no more than 1–2 points on your eval.
  • Cluster 4 phones or laptops on a LAN to run a 70B model with no cloud — pipeline-parallel inference at the edge (a layer-partitioning sketch follows this list).
  • Reason about the next 18 months of edge silicon: Hexagon, ANE, AMD XDNA, ARM Ethos, MediaTek APU, and the WebGPU/WebNN browser stack.
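
To make the GGUF outcome above concrete: per the GGUF spec in the ggml repo, every file starts with a small fixed header (4-byte magic "GGUF", a uint32 version, then two uint64 counts, all little-endian), followed by variable-length metadata key/values and tensor infos. A minimal header reader, stopping before the variable-length sections:

```python
# Minimal GGUF header reader; field layout per the GGUF spec in the ggml repo.
# Stops before the metadata key/value and tensor-info sections.
import struct
import sys

def read_gguf_header(path):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        (version,) = struct.unpack("<I", f.read(4))                 # uint32
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))   # two uint64s
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}

if __name__ == "__main__":
    print(read_gguf_header(sys.argv[1]))  # point it at any .gguf file
```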
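
And for the swarm outcome: pipeline parallelism at the edge is largely a partitioning problem, deciding which contiguous block of transformer layers each device hosts. Below is a sketch of memory-weighted partitioning, in the spirit of what EXO does; the device names and memory figures are invented for illustration.

```python
# Sketch: split a model's decoder layers across LAN devices in proportion to
# each device's free memory. Toy logic only; a real scheduler would also
# weigh compute throughput and link speed.

def partition_layers(n_layers, device_mem_gb):
    """Assign contiguous half-open layer ranges proportional to memory."""
    total = sum(device_mem_gb.values())
    ranges, start = {}, 0
    devices = list(device_mem_gb.items())
    for i, (dev, mem) in enumerate(devices):
        # Last device absorbs rounding so every layer is assigned exactly once.
        count = n_layers - start if i == len(devices) - 1 else round(n_layers * mem / total)
        ranges[dev] = (start, start + count)
        start += count
    return ranges

# Llama-scale 70B models have 80 decoder layers; 4 hypothetical LAN devices.
print(partition_layers(80, {"mac-mini": 16, "iphone": 6, "pixel": 8, "laptop": 32}))
# -> {'mac-mini': (0, 21), 'iphone': (21, 29), 'pixel': (29, 39), 'laptop': (39, 80)}
```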