Multimodal Edge

Voice and vision are the killer mobile features: voice because typing on a phone is friction, vision because the camera is always at hand. This module covers the two most useful 2026 stacks: whisper.cpp for streaming speech (the ggml-based sibling of llama.cpp) and mobile VLMs (Phi-3.5-Vision, LLaVA-mobile, and MiniCPM-V, all under 8 GB at 4-bit quantization).