Whisper.cpp & On-Device Speech
In a cloud-serving world, “transcribe this audio” is one HTTP POST. The latency budget is round-trip + a few hundred ms of GPU time, the cost is whatever OpenAI charges per minute, and the runtime is whichever Python container the cloud team set up. The user’s microphone bytes leave the device and come back as text.
That model has three problems on a phone. Latency: 200–800 ms of round-trip is enough to make voice UX feel laggy. Cost: even at $0.006/minute, an “always-on dictation” app loses money on every active user. Privacy: the user’s voice goes over the network. So the on-device speech world has converged on a different stack — Whisper.cpp, built on the same ggml runtime that powers llama.cpp, but tuned for an encoder-decoder model with audio in and tokens out.
The shape is familiar: Python is the SDK, C++ is what runs on the device, and the runtime is a single small library you link into your app. The new wrinkles are streaming (you don’t have the whole audio file when you start) and VAD (the unsexy classifier that decides whether to even spend CPU on this 30 ms of audio). Get those right and small.en Q5_1 transcribes faster than the user can talk on a 3-year-old phone.
TL;DR
- Whisper.cpp is Georgi Gerganov’s port of OpenAI Whisper to the same ggml runtime that powers llama.cpp. Quantized models (Q5_1 / Q8_0) hit real-time-or-better on a phone CPU; Metal / Vulkan / CUDA backends run 2–4× faster on a GPU.
- The runtime exposes a tiny C API (`whisper_init_from_file`, `whisper_full`, `whisper_full_get_segment_text`) and bindings for every language. ~1 MB binary, ~244 MB for `small.en` Q5_1.
- Voice activity detection (VAD) is the unsexy piece that makes streaming actually work — Silero-VAD or webrtcvad runs in under 1% CPU and detects speech-vs-silence in 30 ms windows. Without VAD, you transcribe silence and waste battery.
- Chunking is the rest of the streaming story: feed Whisper 5–30 s windows with overlap, deduplicate the overlap, emit partial text as the latest chunk arrives. The Whisper.cpp `stream` example codifies the pattern.
- The right model size: `small.en` (244 MB) for English-only on phone, `base` (74 MB) if you’re memory-constrained, `medium` (~770 MB) on a laptop. `large-v3` only on a desktop GPU.
Why this matters
Voice is the killer phone interaction — typing on a touchscreen is the friction that AI assistants are supposed to remove. Cloud STT works but has three problems: latency (200–800 ms round trip), cost (~$0.006/min adds up), and privacy (your prompts cross the network). Whisper.cpp eliminates all three.
The 2026 reality: Whisper.cpp has been in production in serious apps for two years (MacWhisper, SuperWhisper, the iOS app Voice In Voice). It works. The lesson is how it works — because the same ggml patterns power every other on-device speech model coming next (Distil-Whisper, Moshi, Whisper-large-v3-turbo).
Mental model
Three buffer scales:
- 30 ms VAD frames — cheap classifier deciding whether to even spend CPU on this audio.
- 5–30 s Whisper window — the model’s native context. Below 5 s the encoder gets confused; above 30 s you’re past the model’s training context.
- Final transcript — concatenated, deduplicated text from all the windows.
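At Whisper’s 16 kHz mono input rate, the three scales translate into concrete buffer sizes — the numbers below are simple arithmetic from the sample rate, not measured values:

```python
SAMPLE_RATE = 16_000  # Whisper expects 16 kHz mono float PCM

def samples_for(ms):
    """Number of PCM samples in a window of the given duration."""
    return SAMPLE_RATE * ms // 1000

vad_frame  = samples_for(30)      # one VAD decision: 480 samples
min_window = samples_for(5_000)   # shortest useful Whisper window: 80,000 samples
max_window = samples_for(30_000)  # native training context: 480,000 samples

# A full 30 s window of float32 PCM is ~1.9 MB — small enough to keep in RAM.
print(vad_frame, min_window, max_window, max_window * 4)
```

So the streaming buffers are tiny compared to the model weights; memory pressure comes from the model, not the audio.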
What Whisper actually does internally
Whisper is an encoder-decoder transformer trained on 680K hours of multilingual audio. Two halves:
- Encoder: 80-channel log-mel spectrogram → 6/12/24 transformer blocks. Output: `(seq_len, d_model)` features. Runs once per audio window.
- Decoder: 6/12/24 transformer blocks with cross-attention to the encoder output. Generates tokens autoregressively, one at a time, like any LLM. Tokens are byte-pair pieces over a 51,865-piece tokenizer.
The decoder is GPT-2-shaped. Whisper.cpp’s ggml runtime treats it identically to a Llama decode — same KV cache, same speculative-decoding tricks apply (you can speculative-decode Whisper-large with Distil-Whisper as the draft model).
The Whisper.cpp C API
This is what runs on the device — a few function calls into a static library:
```c
// Load + warm up
struct whisper_context* ctx = whisper_init_from_file("ggml-small.en-q5_1.bin");

// Configure
struct whisper_full_params p = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
p.print_progress = false;
p.print_realtime = false;
p.language       = "en";
p.translate      = false;
p.n_threads      = 4;

// Run on a 30 s audio buffer (16 kHz mono float)
whisper_full(ctx, p, audio_pcm, n_samples);

// Pull segments
const int n = whisper_full_n_segments(ctx);
for (int i = 0; i < n; i++) {
    const char* text = whisper_full_get_segment_text(ctx, i);
    int64_t t0 = whisper_full_get_segment_t0(ctx, i); // start, hundredths of a second
    int64_t t1 = whisper_full_get_segment_t1(ctx, i);
    printf("[%lld..%lld] %s\n", (long long)t0, (long long)t1, text);
}
whisper_free(ctx);
```

That’s the whole runtime API. Bindings exist for Swift, Kotlin, Python, JS — they all wrap this same C surface.
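One detail worth internalizing: segment timestamps come back in hundredths of a second, not milliseconds. A small helper (`format_ts` is our name, not part of any binding) turns them into subtitle-style strings:

```python
def format_ts(centis):
    """Format a Whisper segment timestamp (hundredths of a second,
    as returned by whisper_full_get_segment_t0/t1) as HH:MM:SS.cc."""
    total_cs = int(centis)
    cs = total_cs % 100            # leftover hundredths
    s  = (total_cs // 100) % 60    # seconds
    m  = (total_cs // 6000) % 60   # minutes
    h  = total_cs // 360000        # hours
    return f"{h:02d}:{m:02d}:{s:02d}.{cs:02d}"

# A segment spanning t0=0 to t1=473 covers 0.00 s .. 4.73 s of audio:
print(format_ts(0), format_ts(473))  # 00:00:00.00 00:00:04.73
```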
VAD: the streaming linchpin
A naive streaming impl runs Whisper on every 5-second window of audio. That’s ~1 W of CPU continuously. The fix: only run Whisper when there’s actually speech.
The orchestration code is usually written in the host language — Python in the desktop case, Swift/Kotlin on phones. The math is light enough that it doesn’t matter:
```python
# pseudo-code with silero-vad
import silero_vad

vad = silero_vad.load()  # ~1 MB model
buffer = []
in_speech = False
silence_ms = 0

for frame_30ms in audio_stream():
    is_speech = vad(frame_30ms)
    if is_speech:
        buffer.append(frame_30ms)
        in_speech = True
        silence_ms = 0
    elif in_speech:
        silence_ms += 30
        buffer.append(frame_30ms)  # keep trailing silence (Whisper expects it)
        if silence_ms > 600:       # end of utterance
            transcribe(buffer)
            buffer = []
            in_speech = False
```

Silero-VAD: ~1 MB, under 0.1 ms per 30 ms frame on a phone CPU. webrtcvad is even smaller (a hand-tuned signal-processing classifier — no neural net). Either is fine.
The 600 ms silence threshold is the magic number — long enough to cleanly separate utterances, short enough that the user doesn’t feel the app is dragging.
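To make the gating loop runnable without downloading a model, here is a toy energy-threshold VAD — `energy_vad` and its threshold are our inventions, and a real VAD (Silero, webrtcvad) is far more robust against noise; this only illustrates what the `vad(frame)` call in the loop above decides:

```python
import math

def energy_vad(frame, threshold=0.01):
    """Toy VAD: flag a 30 ms frame (list of float samples) as speech
    when its RMS energy exceeds a fixed threshold."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return rms > threshold

silence = [0.0] * 480                                   # 30 ms of silence at 16 kHz
tone    = [0.1 * math.sin(i / 8) for i in range(480)]   # crude stand-in for voiced audio

print(energy_vad(silence), energy_vad(tone))  # False True
```

Pure energy thresholding fails on keyboard clicks and breath noise, which is exactly why Silero’s small neural classifier earns its 1 MB.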
Chunking with overlap
Whisper is trained on 30 s windows. For longer audio you need overlapping chunks:
- Chunk every 30 s with a 5 s overlap.
- Run Whisper on chunk N.
- The first 5 s of chunk N’s transcript should match the last 5 s of chunk N-1’s transcript. Use that to align and deduplicate.
- Emit the non-overlapping portion of each chunk to the user.
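The align-and-deduplicate step can be sketched at the word level — a simplification (`merge_chunks` is our name, and production code also uses the segment timestamps to anchor the alignment):

```python
def merge_chunks(prev_words, next_words, max_overlap=30):
    """Merge two overlapping chunk transcripts: find the longest run of
    words that both ends prev_words and starts next_words, then drop
    the duplicated run from the second chunk."""
    best = 0
    for k in range(1, min(max_overlap, len(prev_words), len(next_words)) + 1):
        if prev_words[-k:] == next_words[:k]:
            best = k
    return prev_words + next_words[best:]

a = "the quick brown fox jumps over".split()   # chunk N-1 transcript
b = "fox jumps over the lazy dog".split()      # chunk N transcript (5 s overlap)
print(" ".join(merge_chunks(a, b)))
# the quick brown fox jumps over the lazy dog
```

The fragile case is when Whisper transcribes the overlap differently in the two chunks (it happens at chunk edges); timestamp-based alignment is the usual fallback.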
Whisper.cpp’s stream example handles all of this. For true low-latency streaming (token-by-token as the user speaks) the trick is to run Whisper on a sliding 5 s window with frequent re-runs — accept that the early text changes as more audio arrives. This is the SuperWhisper approach.
Quantization choices, in numbers
| Model | FP16 size | Q5_1 size | Q8_0 size | Phone speed | Quality drop vs FP16 |
|---|---|---|---|---|---|
| tiny.en | 75 MB | 27 MB | 42 MB | 8× realtime | ~negligible |
| base.en | 142 MB | 50 MB | 78 MB | 4× realtime | ~negligible |
| small.en | 466 MB | 244 MB | 392 MB | 2× realtime | under 1 WER point |
| medium | 1.5 GB | 770 MB | 1.2 GB | ~realtime on M2 only | under 1 WER point |
| large-v3 | 3 GB | 1.6 GB | 2.4 GB | desktop GPU only | best |
`small.en` Q5_1 is the production sweet spot for English-only mobile; `base.en` if you’re squeezed on RAM. The Distil-Whisper variants (`distil-small.en`, ~166 MB) match `small.en` quality at half the size and 2× the speed — released by Hugging Face in late 2023, fully ggml-compatible.
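That selection logic fits in a few lines. A heuristic picker based on the Q5_1 sizes in the table above — the RAM thresholds are rules of thumb we chose for illustration, not hard limits:

```python
def pick_model(ram_budget_mb, english_only=True, desktop_gpu=False):
    """Pick a Whisper model size from a RAM budget for the weights
    (Q5_1 quantization assumed). Thresholds are rough guidelines."""
    if desktop_gpu:
        return "large-v3"                             # 1.6 GB, best quality
    if ram_budget_mb >= 800:
        return "medium"                               # 770 MB, laptop class
    if ram_budget_mb >= 300:
        return "small.en" if english_only else "small"  # 244 MB sweet spot
    if ram_budget_mb >= 80:
        return "base.en" if english_only else "base"    # 50 MB
    return "tiny.en" if english_only else "tiny"        # 27 MB floor

print(pick_model(400))   # small.en
print(pick_model(100))   # base.en
```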
What “real-time” means honestly
“Real-time factor (RTF)” = (model latency) / (audio duration). RTF=1 means you can transcribe live. RTF=0.5 means twice as fast as live.
- Whisper.cpp `small.en` Q5_1 on iPhone 15 Pro: RTF ~0.4–0.5 (live transcription with a small lag).
- `small.en` Q5_1 on iPhone 11: RTF ~0.8 (just barely keeps up).
- `medium` on iPhone 15 Pro: RTF ~1.2 (falls behind on long audio).
- `medium` on M2 Air: RTF ~0.3 (very comfortable live).
The streaming UX target: end-to-end user-perceived lag under 500 ms. With VAD’s 600 ms post-silence delay plus Whisper’s ~200 ms inference, you’re at 800 ms — which feels snappy but not instantaneous. The lower bound is dominated by VAD’s silence-end detection, not Whisper.
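That 800 ms figure falls out of a simple linear model: wait for the VAD to declare end-of-utterance, then run inference at RTF × audio length. The model (and `perceived_lag_ms`) is our simplification — real inference time also has a fixed startup component:

```python
def perceived_lag_ms(vad_silence_ms, audio_s, rtf):
    """Utterance-at-a-time lag: the VAD must observe vad_silence_ms of
    silence before handing the buffer to Whisper, then inference takes
    roughly rtf * audio duration."""
    return vad_silence_ms + rtf * audio_s * 1000

# 600 ms threshold, a 0.5 s utterance, small.en Q5_1 at RTF ~0.4:
print(round(perceived_lag_ms(600, 0.5, 0.4)))  # 800
```

Note what the formula says: halving the model’s RTF saves 100 ms here, while trimming the VAD threshold to 400 ms saves 200 ms — which is why the silence threshold, not the model, is the first knob to tune.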
Run it in your browser
A useful demo: simulate a streaming Whisper session — VAD-gated chunks, with the quality-vs-latency trade-off visible and the math behind RTF worked through.
The headline insight: VAD silence threshold dominates perceived latency more than the model size does, until you pick a model the device can’t keep up with.
Key takeaways
- Whisper.cpp is Whisper running on the ggml runtime — same C-API DNA as llama.cpp, ships everywhere ggml ships.
- `small.en` Q5_1 is the production English mobile sweet spot: 244 MB, RTF ~0.4 on a recent phone.
- VAD is required for streaming — Silero-VAD or webrtcvad. Cuts CPU by 5–10× and improves UX.
- 600 ms silence threshold is the magic UX number — short enough to feel snappy, long enough to cleanly separate utterances.
- Distil-Whisper is a same-quality, smaller, faster variant — production-ready since late 2023.
- Thermal throttling is the iOS gotcha — older chips fall behind after ~30 s of continuous transcription.
Go deeper
- Repo: ggerganov/whisper.cpp — The reference implementation. Read `examples/stream/stream.cpp` for the canonical streaming pattern.
- Paper: Robust Speech Recognition via Large-Scale Weak Supervision — The original Whisper paper. Useful for the encoder/decoder architecture and the multilingual training story.
- Paper: Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling — Distil-Whisper details: same quality, half the params, 2× the speed. Fully ggml-compatible.
- Repo: snakers4/silero-vad — The VAD model: 1 MB, real-time on phone CPU. The streaming-Whisper companion.
- Blog: Speculative Decoding for 2× Faster Whisper Inference — Whisper-large + Distil-Whisper draft = 2× speedup, bit-identical output. Same trick we cover for LLMs in [Speculative Decoding](../distillation/speculative-decoding).
- Video: Whisper.cpp: A Deep Dive — The author walks the codebase. Useful if you're extending the runtime.
- Docs: OpenAI Whisper API — Reference for the original cloud-API behaviors. Whisper.cpp matches the model semantics; the API surface differs.