
Whisper.cpp & On-Device Speech

In a cloud-serving world, “transcribe this audio” is one HTTP POST. The latency budget is round-trip + a few hundred ms of GPU time, the cost is whatever OpenAI charges per minute, and the runtime is whichever Python container the cloud team set up. The user’s microphone bytes leave the device and come back as text.

That model has three problems on a phone. Latency: 200–800 ms of round-trip is enough to make voice UX feel laggy. Cost: even at $0.006/minute, an “always-on dictation” app loses money on every active user. Privacy: the user’s voice goes over the network. So the on-device speech world has converged on a different stack — Whisper.cpp, built by the same author on the same ggml runtime that powers llama.cpp, but tuned for an encoder-decoder model with audio in and tokens out.

The shape is familiar: Python is the SDK, C++ is what runs on the device, and the runtime is a single small library you link into your app. The new wrinkles are streaming (you don’t have the whole audio file when you start) and VAD (the unsexy classifier that decides whether to even spend CPU on this 30 ms of audio). Get those right and small.en Q5_1 transcribes faster than the user can talk on a 3-year-old phone.

TL;DR

  • Whisper.cpp is Georgi Gerganov’s port of OpenAI Whisper to the same ggml runtime that powers llama.cpp. Quantized models (Q5_1 / Q8_0) hit real-time-or-better on a phone CPU; Metal / Vulkan / CUDA backends 2–4× that on a GPU.
  • The runtime exposes a tiny C API (whisper_init_from_file, whisper_full, whisper_full_get_segment_text) and bindings for every language. ~1 MB binary, ~244 MB for small.en Q5_1.
  • Voice activity detection (VAD) is the unsexy piece that makes streaming actually work — Silero-VAD or webrtcvad runs in under 1% CPU and detects speech-vs-silence in 30 ms windows. Without VAD, you transcribe silence and waste battery.
  • Chunking is the rest of the streaming story: feed Whisper 5–30 s windows with overlap, deduplicate the overlap, emit partial text as the latest chunk arrives. The Whisper.cpp stream example codifies the pattern.
  • The right model size: small.en (244 MB Q5_1) for English-only on phone, base.en (50 MB Q5_1) if you’re memory-constrained, medium (~770 MB Q5_1) on a laptop. large-v3 only on a desktop GPU.

Why this matters

Voice is the killer phone interaction — typing on a touchscreen is the friction that AI assistants are supposed to remove. Cloud STT works but has three problems: latency (200–800 ms round trip), cost (~$0.006/min adds up), and privacy (your prompts cross the network). Whisper.cpp eliminates all three.

The 2026 reality: Whisper.cpp has been in production in serious apps for two years (MacWhisper, SuperWhisper, the iOS app Voice In Voice). It works. The lesson is how it works — because the same ggml patterns power every other on-device speech model coming next (Distil-Whisper, Moshi, Whisper-large-v3-turbo).

Mental model

Three buffer scales:

  • 30 ms VAD frames — cheap classifier deciding whether to even spend CPU on this audio.
  • 5–30 s Whisper window — the model’s native context. Below 5 s the encoder gets confused; above 30 s you’re past the model’s training context.
  • Final transcript — concatenated, deduplicated text from all the windows.

What Whisper actually does internally

Whisper is an encoder-decoder transformer trained on 680K hours of multilingual audio. Two halves:

  1. Encoder: 80-channel log-mel spectrogram → 6/12/24 transformer blocks (base/small/medium). Output: (seq_len, d_model) features. Runs once per audio window.
  2. Decoder: 6/12/24 transformer blocks with cross-attention to the encoder output. Generates tokens autoregressively, one at a time, like any LLM. Tokens are byte-pair pieces over a 51,865-entry vocabulary.

The decoder is GPT-2-shaped. Whisper.cpp’s ggml runtime treats it identically to a Llama decode — same KV cache, same speculative-decoding tricks apply (you can speculative-decode Whisper-large with Distil-Whisper as the draft model).
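To make those shapes concrete, here is a back-of-envelope walkthrough in Python using small.en’s published dimensions (d_model = 768); the frame arithmetic (10 ms mel hop, stride-2 conv stem) is standard Whisper, nothing Whisper.cpp-specific:

# Tensor shapes for one 30 s window through small.en.
SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono float PCM
HOP_LENGTH  = 160      # 10 ms per mel frame -> 100 frames per second
N_MELS      = 80       # log-mel channels (large-v3 uses 128)
D_MODEL     = 768      # small / small.en hidden size

audio_s   = 30
n_samples = audio_s * SAMPLE_RATE      # 480,000 floats
n_frames  = n_samples // HOP_LENGTH    # 3,000 mel frames
enc_len   = n_frames // 2              # stride-2 conv stem -> 1,500 positions

print(f"mel spectrogram: ({N_MELS}, {n_frames})")   # (80, 3000)
print(f"encoder output:  ({enc_len}, {D_MODEL})")   # (1500, 768)
# The decoder cross-attends to those 1,500 positions while emitting text
# tokens one at a time, exactly the decode loop ggml already implements.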

The Whisper.cpp C API

This is what runs on the device — a few function calls into a static library:

// Load + warm up
struct whisper_context* ctx = whisper_init_from_file("ggml-small.en-q5_1.bin");

// Configure
struct whisper_full_params p = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
p.print_progress = false;
p.print_realtime = false;
p.language = "en";
p.translate = false;
p.n_threads = 4;

// Run on a 30 s audio buffer (16 kHz mono float)
whisper_full(ctx, p, audio_pcm, n_samples);

// Pull segments
const int n = whisper_full_n_segments(ctx);
for (int i = 0; i < n; i++) {
    const char* text = whisper_full_get_segment_text(ctx, i);
    int64_t t0 = whisper_full_get_segment_t0(ctx, i); // start, hundredths of a second
    int64_t t1 = whisper_full_get_segment_t1(ctx, i); // end, same units
    printf("[%lld..%lld] %s\n", t0, t1, text);
}
whisper_free(ctx);

That’s the whole runtime API. Bindings exist for Swift, Kotlin, Python, JS — they all wrap this same C surface.

VAD: the streaming linchpin

A naive streaming impl runs Whisper on every 5-second window of audio. That’s ~1 W of CPU continuously. The fix: only run Whisper when there’s actually speech.
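Back-of-envelope battery math shows why (the ~1 W figure is from above; the 12 Wh battery and the ~10× duty-cycle cut are assumptions in line with the 5–10× claim in the takeaways):

# Rough battery arithmetic. Assumptions: ~12 Wh phone battery, the
# ~1 W always-on figure above, a ~10x average CPU cut from VAD gating.
BATTERY_WH = 12.0

for label, watts in [("naive always-on", 1.0), ("VAD-gated (~10x cut)", 0.1)]:
    pct_per_hour = watts / BATTERY_WH * 100
    print(f"{label:22s} {pct_per_hour:.1f}% battery per hour")
# naive always-on        8.3% battery per hour
# VAD-gated (~10x cut)   0.8% battery per hour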

The orchestration code is usually written in the host language — Python in the desktop case, Swift/Kotlin on phones. The math is light enough that it doesn’t matter:

# pseudo-code with silero-vad
import silero_vad

vad = silero_vad.load()   # ~1 MB model

buffer = []
in_speech = False
silence_ms = 0

for frame_30ms in audio_stream():
    is_speech = vad(frame_30ms)
    if is_speech:
        buffer.append(frame_30ms)
        in_speech = True
        silence_ms = 0
    elif in_speech:
        silence_ms += 30
        buffer.append(frame_30ms)   # keep trailing silence (Whisper expects it)
        if silence_ms > 600:        # end of utterance
            transcribe(buffer)
            buffer = []
            in_speech = False

Silero-VAD: ~1 MB, under 0.1 ms per 30 ms frame on a phone CPU. webrtcvad is even smaller (a hand-tuned signal-processing classifier — no neural net). Either is fine.
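The silero snippet above is pseudo-code; for a concrete API, here is the same frame-level gate with the real py-webrtcvad package, whose is_speech() takes 16-bit mono PCM at 8/16/32/48 kHz in 10/20/30 ms frames:

# Frame-level speech gating with py-webrtcvad (pip install webrtcvad).
import webrtcvad

SR = 16_000
FRAME_MS = 30
FRAME_BYTES = SR * FRAME_MS // 1000 * 2   # 480 samples x 2 bytes = 960

vad = webrtcvad.Vad(2)   # aggressiveness 0 (permissive) .. 3 (strict)

def speech_flags(pcm: bytes):
    """Yield (is_speech, frame) for each 30 ms frame of raw 16-bit PCM."""
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[off:off + FRAME_BYTES]
        yield vad.is_speech(frame, SR), frame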

The 600 ms silence threshold is the magic number — long enough to cleanly separate utterances, short enough that the user doesn’t feel the app is dragging.

Chunking with overlap

Whisper is trained on 30 s windows. For longer audio you need overlapping chunks:

  • Chunk every 30 s with a 5 s overlap.
  • Run Whisper on chunk N.
  • The first 5 s of chunk N’s transcript should match the last 5 s of chunk N-1’s transcript. Use that to align and deduplicate.
  • Emit the non-overlapping portion of each chunk to the user.

Whisper.cpp’s stream example handles all of this. For true low-latency streaming (token-by-token as the user speaks) the trick is to run Whisper on a sliding 5 s window with frequent re-runs — accept that the early text changes as more audio arrives. This is the SuperWhisper approach.
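A minimal sketch of the dedup step: word-level suffix/prefix matching against a hypothetical transcribe() helper. Production code (the stream example included) aligns on Whisper’s segment timestamps instead, which is more robust than text matching:

# Overlap-dedup sketch: 30 s chunks advanced by 25 s (5 s overlap).
SR, WINDOW_S, OVERLAP_S = 16_000, 30, 5
STEP = (WINDOW_S - OVERLAP_S) * SR

def stitch(audio, transcribe):
    """transcribe(chunk) -> str is a stand-in for a Whisper call."""
    out, prev_words = [], []
    for start in range(0, len(audio), STEP):
        cur_words = transcribe(audio[start:start + WINDOW_S * SR]).split()
        # The longest suffix of the previous chunk that prefixes this one
        # is the overlap region; skip it so no text is emitted twice.
        k = 0
        for n in range(min(len(prev_words), len(cur_words)), 0, -1):
            if prev_words[-n:] == cur_words[:n]:
                k = n
                break
        out.extend(cur_words[k:])
        prev_words = cur_words
    return " ".join(out)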

Quantization choices, in numbers

Model      | FP16 size | Q5_1 size | Q8_0 size | Phone speed           | Quality drop vs FP16
tiny.en    | 75 MB     | 27 MB     | 42 MB     | 8× realtime           | ~negligible
base.en    | 142 MB    | 50 MB     | 78 MB     | 4× realtime           | ~negligible
small.en   | 466 MB    | 244 MB    | 392 MB    | 2× realtime           | under 1 WER point
medium     | 1.5 GB    | 770 MB    | 1.2 GB    | ~realtime on M2 only  | under 1 WER point
large-v3   | 3 GB      | 1.6 GB    | 2.4 GB    | desktop GPU only      | best

small.en Q5_1 is the production sweet spot for English-only mobile. base if you’re squeezed on RAM. The Distil-Whisper variants (distil-small.en, ~166 MB) match small.en quality at half the size and 2× the speed — released by Hugging Face in late 2023, fully ggml-compatible.

What “real-time” means honestly

“Real-time factor” (RTF) = (processing time) / (audio duration). RTF = 1 is break-even: you can just barely transcribe live. RTF = 0.5 means twice as fast as live.

  • Whisper.cpp small.en Q5_1 on iPhone 15 Pro: RTF ~0.4–0.5 (live transcription with a small lag).
  • Same on iPhone 11: RTF ~0.8 (just barely keeps up).
  • medium on iPhone 15 Pro: RTF ~1.2 (falls behind on long audio).
  • medium on M2 Air: RTF ~0.3 (very comfortable live).

The streaming UX target is end-to-end user-perceived lag under 500 ms. With VAD’s 600 ms post-silence delay plus Whisper’s ~200 ms inference you’re at ~800 ms: above the target, though it still feels responsive rather than instantaneous. The lower bound is dominated by VAD’s silence-end detection, not by Whisper.

Run it yourself

A useful exercise: simulate a streaming Whisper session — VAD-gated chunks, with the quality-vs-latency trade-off and the RTF math made visible. Tweak the model and the silence threshold to see end-to-end perceived latency.
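A minimal sketch of that simulation. Assumptions: a sliding-window streamer that re-runs Whisper roughly once per second, so end-of-utterance lag is the silence timeout plus one final pass over ~1 s of audio, and the iPhone 15 Pro RTF figures from the benchmarks above (tiny.en’s inferred from “8× realtime”):

# End-of-utterance lag model: lag = silence_ms + rtf * 1000 (one final
# pass over the last ~1 s of audio; earlier audio is already transcribed).
MODELS = {"tiny.en": 0.12, "small.en": 0.45, "medium": 1.20}

for silence_ms in (300, 600, 1000):
    row = [f"{name} {silence_ms + rtf * 1000:5.0f} ms"
           for name, rtf in MODELS.items()]
    print(f"silence={silence_ms:4d} ms | " + " | ".join(row))
# For any model the device keeps up with (RTF < 1), moving the silence
# threshold shifts the total far more than swapping models does; at
# RTF >= 1 the streamer falls behind and the printed lag is only a floor.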

The headline insight: VAD silence threshold dominates perceived latency more than the model size does, until you pick a model the device can’t keep up with.

Quick check

You ship a Whisper.cpp streaming feature on iPhone. Users on iPhone 11 (A13) report “it works for the first 30 seconds then gets really slow”. iPhone 15 (A17) users have no issue. What’s the most likely cause and fix?

Key takeaways

  1. Whisper.cpp is Whisper running on the ggml runtime — same C-API DNA as llama.cpp, ships everywhere ggml ships.
  2. small.en Q5_1 is the production English mobile sweet spot: 244 MB, RTF ~0.4 on a recent phone.
  3. VAD is required for streaming — Silero-VAD or webrtcvad. Cuts CPU by 5–10× and improves UX.
  4. 600 ms silence threshold is the magic UX number — short enough to feel snappy, long enough to cleanly separate utterances.
  5. Distil-Whisper is a same-quality, smaller, faster variant — production-ready in late 2023.
  6. Thermal throttling is the iOS gotcha — older chips fall behind after ~30 s of continuous transcription.

