WebGPU & WebLLM
In a cloud-serving stack, “ship to every device on the internet” is a punchline, not a goal. You ship to a Linux container in someone’s data center; the user’s device runs a thin client that talks HTTPS. The runtime, the model, the GPU all live behind your API.
The browser is the only target where “ship to every device” is literal. A static-hosted website on GitHub Pages can deliver a 1B LLM that runs entirely client-side, in any modern browser, on any device with a GPU — Mac, Windows, Linux, iPhone, Android, Chromebook. No server, no API key, no rate limit, no privacy concerns. The model lives in IndexedDB; the app keeps working in airplane mode.
The trick that makes this work is WebGPU — a W3C standard that exposes the device’s GPU through a common shader language called WGSL. WebGPU compiles the same WGSL kernel to Metal on Apple, D3D12 on Windows, and Vulkan on Linux/Android. WebLLM (the @mlc-ai/web-llm library) builds on top of that: PyTorch checkpoints are compiled (offline, by the MLC team) into WGSL kernel bundles plus quantized weights, and the browser tab JIT-compiles the WGSL on first run. Python is only the SDK that produces the kernel bundle; what runs on the device is TypeScript orchestration calling WGSL on whichever GPU stack the OS provides.
That’s the strange and useful shape of this stack: the production runtime isn’t a binary you ship — it’s a webpage that loads ~750 MB of weights into IndexedDB, compiles a hundred shaders, and starts streaming tokens. The compute happens on the user’s GPU, costs you nothing, and scales linearly with users for free.
TL;DR
- WebGPU is the W3C-standard browser GPU API. Stable in Chrome 113+, Edge 113+, Safari 18+; Firefox is behind a flag. It compiles WGSL (a Rust-flavored shader language) to whatever the device’s GPU stack speaks: Metal on Apple, D3D12 on Windows, Vulkan on everything else.
- WebLLM (@mlc-ai/web-llm) compiles transformers (Llama, Phi, Mistral, Gemma) to WGSL via the MLC compiler stack. The browser tab becomes the runtime. ~750 MB weights for 1B Q4F16; ~4 GB for 7B Q4F16; cached in IndexedDB.
- transformers.js is the alternative — runs ONNX models via ONNX Runtime Web (WebGPU + WASM). Lower performance than WebLLM for LLMs, but a richer model zoo (BERT, CLIP, Whisper, segmentation).
- WebNN is the upcoming W3C neural network API (not GPU shaders, NN ops). Currently shipping in Chrome 130+ on Windows + macOS. Targets the platform NPU (ANE on macOS, DirectML on Windows). Faster + more power-efficient than WebGPU for LLMs but the model coverage is still small.
- The killer property: a static-hosted website on GitHub Pages can deliver a 1B LLM that runs entirely client-side. No server, no API key, no rate limit, no privacy concerns.
Why this matters
Every other on-device path requires a native app. The browser is the only target where you can ship to “every device that exists on the internet” with a single static deploy. The 2026 inflection: WebGPU on iOS Safari shipped in 2024, on Android Chrome in 2023, on M-series Macs forever. Suddenly every shipping device can run a small LLM in a tab.
This unlocks:
- Privacy-by-default LLM apps — a code-explainer extension, a journaling app, a kid’s tutor — where prompts literally never leave the device.
- Offline-capable PWAs — the model lives in IndexedDB; the app keeps working in airplane mode.
- Zero-cost demos — a $0/month static site with a working LLM that scales to a million users (the compute is on their devices).
- Embed-anywhere widgets — a doc page on Stripe.com can run an LLM-powered help widget without round-tripping to OpenAI.
The trade-off: the browser is a sandbox. You get ~50% of native performance and you pay a 30–60 second cold-load penalty on first visit (warm loads drop to 2–5 s). Both shrink every quarter.
Mental model
Two layers matter. The compile pipeline runs once when the model is built: PyTorch checkpoint → MLC’s TVM compiler → WGSL kernels + a metadata file. The runtime runs on each tab open: download the kernels + weights (or hit the IndexedDB cache), JIT-compile the WGSL on the device’s GPU, run inference. The first compile pass is where Chrome freezes for 3–5 s the first time someone opens your app.
What’s actually in @mlc-ai/web-llm
The user surface is TypeScript — that’s what real apps call. The kernels underneath are WGSL.
import { CreateMLCEngine } from "@mlc-ai/web-llm"

const engine = await CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  {
    initProgressCallback: (p) => {
      // p.progress is a float 0..1; p.text is "Fetching weights / Compiling…"
      console.log(p.text, p.progress)
    },
  }
)

const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Why is the sky blue?" }],
  stream: true,
})

for await (const chunk of response) {
  const delta = chunk.choices[0].delta.content
  // Append to whatever element renders the reply (an assumed <pre id="output">);
  // process.stdout doesn't exist in a browser tab.
  document.getElementById("output")!.append(delta ?? "")
}

That’s the whole API. It’s deliberately OpenAI-compatible — chat.completions.create looks identical to openai.chat.completions.create. You can swap an OpenAI call for a local one by changing the import.
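To make the drop-in claim concrete, here is a small sketch of our own (not a WebLLM API) that issues the same completion call against either backend; openai stands in for an official OpenAI client instance and isn't constructed here:

```ts
// Hypothetical helper: any object exposing the OpenAI-style chat surface works.
type ChatBackend = {
  chat: { completions: { create: (req: object) => Promise<any> } }
}

async function explain(backend: ChatBackend, code: string): Promise<string> {
  const res = await backend.chat.completions.create({
    messages: [{ role: "user", content: `Explain this code:\n${code}` }],
  })
  return res.choices[0].message.content
}

// await explain(engine, snippet)  // local, in-tab (the WebLLM engine from above)
// await explain(openai, snippet)  // hosted API; same call site, different import
```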
A glance at the WGSL kernel underneath
When the engine “compiles shaders” on first run, this is what’s getting compiled — a WGSL compute shader that looks a lot like CUDA, but in W3C-standard syntax:
// A simplified Q4 dequant + matmul tile in WGSL.
enable f16;

const D : u32 = 4096u;  // hidden dimension; illustrative

@group(0) @binding(0) var<storage, read> weights_q4 : array<u32>;  // packed 4-bit weights
@group(0) @binding(1) var<storage, read> scales : array<f16>;      // per-group scale
@group(0) @binding(2) var<storage, read> input : array<f16>;       // activations
@group(0) @binding(3) var<storage, read_write> out : array<f16>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let row = gid.x;
  var acc : f32 = 0.0;
  for (var k : u32 = 0u; k < D; k = k + 1u) {
    // Each u32 packs eight 4-bit weights; pick out the nibble for this k.
    let packed = weights_q4[(row * D + k) >> 3];
    let w_q = (packed >> ((k & 7u) * 4u)) & 0xFu;
    let w = (f32(w_q) - 8.0) * f32(scales[(row * D + k) / 32u]);
    acc = acc + w * f32(input[k]);
  }
  out[row] = f16(acc);
}

That kernel is the same shape you’d write in CUDA C++ — workgroups, shared bindings, an inner-product loop — except the device that runs it is whatever GPU the user happens to have.
What CreateMLCEngine actually does on first run
1. Fetches mlc-chat-config.json (model metadata: vocab size, context length, quantization).
2. Fetches *.bin shards (the quantized weights — ~750 MB for 1B Q4F16, in 50 MB shards). Streams them into IndexedDB as they download.
3. Fetches *.wasm (the model’s WGSL kernels packaged as a WebAssembly module).
4. Initializes a WebGPU device: navigator.gpu.requestAdapter() → adapter.requestDevice().
5. The freeze: compiles every WGSL shader. ~50–200 shaders per model. The browser shows the tab as unresponsive for 3–5 s on first load.
6. Loads weights from IndexedDB into GPU buffers.
7. Returns the engine.
On subsequent loads (same model_id + same browser), steps 1–3 hit IndexedDB instead of network. Total cold-load: 30–60 s on broadband. Warm-load: 2–5 s.
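Before kicking off that pipeline, it's worth probing whether the browser can run it at all. A minimal capability check, sketched under the assumption that you have some server-side fallback to route to (the fallback itself isn't shown):

```ts
// Probe WebGPU availability before calling CreateMLCEngine.
async function pickBackend(): Promise<"webgpu" | "server"> {
  if (!("gpu" in navigator)) return "server"        // e.g. Firefox stable, older Safari
  const adapter = await navigator.gpu.requestAdapter()
  if (!adapter) return "server"                     // blocklisted or software-only GPU
  // Larger models want generous storage-buffer limits; inspect them before picking a model.
  console.log("maxStorageBufferBindingSize:", adapter.limits.maxStorageBufferBindingSize)
  return "webgpu"
}
```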
Quantization in the browser
The MLC compiler ships q4f16_1 as the default: 4-bit weights + FP16 activations. This is roughly equivalent to AWQ — group-wise asymmetric quantization with FP16 zero-points.
The size math:
| Format | Bytes/param | 1B model (Llama-3.2-1B) | 7B-class model |
|---|---|---|---|
| FP16 (full) | 2 | 2.4 GB | ~14 GB |
| Q4F16 (MLC default) | ~0.5 | 750 MB | ~4 GB |
| Q4F16 + KV at FP16 | ~0.5 | ~1.0 GB working | ~5.5 GB working |
7B at 4 bits just fits on a phone with 6 GB usable memory; 1B is comfortable on anything.
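The ~0.5 bytes/param figure is 4-bit weights plus a small overhead of FP16 scales. A back-of-envelope version of the table, assuming a group size of 32 (MLC's exact on-disk layout differs slightly):

```ts
// Rough q4f16 size estimate: 0.5 bytes/param for weights plus one f16 scale per group.
function q4SizeGB(params: number, groupSize = 32): number {
  const weightBytes = params * 0.5
  const scaleBytes = (params / groupSize) * 2
  return (weightBytes + scaleBytes) / 1e9
}

console.log(q4SizeGB(1.24e9).toFixed(2)) // ≈ 0.70 GB for Llama-3.2-1B, before embeddings/metadata
console.log(q4SizeGB(7e9).toFixed(2))    // ≈ 3.94 GB for a 7B-class model
```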
WebGPU’s actual performance
The honest numbers, from MLC-LLM’s own benchmarks on Llama-3.2-1B Q4F16:
| Device | Browser | Tok/s |
|---|---|---|
| M2 MacBook Air | Chrome 130 | 70–80 |
| M2 MacBook Air | Safari 18 | 50–60 |
| iPhone 15 Pro | Safari 18 | 18–22 |
| Pixel 8 | Chrome 130 | 14–18 |
| $300 Chromebook (Intel UHD) | Chrome 130 | 4–6 |
For 7B Q4F16: roughly half the tok/s, but you also need the device to fit it in memory.
The throttle is GPU memory bandwidth, not compute. WebGPU adds ~20% overhead vs. native MLC (which uses Metal/Vulkan directly), mostly from API cost plus the fact that some optimizations native compilers rely on (e.g. CUDA-style tiled-matmul intrinsics) aren’t expressible in WGSL.
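You can sanity-check the bandwidth-bound claim with arithmetic: each generated token streams essentially all of the weights through GPU memory once, so decode throughput is capped near bandwidth divided by model size. The numbers below are illustrative, using roughly 100 GB/s for the base M2's unified memory:

```ts
// Decode-throughput ceiling from memory bandwidth alone (ignores KV cache and activations).
const modelBytes = 0.75e9   // Llama-3.2-1B q4f16_1 resident weight bytes
const bandwidth = 100e9     // base M2 unified-memory bandwidth, bytes/s
console.log((bandwidth / modelBytes).toFixed(0), "tok/s ceiling")  // ≈ 133 tok/s
// Observed 70–80 tok/s in Chrome is ~55–60% of that ceiling, consistent with a
// bandwidth-bound workload plus the ~20% WebGPU overhead described above.
```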
WebNN — the next-generation alternative
WebGPU asks you to write shaders. WebNN asks you to describe the model as ops:
const context = await navigator.ml.createContext()  // the browser picks CPU / GPU / NPU
const builder = new MLGraphBuilder(context)
const a = builder.input('a', { dataType: 'float32', shape: [1, 768] })
// `weights` is an assumed Float32Array of length 768 * 768.
const b = builder.constant({ dataType: 'float32', shape: [768, 768] }, weights)
const c = builder.matmul(a, b)
const graph = await builder.build({ c })

The browser then routes the graph to the platform’s accelerator: ANE on macOS, DirectML→NPU on Windows, eventually NNAPI on Android. WebNN runs 2–4× faster than WebGPU on devices with a real NPU because it bypasses the GPU entirely.
The catch: WebNN is shipping in Chrome 130+ on Windows + macOS as of 2026, but model coverage is still small (no Llama yet — only the basic CV/CNN models). The MLC team has stated they’ll add a WebNN backend once the op coverage stabilizes. Watch this space.
Choosing between WebLLM, transformers.js, and WebNN
| If you want… | Use |
|---|---|
| A chatbot UI with Llama / Phi / Mistral / Gemma | WebLLM |
| Whisper, BERT, CLIP, segmentation, embeddings | transformers.js |
| The fastest LLM inference on a 2026 device with an NPU | WebNN (when the model you want is supported; otherwise WebLLM) |
| The smallest possible bundle for a one-off task | transformers.js with the WASM-only backend (no WebGPU) |
WebLLM is the right default for new chat-style apps. transformers.js is the right default for everything else.
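For the non-chat rows in that table, the transformers.js side looks roughly like this; the model name and options are illustrative, so check the transformers.js docs for the current API:

```ts
import { pipeline } from "@huggingface/transformers"

// Feature extraction (embeddings) rather than chat.
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", {
  device: "webgpu",   // request the WebGPU backend; omit for the WASM default
})
const out = await embed("WebGPU runs the same WGSL kernel everywhere", {
  pooling: "mean",
  normalize: true,
})
console.log(out.dims) // e.g. [1, 384]
```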
Run it in your browser
A practical demo: simulate WebLLM’s IndexedDB cache behavior. The browser stores model shards keyed by URL; subsequent loads skip the network entirely.
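Here is a minimal stand-in for that behavior, with a database name and store name of our own choosing (WebLLM's real cache layout differs):

```ts
// Minimal shard cache keyed by URL.
function openShardDB(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("model-shard-cache", 1)
    req.onupgradeneeded = () => req.result.createObjectStore("shards")
    req.onsuccess = () => resolve(req.result)
    req.onerror = () => reject(req.error)
  })
}

async function fetchShardCached(url: string): Promise<ArrayBuffer> {
  const db = await openShardDB()
  // Try the cache first.
  const cached = await new Promise<ArrayBuffer | undefined>((resolve, reject) => {
    const req = db.transaction("shards").objectStore("shards").get(url)
    req.onsuccess = () => resolve(req.result)
    req.onerror = () => reject(req.error)
  })
  if (cached) return cached // warm load: no network at all
  // Cold load: fetch over the network, then persist keyed by URL.
  const buf = await (await fetch(url)).arrayBuffer()
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction("shards", "readwrite")
    tx.objectStore("shards").put(buf, url)
    tx.oncomplete = () => resolve()
    tx.onerror = () => reject(tx.error)
  })
  return buf
}
```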
The headline insight from the math: cold-load on a phone over LTE is unusable for 7B (~10 minutes). 1B over broadband is fine (~2 minutes). For most apps, target 1B–3B and accept the cold-load cost on first visit; warm-load is always fast.
Key takeaways
- WebGPU is the W3C-standard universal GPU API — the same WGSL kernel runs on Metal, D3D12, and Vulkan. Stable in Chrome, Edge, and Safari; Firefox still keeps it behind a flag.
- WebLLM (@mlc-ai/web-llm) is the production-ready way to ship Llama / Phi / Mistral / Gemma to a browser tab. OpenAI-compatible API.
- transformers.js covers everything else (Whisper, CLIP, BERT, segmentation) but is slower for LLMs.
- The cold-load cost is real — 30–60 s for 1B on broadband — but warm-load is 2–5 s and IndexedDB-cached.
- WebNN is the future for LLM inference on devices with NPUs (2–4× faster than WebGPU), but op coverage in 2026 is still limited.
- A static deploy on GitHub Pages can serve a working LLM to a million users for $0/month.
Go deeper
- [Docs] WebLLM Docs. API reference, prebuilt model list, browser compat matrix.
- [Docs] WebGPU Specification. The W3C standard. Skim §2 (concepts) and §13 (WGSL).
- [Docs] transformers.js Docs. The model zoo + pipelines; the WASM/WebGPU runtime story.
- [Paper] MLC-LLM: Universal LLM Deployment Engine With ML Compilation. The compiler stack behind WebLLM — TVM Unity, kernel generation, the WebGPU backend.
- [Docs] WebNN Specification. The next-generation NN API. Skim the op list to see what models are reachable.
- [Blog] GPU-Accelerated LLM on a $50 Orange Pi via WebGPU. Concrete numbers from the smallest device class. Hits the universality theme directly.
- [Repo] mlc-ai/web-llm. The reference implementation; read `src/engine.ts` for the full lifecycle.
- [Video] WebGPU Inference: From Shader to Token. The technical talk for the algorithmic-deep-dive crowd.