WebGPU & WebLLM

In a cloud-serving stack, “ship to every device on the internet” is a punchline, not a goal. You ship to a Linux container in someone’s data center; the user’s device runs a thin client that talks HTTPS. The runtime, the model, the GPU all live behind your API.

The browser is the only target where “ship to every device” is literal. A static-hosted website on GitHub Pages can deliver a 1B LLM that runs entirely client-side, in any modern browser, on any device with a GPU — Mac, Windows, Linux, iPhone, Android, Chromebook. No server, no API key, no rate limit, no privacy concerns. The model lives in IndexedDB; the app keeps working in airplane mode.

The trick that makes this work is WebGPU — a W3C standard that exposes the device’s GPU through a common shader language called WGSL. WebGPU compiles the same WGSL kernel to Metal on Apple, D3D12 on Windows, Vulkan on Linux/Android. WebLLM (the @mlc-ai/web-llm library) builds on top of that: PyTorch checkpoints get compiled (offline, by the MLC team) into WGSL kernel bundles + quantized weights, and the browser tab JIT-compiles the WGSL on first run. Python is only the SDK that produces the kernel bundle; what runs on the device is TypeScript orchestration calling WGSL on whichever GPU stack the OS provides.

That’s the strange and useful shape of this stack: the production runtime isn’t a binary you ship — it’s a webpage that loads ~750 MB of weights into IndexedDB, compiles a hundred shaders, and starts streaming tokens. The compute happens on the user’s GPU, costs you nothing, and scales linearly with users for free.

TL;DR

  • WebGPU is the W3C-standard browser GPU API. Stable in Chrome 113+, Edge 113+, Safari 18+; Firefox is behind a flag. It compiles WGSL (a Rust-flavored shader language) to whatever the device’s GPU stack speaks: Metal on Apple, D3D12 on Windows, Vulkan on everything else.
  • WebLLM (@mlc-ai/web-llm) compiles transformers (Llama, Phi, Mistral, Gemma) to WGSL via the MLC compiler stack. The browser tab becomes the runtime. ~750 MB weights for 1B Q4F16; ~4 GB for 7B Q4F16; cached in IndexedDB.
  • transformers.js is the alternative — runs ONNX models via ONNX Runtime Web (WebGPU + WASM). Lower performance than WebLLM for LLMs, but a richer model zoo (BERT, CLIP, Whisper, segmentation).
  • WebNN is the upcoming W3C neural network API (not GPU shaders, NN ops). Currently shipping in Chrome 130+ on Windows + macOS. Targets the platform NPU (ANE on macOS, DirectML on Windows). Faster + more power-efficient than WebGPU for LLMs but the model coverage is still small.
  • The killer property: a static-hosted website on GitHub Pages can deliver a 1B LLM that runs entirely client-side. No server, no API key, no rate limit, no privacy concerns.

Why this matters

Every other on-device path requires a native app. The browser is the only target where you can ship to “every device that exists on the internet” with a single static deploy. The 2026 inflection: WebGPU on iOS Safari shipped in 2024, on Android Chrome in 2023, on M-series Macs forever. Suddenly every shipping device can run a small LLM in a tab.

This unlocks:

  • Privacy-by-default LLM apps — a code-explainer extension, a journaling app, a kid’s tutor — where prompts literally never leave the device.
  • Offline-capable PWAs — the model lives in IndexedDB; the app keeps working in airplane mode.
  • Zero-cost demos — a $0/month static site with a working LLM that scales to a million users (the compute is on their devices).
  • Embed-anywhere widgets — a doc page on Stripe.com can run an LLM-powered help widget without round-tripping to OpenAI.

The trade-off: the browser is a sandbox. You get ~50% of native performance and you pay a 5–60 second cold-load penalty. Both shrink every quarter.

Mental model

Two layers matter. The compile pipeline runs once when the model is built: PyTorch checkpoint → MLC’s TVM compiler → WGSL kernels + a metadata file. The runtime runs on each tab open: download the kernels + weights (or hit the IndexedDB cache), JIT-compile the WGSL on the device’s GPU, run inference. The first compile pass is where Chrome freezes for 3–5 s the first time someone opens your app.

What’s actually in @mlc-ai/web-llm

The user surface is TypeScript — that’s what real apps call. The kernels underneath are WGSL.

import { CreateMLCEngine } from "@mlc-ai/web-llm"

const engine = await CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  {
    initProgressCallback: (p) => {
      // p.progress is a float 0..1; p.text is "Fetching weights / Compiling…"
      console.log(p.text, p.progress)
    },
  }
)

const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Why is the sky blue?" }],
  stream: true,
})

for await (const chunk of response) {
  const delta = chunk.choices[0].delta.content
  process.stdout.write(delta || "")
}

That’s the whole API. It’s deliberately OpenAI-compatible — chat.completions.create looks identical to openai.chat.completions.create. You can swap an OpenAI call for a local one by changing the import.
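
Concretely, the swap looks like this — a minimal sketch (the ask helper and ChatBackend type are illustrative, not part of either SDK):

// Both clients expose the same call shape; only the object you call changes.
type ChatBackend = {
  chat: { completions: { create: (req: any) => Promise<any> } }
}

async function ask(llm: ChatBackend, prompt: string) {
  return llm.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
  })
}

// await ask(engine, "Why is the sky blue?")   // local, in-tab, free
// await ask(openai, "Why is the sky blue?")   // hosted API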

A glance at the WGSL kernel underneath

When the engine “compiles shaders” on first run, this is what’s getting compiled — a WGSL compute shader that looks a lot like CUDA, but in W3C-standard syntax:

// A simplified Q4 dequant + matmul row in WGSL.
enable f16;

const D : u32 = 2048u; // inner dimension; baked in per layer by the compiler

@group(0) @binding(0) var<storage, read> weights_q4 : array<u32>; // packed 4-bit weights
@group(0) @binding(1) var<storage, read> scales : array<f16>;     // per-group scale
@group(0) @binding(2) var<storage, read> input : array<f16>;      // activations
@group(0) @binding(3) var<storage, read_write> out : array<f16>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let row = gid.x;
  var acc : f32 = 0.0;
  for (var k : u32 = 0u; k < D; k = k + 1u) {
    // Each u32 packs eight 4-bit weights; extract the nibble for this k.
    let packed = weights_q4[(row * D + k) >> 3u];
    let w_q = (packed >> ((k & 7u) * 4u)) & 0xFu;
    let w = (f32(w_q) - 8.0) * f32(scales[(row * D + k) / 32u]);
    acc = acc + w * f32(input[k]);
  }
  out[row] = f16(acc);
}

That kernel is the same shape you’d write in CUDA C++ — workgroups, shared bindings, an inner-product loop — except the device that runs it is whatever GPU the user happens to have.

What CreateMLCEngine actually does on first run

  1. Fetches mlc-chat-config.json (model metadata: vocab size, context length, quantization).
  2. Fetches *.bin shards (the quantized weights — ~750 MB for 1B Q4F16, in 50 MB shards). Streams to IndexedDB as they download.
  3. Fetches *.wasm (the model’s WGSL kernels packaged as a WebAssembly module).
  4. Initializes a WebGPU device: navigator.gpu.requestAdapter() → adapter.requestDevice().
  5. The freeze: compiles every WGSL shader. ~50–200 shaders per model. Browser shows the tab as unresponsive for 3–5 s on first load.
  6. Loads weights from IndexedDB into GPU buffers.
  7. Returns the engine.

On subsequent loads (same model_id + same browser), steps 1–3 hit IndexedDB instead of network. Total cold-load: 30–60 s on broadband. Warm-load: 2–5 s.
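
In practice you gate the first load behind a capability check and wire the progress callback into UI — a minimal sketch (the progress-bar element is illustrative; navigator.gpu and initProgressCallback are the real surfaces):

// Bail out early on browsers without WebGPU (e.g. Firefox without the flag).
if (!("gpu" in navigator)) {
  throw new Error("WebGPU unavailable — fall back to a hosted API")
}

const bar = document.querySelector("progress")! // illustrative UI hook

const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (p) => {
    bar.value = p.progress // 0..1 across fetch, cache, and shader compile
    bar.title = p.text     // "Fetching weights…", "Compiling…"
  },
})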

Quantization in the browser

The MLC compiler ships q4f16_1 as the default: 4-bit weights + FP16 activations. This is roughly equivalent to AWQ — group-wise asymmetric quantization with FP16 zero-points.

The size math:

Param                | Bytes/param | Llama-3.2-1B size | Llama-3.2-7B size
FP16 (full)          | 2           | 2.4 GB            | ~14 GB
Q4F16 (MLC default)  | ~0.5        | 750 MB            | ~4 GB
Q4F16 + KV at FP16   | ~0.5        | ~1.0 GB working   | ~5.5 GB working

7B at 4 bits just fits on a phone with 6 GB usable memory; 1B is comfortable on anything.
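
The table is just params × bytes/param plus a working-set term for the KV cache — a sketch of the arithmetic (the KV-cache figures are rough assumptions consistent with the table):

// Working set ≈ quantized weights + FP16 KV cache.
function workingSetGB(paramsBillion: number, bytesPerParam: number, kvGB: number): number {
  return paramsBillion * bytesPerParam + kvGB
}

console.log(workingSetGB(1.2, 0.5, 0.4).toFixed(1)) // ≈1.0 GB — 1B Q4F16
console.log(workingSetGB(7.0, 0.5, 2.0).toFixed(1)) // ≈5.5 GB — 7B Q4F16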

WebGPU’s actual performance

The honest numbers, from MLC-LLM’s own benchmarks on Llama-3.2-1B Q4F16:

Device                      | Browser    | Tok/s
M2 MacBook Air              | Chrome 130 | 70–80
M2 MacBook Air              | Safari 18  | 50–60
iPhone 15 Pro               | Safari 18  | 18–22
Pixel 8                     | Chrome 130 | 14–18
$300 Chromebook (Intel UHD) | Chrome 130 | 4–6

For 7B Q4F16: roughly half the tok/s, but you also need the device to fit it in memory.

The throttle is GPU memory bandwidth, not compute. WebGPU adds ~20% overhead vs. native MLC (which uses Metal/Vulkan directly), mostly from API-call cost and from the fact that some optimizations native compilers rely on (e.g. CUDA’s tiled-matmul intrinsics) aren’t expressible in WGSL.
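
Bandwidth-bound decode has a simple ceiling: every generated token streams the full weight set through the GPU, so tok/s ≲ bandwidth ÷ model bytes. A sketch of that roofline (the ~100 GB/s figure is an assumption, in the ballpark of an M2):

// Roofline for a memory-bandwidth-bound decoder.
function rooflineTokPerSec(modelGB: number, bandwidthGBps: number): number {
  return bandwidthGBps / modelGB
}

console.log(rooflineTokPerSec(0.75, 100).toFixed(0)) // ≈133 tok/s ceiling for 1B Q4F16
// Observed: 70–80 tok/s in Chrome — the gap is API overhead plus kernel efficiency.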

WebNN — the next-generation alternative

WebGPU asks you to write shaders. WebNN asks you to describe the model as ops:

const builder = new MLGraphBuilder(context)
const a = builder.input('a', { dataType: 'float32', shape: [1, 768] })
const b = builder.constant({ dataType: 'float32', shape: [768, 768] }, weights)
const c = builder.matmul(a, b)
const graph = await builder.build({ c })

The browser then routes the graph to the platform’s accelerator: ANE on macOS, DirectML→NPU on Windows, eventually NNAPI on Android. WebNN runs 2–4× faster than WebGPU on devices with a real NPU because it bypasses the GPU entirely.
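
Where the context in that snippet comes from — a sketch per the WebNN draft spec (the API surface is still moving, so treat the option names as provisional):

// Request an NPU-backed context; the browser may fall back to GPU or CPU.
const context = await navigator.ml.createContext({ deviceType: "npu" })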

The catch: WebNN is shipping in Chrome 130+ on Windows + macOS as of 2026, but model coverage is still small (no Llama yet — only the basic CV/CNN models). The MLC team has stated they’ll add a WebNN backend once the op coverage stabilizes. Watch this space.

Choosing between WebLLM, transformers.js, and WebNN

If you want…                                           | Use
A chatbot UI with Llama / Phi / Mistral / Gemma        | WebLLM
Whisper, BERT, CLIP, segmentation, embeddings          | transformers.js
The fastest LLM inference on a 2026 device with an NPU | WebNN (when the model you want is supported; otherwise WebLLM)
The smallest possible bundle for a one-off task        | transformers.js with the WASM-only backend (no WebGPU)

WebLLM is the right default for new chat-style apps. transformers.js is the right default for everything else.
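
For contrast, the transformers.js side of that table — a minimal sketch (the model id is one of the public ONNX conversions; the device option is per the library’s v3 docs):

import { pipeline } from "@huggingface/transformers"

// In-tab embeddings: same static-deploy story, different model zoo.
const embed = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2",
  { device: "webgpu" } // or "wasm" on browsers without WebGPU
)
const vec = await embed("WebGPU in a tab", { pooling: "mean", normalize: true })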

Run it in your browser

A practical demo: simulate WebLLM’s IndexedDB cache behavior. The browser stores model shards keyed by URL; subsequent loads skip the network entirely.

The embedded demo (editable Python; Ctrl+Enter to run) simulates the cache-hit math: how does cold-load vs. warm-load look as you bump the model size and the network speed?

The headline insight from the math: cold-load on a phone over LTE is unusable for 7B (~10 minutes). 1B over broadband is fine (~2 minutes). For most apps, target 1B–3B and accept the cold-load cost on first visit; warm-load is always fast.
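
The same arithmetic as a static sketch (the ~50 Mbit/s effective-throughput figure is an assumption for both broadband and good LTE):

// Cold-load ≈ download time + first-run shader compile; warm-load skips the download.
function coldLoadSeconds(modelMB: number, mbitPerSec: number, compileSec = 5): number {
  return (modelMB * 8) / mbitPerSec + compileSec
}

console.log(coldLoadSeconds(750, 50))  // 1B on broadband: ≈125 s (~2 min)
console.log(coldLoadSeconds(4000, 50)) // 7B on LTE: ≈645 s (~11 min)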

Quick check
A user opens your WebLLM-powered site for the second time on the same phone, same browser. They get a 3-second freeze right when they expect tokens to start. What is happening?

Key takeaways

  1. WebGPU is the W3C-standard universal GPU — same WGSL kernel runs on Metal, D3D12, and Vulkan. Stable in every modern browser.
  2. WebLLM (@mlc-ai/web-llm) is the production-ready way to ship Llama / Phi / Mistral / Gemma to a browser tab. OpenAI-compatible API.
  3. transformers.js covers everything else (Whisper, CLIP, BERT, segmentation) but is slower for LLMs.
  4. The cold-load cost is real — 30–60 s for 1B on broadband — but warm-load is 2–5 s and IndexedDB-cached.
  5. WebNN is the future for LLM inference on devices with NPUs (2–4× faster than WebGPU), but op coverage in 2026 is still limited.
  6. A static deploy on GitHub Pages can serve a working LLM to a million users for $0/month.

Go deeper
