WebGPU & WebLLM

In a cloud-serving stack, “ship to every device on the internet” is a punchline, not a goal. You ship to a Linux container in someone’s data center; the user’s device runs a thin client that talks HTTPS. The runtime, the model, the GPU all live behind your API.

The browser is the only target where “ship to every device” is literal. A static-hosted website on GitHub Pages can deliver a 1B LLM that runs entirely client-side, in any modern browser, on any device with a GPU — Mac, Windows, Linux, iPhone, Android, Chromebook. No server, no API key, no rate limit, no privacy concerns. The model lives in IndexedDB; the app keeps working in airplane mode.

The trick that makes this work is WebGPU — a W3C standard that exposes the device’s GPU through a common shader language called WGSL. WebGPU compiles the same WGSL kernel to Metal on Apple, D3D12 on Windows, Vulkan on Linux/Android. WebLLM (the @mlc-ai/web-llm library) builds on top of that: PyTorch checkpoints get compiled (offline, by the MLC team) into WGSL kernel bundles + quantized weights, and the browser tab JIT-compiles the WGSL on first run. Python is only the SDK that produces the kernel bundle; what runs on the device is TypeScript orchestration calling WGSL on whichever GPU stack the OS provides.

That’s the strange and useful shape of this stack: the production runtime isn’t a binary you ship — it’s a webpage that loads ~750 MB of weights into IndexedDB, compiles a hundred shaders, and starts streaming tokens. The compute happens on the user’s GPU, costs you nothing, and scales linearly with users for free.

TL;DR

  • WebGPU is the W3C-standard browser GPU API. Stable in Chrome 113+, Edge 113+, Safari 18+; Firefox is behind a flag. It compiles WGSL (a Rust-flavored shader language) to whatever the device’s GPU stack speaks: Metal on Apple, D3D12 on Windows, Vulkan on everything else.
  • WebLLM (@mlc-ai/web-llm) compiles transformers (Llama, Phi, Mistral, Gemma) to WGSL via the MLC compiler stack. The browser tab becomes the runtime. ~750 MB weights for 1B Q4F16; ~4 GB for 7B Q4F16; cached in IndexedDB.
  • transformers.js is the alternative — runs ONNX models via ONNX Runtime Web (WebGPU + WASM). Lower performance than WebLLM for LLMs, but a richer model zoo (BERT, CLIP, Whisper, segmentation).
  • WebNN is the upcoming W3C neural network API (not GPU shaders, NN ops). Currently shipping in Chrome 130+ on Windows + macOS. Targets the platform NPU (ANE on macOS, DirectML on Windows). Faster + more power-efficient than WebGPU for LLMs but the model coverage is still small.
  • The killer property: a static-hosted website on GitHub Pages can deliver a 1B LLM that runs entirely client-side. No server, no API key, no rate limit, no privacy concerns.

Why this matters

Every other on-device path requires a native app. The browser is the only target where you can ship to “every device that exists on the internet” with a single static deploy. The 2026 inflection: WebGPU on iOS Safari shipped in 2024, on Android Chrome in 2023, on M-series Macs forever. Suddenly every shipping device can run a small LLM in a tab.

This unlocks:

  • Privacy-by-default LLM apps — a code-explainer extension, a journaling app, a kid’s tutor — where prompts literally never leave the device.
  • Offline-capable PWAs — the model lives in IndexedDB; the app keeps working in airplane mode.
  • Zero-cost demos — a $0/month static site with a working LLM that scales to a million users (the compute is on their devices).
  • Embed-anywhere widgets — a doc page on Stripe.com can run an LLM-powered help widget without round-tripping to OpenAI.

The trade-off: the browser is a sandbox. You get ~50% of native performance and you pay a 5–60 second cold-load penalty. Both shrink every quarter.

Mental model

Two layers matter. The compile pipeline runs once when the model is built: PyTorch checkpoint → MLC’s TVM compiler → WGSL kernels + a metadata file. The runtime runs on each tab open: download the kernels + weights (or hit the IndexedDB cache), JIT-compile the WGSL on the device’s GPU, run inference. The first compile pass is where Chrome freezes for 3–5 s the first time someone opens your app.

What’s actually in @mlc-ai/web-llm

The user surface is TypeScript — that’s what real apps call. The kernels underneath are WGSL.

import { CreateMLCEngine } from "@mlc-ai/web-llm"

const engine = await CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  {
    initProgressCallback: (p) => {
      // p.progress is a float 0..1; p.text is "Fetching weights / Compiling…"
      console.log(p.text, p.progress)
    },
  }
)

const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Why is the sky blue?" }],
  stream: true,
})

for await (const chunk of response) {
  const delta = chunk.choices[0].delta.content
  process.stdout.write(delta || "")
}

That’s the whole API. It’s deliberately OpenAI-compatible — chat.completions.create looks identical to openai.chat.completions.create. You can swap an OpenAI call for a local one by changing the import.
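
Concretely, the swap looks like this — a minimal sketch (the ask helper and ChatBackend type are illustrative, not part of either SDK):

// Both clients expose the same call shape; only the object you call changes.
type ChatBackend = {
  chat: { completions: { create: (req: any) => Promise<any> } }
}

async function ask(llm: ChatBackend, prompt: string) {
  return llm.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
  })
}

// await ask(engine, "Why is the sky blue?")   // local, in-tab, free
// await ask(openai, "Why is the sky blue?")   // hosted API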

A glance at the WGSL kernel underneath

When the engine “compiles shaders” on first run, this is what’s getting compiled — a WGSL compute shader that looks a lot like CUDA, but in W3C-standard syntax:

// A simplified Q4 dequant + matmul row in WGSL.
enable f16;

const D : u32 = 2048u; // inner dimension; baked in per layer by the compiler

@group(0) @binding(0) var<storage, read> weights_q4 : array<u32>; // packed 4-bit weights
@group(0) @binding(1) var<storage, read> scales : array<f16>;     // per-group scale
@group(0) @binding(2) var<storage, read> input : array<f16>;      // activations
@group(0) @binding(3) var<storage, read_write> out : array<f16>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let row = gid.x;
  var acc : f32 = 0.0;
  for (var k : u32 = 0u; k < D; k = k + 1u) {
    // Each u32 packs eight 4-bit weights; extract the nibble for this k.
    let packed = weights_q4[(row * D + k) >> 3u];
    let w_q = (packed >> ((k & 7u) * 4u)) & 0xFu;
    let w = (f32(w_q) - 8.0) * f32(scales[(row * D + k) / 32u]);
    acc = acc + w * f32(input[k]);
  }
  out[row] = f16(acc);
}

That kernel is the same shape you’d write in CUDA C++ — workgroups, shared bindings, an inner-product loop — except the device that runs it is whatever GPU the user happens to have.

What CreateMLCEngine actually does on first run

  1. Fetches mlc-chat-config.json (model metadata: vocab size, context length, quantization).
  2. Fetches *.bin shards (the quantized weights — ~750 MB for 1B Q4F16, in 50 MB shards). Streams to IndexedDB as they download.
  3. Fetches *.wasm (the model’s WGSL kernels packaged as a WebAssembly module).
  4. Initializes a WebGPU device: navigator.gpu.requestAdapter() → adapter.requestDevice().
  5. The freeze: compiles every WGSL shader. ~50–200 shaders per model. Browser shows the tab as unresponsive for 3–5 s on first load.
  6. Loads weights from IndexedDB into GPU buffers.
  7. Returns the engine.

On subsequent loads (same model_id + same browser), steps 1–3 hit IndexedDB instead of network. Total cold-load: 30–60 s on broadband. Warm-load: 2–5 s.
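
In practice you gate the first load behind a capability check and wire the progress callback into UI — a minimal sketch (the progress-bar element is illustrative; navigator.gpu and initProgressCallback are the real surfaces):

// Bail out early on browsers without WebGPU (e.g. Firefox without the flag).
if (!("gpu" in navigator)) {
  throw new Error("WebGPU unavailable — fall back to a hosted API")
}

const bar = document.querySelector("progress")! // illustrative UI hook

const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (p) => {
    bar.value = p.progress // 0..1 across fetch, cache, and shader compile
    bar.title = p.text     // "Fetching weights…", "Compiling…"
  },
})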

Quantization in the browser

The MLC compiler ships q4f16_1 as the default: 4-bit weights + FP16 activations. This is roughly equivalent to AWQ — group-wise asymmetric quantization with FP16 zero-points.

The size math:

Param                | Bytes/param | Llama-3.2-1B size | Llama-3.2-7B size
FP16 (full)          | 2           | 2.4 GB            | ~14 GB
Q4F16 (MLC default)  | ~0.5        | 750 MB            | ~4 GB
Q4F16 + KV at FP16   | ~0.5        | ~1.0 GB working   | ~5.5 GB working

7B at 4 bits just fits on a phone with 6 GB usable memory; 1B is comfortable on anything.
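
The table is just params × bytes/param plus a working-set term for the KV cache — a sketch of the arithmetic (the KV-cache figures are rough assumptions consistent with the table):

// Working set ≈ quantized weights + FP16 KV cache.
function workingSetGB(paramsBillion: number, bytesPerParam: number, kvGB: number): number {
  return paramsBillion * bytesPerParam + kvGB
}

console.log(workingSetGB(1.2, 0.5, 0.4).toFixed(1)) // ≈1.0 GB — 1B Q4F16
console.log(workingSetGB(7.0, 0.5, 2.0).toFixed(1)) // ≈5.5 GB — 7B Q4F16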

WebGPU’s actual performance

The honest numbers, from MLC-LLM’s own benchmarks on Llama-3.2-1B Q4F16:

Device                      | Browser    | Tok/s
M2 MacBook Air              | Chrome 130 | 70–80
M2 MacBook Air              | Safari 18  | 50–60
iPhone 15 Pro               | Safari 18  | 18–22
Pixel 8                     | Chrome 130 | 14–18
$300 Chromebook (Intel UHD) | Chrome 130 | 4–6

For 7B Q4F16: roughly half the tok/s, but you also need the device to fit it in memory.

The throttle is GPU memory bandwidth, not compute. WebGPU adds ~20% overhead vs. native MLC (which uses Metal/Vulkan directly), mostly from API-call cost and from the fact that some optimizations native compilers rely on (e.g. CUDA’s tiled-matmul intrinsics) aren’t expressible in WGSL.
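
Bandwidth-bound decode has a simple ceiling: every generated token streams the full weight set through the GPU, so tok/s ≲ bandwidth ÷ model bytes. A sketch of that roofline (the ~100 GB/s figure is an assumption, in the ballpark of an M2):

// Roofline for a memory-bandwidth-bound decoder.
function rooflineTokPerSec(modelGB: number, bandwidthGBps: number): number {
  return bandwidthGBps / modelGB
}

console.log(rooflineTokPerSec(0.75, 100).toFixed(0)) // ≈133 tok/s ceiling for 1B Q4F16
// Observed: 70–80 tok/s in Chrome — the gap is API overhead plus kernel efficiency.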

WebNN — the next-generation alternative

WebGPU asks you to write shaders. WebNN asks you to describe the model as ops:

const builder = new MLGraphBuilder(context)
const a = builder.input('a', { dataType: 'float32', shape: [1, 768] })
const b = builder.constant({ dataType: 'float32', shape: [768, 768] }, weights)
const c = builder.matmul(a, b)
const graph = await builder.build({ c })

The browser then routes the graph to the platform’s accelerator: ANE on macOS, DirectML→NPU on Windows, eventually NNAPI on Android. WebNN runs 2–4× faster than WebGPU on devices with a real NPU because it bypasses the GPU entirely.
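
Where the context in that snippet comes from — a sketch per the WebNN draft spec (the API surface is still moving, so treat the option names as provisional):

// Request an NPU-backed context; the browser may fall back to GPU or CPU.
const context = await navigator.ml.createContext({ deviceType: "npu" })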

The catch: WebNN is shipping in Chrome 130+ on Windows + macOS as of 2026, but model coverage is still small (no Llama yet — only the basic CV/CNN models). The MLC team has stated they’ll add a WebNN backend once the op coverage stabilizes. Watch this space.

Choosing between WebLLM, transformers.js, and WebNN

If you want…                                           | Use
A chatbot UI with Llama / Phi / Mistral / Gemma        | WebLLM
Whisper, BERT, CLIP, segmentation, embeddings          | transformers.js
The fastest LLM inference on a 2026 device with an NPU | WebNN (when the model you want is supported; otherwise WebLLM)
The smallest possible bundle for a one-off task        | transformers.js with the WASM-only backend (no WebGPU)

WebLLM is the right default for new chat-style apps. transformers.js is the right default for everything else.
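
For contrast, the transformers.js side of that table — a minimal sketch (the model id is one of the public ONNX conversions; the device option is per the library’s v3 docs):

import { pipeline } from "@huggingface/transformers"

// In-tab embeddings: same static-deploy story, different model zoo.
const embed = await pipeline(
  "feature-extraction",
  "Xenova/all-MiniLM-L6-v2",
  { device: "webgpu" } // or "wasm" on browsers without WebGPU
)
const vec = await embed("WebGPU in a tab", { pooling: "mean", normalize: true })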

Run it in your browser

A practical demo: simulate WebLLM’s IndexedDB cache behavior. The browser stores model shards keyed by URL; subsequent loads skip the network entirely.

The embedded demo (editable Python; Ctrl+Enter to run) simulates the cache-hit math: how does cold-load vs. warm-load look as you bump the model size and the network speed?

The headline insight from the math: cold-load on a phone over LTE is unusable for 7B (~10 minutes). 1B over broadband is fine (~2 minutes). For most apps, target 1B–3B and accept the cold-load cost on first visit; warm-load is always fast.
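
The same arithmetic as a static sketch (the ~50 Mbit/s effective-throughput figure is an assumption for both broadband and good LTE):

// Cold-load ≈ download time + first-run shader compile; warm-load skips the download.
function coldLoadSeconds(modelMB: number, mbitPerSec: number, compileSec = 5): number {
  return (modelMB * 8) / mbitPerSec + compileSec
}

console.log(coldLoadSeconds(750, 50))  // 1B on broadband: ≈125 s (~2 min)
console.log(coldLoadSeconds(4000, 50)) // 7B on LTE: ≈645 s (~11 min)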

Quick check
A user opens your WebLLM-powered site for the second time on the same phone, same browser. They get a 3-second freeze right when they expect tokens to start. What is happening?

Key takeaways

  1. WebGPU is the W3C-standard universal GPU — same WGSL kernel runs on Metal, D3D12, and Vulkan. Stable in every modern browser.
  2. WebLLM (@mlc-ai/web-llm) is the production-ready way to ship Llama / Phi / Mistral / Gemma to a browser tab. OpenAI-compatible API.
  3. transformers.js covers everything else (Whisper, CLIP, BERT, segmentation) but is slower for LLMs.
  4. The cold-load cost is real — 30–60 s for 1B on broadband — but warm-load is 2–5 s and IndexedDB-cached.
  5. WebNN is the future for LLM inference on devices with NPUs (2–4× faster than WebGPU), but op coverage in 2026 is still limited.
  6. A static deploy on GitHub Pages can serve a working LLM to a million users for $0/month.

Go deeper
