01. Verify WebGPU works on your devices (20 min)
Open chrome://gpu in Chrome (or enable WebGPU in Safari: Develop → Feature Flags on macOS, Settings → Safari → Advanced → Feature Flags on iOS). Confirm "WebGPU: Hardware accelerated" on the desktop and on every phone you plan to test; an in-page check is sketched after this step.
Checkpoint: WebGPU enabled on at least: a desktop browser, an iPhone, an Android phone.
Watch out: Safari < 17.4 does not have WebGPU; Firefox stable does not yet either. Document the supported-browser matrix in your README so users don't hit a blank page.
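For an in-page sanity check on top of chrome://gpu, here is a minimal feature-detection sketch; `navigator.gpu` and `requestAdapter()` are the standard WebGPU entry points, while the `status` element id and the `any` cast (used to avoid pulling in `@webgpu/types`) are illustration-only assumptions:

```ts
// Minimal WebGPU feature detection, run before offering the model download.
// Assumes an element with id="status" exists in the page (placeholder).
async function checkWebGPU(): Promise<boolean> {
  const status = document.getElementById("status")!;
  const gpu = (navigator as any).gpu; // cast keeps the snippet dependency-free
  if (!gpu) {
    status.textContent = "WebGPU is not available in this browser.";
    return false;
  }
  // requestAdapter() can still resolve to null (e.g. blocklisted GPU), so check it too.
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    status.textContent = "WebGPU is present, but no GPU adapter was returned.";
    return false;
  }
  status.textContent = "WebGPU looks good; safe to offer the model download.";
  return true;
}

checkWebGPU();
```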
02. Quantize Llama-3.2-1B to MLC q4f16 (60 min)
Run `mlc_llm convert_weight` and `mlc_llm gen_config` on Llama-3.2-1B-Instruct (a command sketch follows this step). Output: a folder with `params_shard_*.bin` files plus `mlc-chat-config.json`, totalling ~750 MB.
Checkpoint: You can `ls` the converted folder and see N shards adding to ~750 MB.
Watch out: MLC quantization configs are model-family specific: copy one from MLC's prebuilt configs first, then tweak. Custom quant configs without family-specific calibration drop quality measurably.
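A rough command sketch: the subcommands and flags follow MLC's documented `convert_weight` / `gen_config` CLI, but the local paths, output folder name, and the `--conv-template` value are assumptions to adapt to your setup and mlc_llm version:

```bash
MODEL=./models/Llama-3.2-1B-Instruct   # local HF checkout of the model (assumed path)
OUT=./dist/Llama-3.2-1B-q4f16          # folder you will upload to R2 (assumed name)

# 1) Quantize the weights into params_shard_*.bin files
#    (q4f16_1 = 4-bit weights, fp16 activations).
mlc_llm convert_weight "$MODEL" --quantization q4f16_1 -o "$OUT"

# 2) Generate mlc-chat-config.json; check mlc_llm's template list for the exact
#    Llama 3.2 conversation template ("llama-3" here is an assumption).
mlc_llm gen_config "$MODEL" --quantization q4f16_1 --conv-template llama-3 -o "$OUT"

ls -lh "$OUT"   # expect N shards plus mlc-chat-config.json, ~750 MB total
```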
03. Host the weights on Cloudflare R2 (40 min)
Create an R2 bucket, upload the converted folder, enable public access, point a custom CNAME at it. Verify a `curl` from a different network can pull a shard.
Checkpoint: `curl https://your-bucket.r2.dev/Llama-3.2-1B-q4f16/params_shard_0.bin -I` returns 200.
Watch out: CORS is the silent killer. Set `Access-Control-Allow-Origin: *` in the bucket's CORS policy; without it the browser fetch fails after the model has already shown a "loading…" UI. A quick browser-side check is sketched below, since curl alone ignores CORS and won't catch this.
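To confirm the policy from the app's origin, here is a small check you can paste into the deployed page's console; the shard URL is the same placeholder used in the checkpoint above:

```ts
// Run from the app's origin: browsers enforce CORS per-origin, so a curl
// success proves nothing about what the fetch inside web-llm will see.
const SHARD_URL =
  "https://your-bucket.r2.dev/Llama-3.2-1B-q4f16/params_shard_0.bin"; // placeholder

async function checkCors(): Promise<void> {
  try {
    // With a missing CORS policy this fetch rejects with a TypeError even
    // though the same URL returns 200 from curl.
    const res = await fetch(SHARD_URL, { method: "HEAD" });
    console.log("CORS OK, status:", res.status);
  } catch (err) {
    console.error("Fetch blocked; check the bucket's CORS policy:", err);
  }
}

checkCors();
```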
04. Wire @mlc-ai/web-llm into a Vite app (120 min)
`npm i @mlc-ai/web-llm`. Call `CreateMLCEngine(model_id, { appConfig, initProgressCallback })` with a custom model record in `appConfig.model_list` pointing at your R2-hosted weights (a wiring sketch follows this step). Show a download progress bar; the first load is several hundred MB.
Checkpoint: You can ask the model a question on first page load and watch it stream a coherent reply, after the weights download.
Watch out: web-llm caches model weights in IndexedDB; if you ship a new quantization but reuse the same `model_id`, users get stale weights. Bump the version suffix in the model_id when the weights change.
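A minimal wiring sketch. The `model_list` record shape below matches recent web-llm releases, but field names have shifted between versions, so verify against the version you install; the R2 URLs, the model-lib WASM filename, and the `model_id` are placeholders:

```ts
import { CreateMLCEngine, type AppConfig } from "@mlc-ai/web-llm";

// Placeholders: point these at your R2 bucket and the matching model-lib WASM.
const MODEL_ID = "Llama-3.2-1B-q4f16-v1"; // bump the suffix whenever the weights change
const appConfig: AppConfig = {
  model_list: [
    {
      model_id: MODEL_ID,
      model: "https://your-bucket.r2.dev/Llama-3.2-1B-q4f16/",
      model_lib: "https://your-bucket.r2.dev/Llama-3.2-1B-q4f16/model-lib.wasm",
    },
  ],
};

const engine = await CreateMLCEngine(MODEL_ID, {
  appConfig,
  initProgressCallback: (report) => {
    // report.progress is 0..1, report.text is a human-readable stage description.
    console.log(`${Math.round(report.progress * 100)}% ${report.text}`);
  },
});

// OpenAI-style chat API; stream the reply token by token into the page.
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Say hello in five words." }],
  stream: true,
});
for await (const chunk of stream) {
  document.body.append(chunk.choices[0]?.delta?.content ?? "");
}
```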
05. Optimize first-load: progressive UI + cache hint (90 min)
Render the chat UI immediately; show "Loading model (~750 MB, one-time)…" with a real progress bar (a progress-callback sketch follows this step). Add `<link rel="modulepreload">` for the WASM. After the first load, IndexedDB serves the weights from cache and the model opens in <2 s.
Checkpoint: Cold load (clear cache): UI visible in <500 ms, model usable in <60 s on broadband. Warm load: model usable in <3 s.
Watch out: The browser blocks the main thread during WGSL shader compilation on first run; the UI freezes for 3–5 s right at the moment users expect tokens to start. Surface this to users with a "compiling shaders" message rather than a silent freeze.
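One way to wire the status line, a sketch that assumes the progress callback reaches 1.0 once the weights are in place and that the remaining wait before the engine promise resolves is shader compilation, so it swaps in the "compiling shaders" hint at that point; the element ids are placeholders:

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Placeholders for your own UI elements.
const bar = document.getElementById("progress") as HTMLProgressElement;
const status = document.getElementById("status")!;

async function loadModel(modelId: string) {
  status.textContent = "Loading model (~750 MB, one-time download)…";

  const engine = await CreateMLCEngine(modelId, {
    initProgressCallback: (report) => {
      bar.value = report.progress; // 0..1, drives the real progress bar
      // Once the weights are fully loaded the main thread may freeze for shader
      // compilation, so paint the warning now rather than after the freeze starts.
      status.textContent =
        report.progress >= 1
          ? "Compiling shaders (first run only); the page may pause for a few seconds…"
          : report.text;
    },
  });

  status.textContent = "Model ready.";
  return engine;
}
```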
06. Benchmark + document on real devices (90 min)
Measure steady-state tok/s on: iPhone 15 (Safari), Pixel 8 (Chrome), M2 MacBook Air (Chrome), and a $300 Chromebook (Chrome); a rough timing sketch follows this step. Make a table and a plot in the README.
Checkpoint: README shows: device, browser, tok/s, peak RAM, time-to-first-token, weight-load time. Numbers feel realistic (M-series should far outpace phones on compute; cold-load time is dominated by download bandwidth, so it is similar everywhere).
Watch out: Tab throttling: if Safari or Chrome backgrounds the tab during a long benchmark run, tok/s plummets. Run benchmarks in the foreground.
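A rough way to collect time-to-first-token and decode tok/s through the streaming API, treating each streamed chunk as roughly one token (an approximation, not an exact tokenizer count); the prompt and `max_tokens` value are arbitrary:

```ts
import type { MLCEngine } from "@mlc-ai/web-llm";

// Rough benchmark: each streamed chunk is counted as ~1 token, which is close
// enough for a README table but not an exact tokenizer count.
async function benchmark(engine: MLCEngine, prompt: string) {
  const start = performance.now();
  let firstTokenAt: number | null = null;
  let tokens = 0;

  const stream = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    max_tokens: 256,
    stream: true,
  });

  for await (const chunk of stream) {
    if (chunk.choices[0]?.delta?.content) {
      if (firstTokenAt === null) firstTokenAt = performance.now();
      tokens++;
    }
  }

  const end = performance.now();
  const ttftMs = (firstTokenAt ?? end) - start;
  const decodeSeconds = (end - (firstTokenAt ?? end)) / 1000;
  const tokPerSec = decodeSeconds > 0 ? (tokens - 1) / decodeSeconds : 0;
  console.table({ ttftMs: Math.round(ttftMs), tokPerSec: tokPerSec.toFixed(1) });
}
```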
07. Deploy to GitHub Pages + write the README (60 min)
Push the static build (`vite build`) to a `gh-pages` branch; mind Vite's `base` setting for project sites (sketch after this step). Verify the live URL works on a phone over cellular (the model download is ~750 MB; note this in the UI). The README explains how WebGPU + MLC works, the supported browsers, and the benchmark numbers.
Checkpoint: A friend can open the URL on their phone, wait through the load, and chat with a 1B Llama running entirely on their device.
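One GitHub Pages detail worth a sketch: a project site is served from `https://<user>.github.io/<repo>/`, so Vite's `base` has to match or the built asset URLs 404; the repository name below is a placeholder:

```ts
// vite.config.ts
import { defineConfig } from "vite";

export default defineConfig({
  // GitHub Pages serves a project site under /<repo>/, so asset paths must be
  // prefixed accordingly ("webgpu-llm" is a placeholder for your repo name).
  base: "/webgpu-llm/",
});
```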