Reading Orders
Mosaic isn’t meant to be read end-to-end (though you can). The lessons are designed to compose. Below are the three orders that form the most coherent arcs through the material — each is a thread you can pull and end up with a real working understanding of one corner of modern AI systems.
There is no required order. Lessons stand alone. Most learners follow a thread for a while, branch into a related one, come back. Use the course map to see the whole graph.
AI Systems
Attention, the KV cache, the inference pipeline, the serving stack — and the contributor-level depth on each that turns “I read about it” into “I’ve shipped a PR on it.”
Architecture
1. Multi-Head Attention
2. GQA, MQA & MLA
3. RoPE + YaRN / LongRoPE
4. FlashAttention-3
5. FlashAttention-3 Internals
KV Cache
6. KV Cache Basics
7. PagedAttention
8. Prefix & RadixAttention
9. Disaggregated Serving
Inference-time
10. Sampling
11. Structured Output
12. Chunked Prefill
13. Speculative Decoding
14. Speculative Decoding Internals
Serving stack
15. vLLM & SGLang
16. vLLM Internals
17. SGLang Internals
18. Cost & Latency
19. Observability
Contributor track
20. OSS Contribution Playbook
The capstone in Inference Internals — landing a perf-cited PR in vLLM — is the artifact this thread is built around.
ML Compilers
GPUs from the silicon up; the bedrock-meets-tooling layer (roofline, Tensor Core shapes, NCU); the compilers and DSLs that target them.
Hardware mental model
1. SM Architecture
2. Thread Hierarchy
3. Shared Memory
4. GEMM (Hopper / Blackwell)
5. Strides & Layout
6. TMA & cp.async
Roofline & Profiling: the predict-then-verify discipline
7. Roofline as a Predictive Tool
8. Tensor Core SHAPE Constraints
9. Nsight Compute: The Metric Tree
Compiler theory
10. LLVM IR Tour
11. Passes & Pipelines
12. MLIR Overview
13. Dialects & Lowering
Production compilers + kernel DSLs
14. torch.compile
15. Inductor Fusion Heuristics
16. Operator Fusion
17. Triton
18. CuTe & CUTLASS 4
19. ThunderKittens & TileLang
20. JAX & Pallas
21. IREE & ExecuTorch
Distributed kernels + landscape
22. NCCL & AllReduce Internals
23. Hardware Landscape 2026
The capstone — a fused Triton kernel + roofline writeup — is the artifact. The build guide on that page walks you through it step by step (the same artifact Atlas Capstone 1 ships).
Edge AI
Quantization, on-device runtimes, NPUs, the browser, and swarm inference. The part of the field that runs without the cloud, including a LAN of phones running a 70B-parameter model together.
Quantization schemes + calibration
1. FP8 Inference
2. INT4 / AWQ / GPTQ
3. MXFP4 / NVFP4
4. Rotation Quantization
5. Calibration & KV Cache Quantization
6. On-Device Inference
On-Device runtimes (the four production paths)
NPU programming
Edge formats + small-model recipes
Multimodal + distributed at the edge
The capstone arc here is one of the strongest in Mosaic: three runtimes side-by-side on iOS (On-Device Runtimes), a fully-offline voice assistant (Multimodal Edge), and a 4-device LAN running 70B (Distributed Edge) — three end-to-end Edge AI artifacts that compose.
A few notes on how to use this
- Lessons stand alone. If you’re missing a prerequisite, the lesson links to it at the top. Read the prereq, then come back.
- Capstones are optional but worth it. Each module has one. They’re sized for a focused weekend and produce a working artifact (a kernel, a benchmark, an app, a server). The build guide is on the module index page.
- The cheatsheet is the speedrun. Every lesson’s TL;DR aggregates into /cheatsheet — useful for refreshing a concept without re-reading.
- The map is the index. /map shows every lesson, completed and remaining, organized by track.
If a lesson is unclear or wrong, open an issue. Mosaic is built openly; corrections land within days.