Reading Orders

Mosaic isn’t meant to be read end-to-end (though you can). The lessons are designed to compose. Below are the three orders that form the most coherent arcs through the material — each is a thread you can pull to come away with a real working understanding of one corner of modern AI systems.

There is no required order. Lessons stand alone. Most learners follow a thread for a while, branch into a related one, come back. Use the course map to see the whole graph.


I. AI Systems

Attention, the KV cache, the inference pipeline, the serving stack.

  1. Multi-Head Attention
  2. GQA, MQA & MLA
  3. RoPE + YaRN / LongRoPE
  4. FlashAttention-3
  5. KV Cache Basics
  6. PagedAttention
  7. Prefix & RadixAttention
  8. Disaggregated Serving
  9. Sampling
  10. Structured Output
  11. Chunked Prefill
  12. Speculative Decoding
  13. vLLM & SGLang
  14. Cost & Latency
  15. Observability

The capstone in Inference-Time Architecture — a 200-line continuous-batching server — is the artifact this thread is built around.
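The core idea behind that server is that new requests join the running batch between decode steps instead of waiting for the whole batch to drain. A minimal scheduling sketch of that loop (a toy stub, not the capstone's code; the `Request` fields and the placeholder token generator are invented for illustration):

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    rid: int
    prompt_len: int
    max_new_tokens: int
    generated: list = field(default_factory=list)

class ContinuousBatcher:
    """Admit new requests between steps instead of waiting for a batch to drain."""
    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.queue = deque()      # waiting requests
        self.active = []          # sequences currently in the batch
        self.finished = []

    def submit(self, req: Request):
        self.queue.append(req)

    def step(self):
        # Admit waiting requests while there is room in the batch.
        while self.queue and len(self.active) < self.max_batch:
            self.active.append(self.queue.popleft())
        # One decode step for every active sequence (a stub "model" emits token ids).
        for req in self.active:
            req.generated.append(len(req.generated))  # placeholder token
        # Retire sequences that hit their budget, freeing batch slots.
        still = []
        for req in self.active:
            done = len(req.generated) >= req.max_new_tokens
            (self.finished if done else still).append(req)
        self.active = still

    def run(self):
        while self.queue or self.active:
            self.step()
        return self.finished
```

With `max_batch=2` and three requests, the third request is admitted as soon as the first short sequence retires, mid-generation of the others; that interleaving is what the real server adds a model, a KV cache, and an HTTP front end around.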


II. ML Compilers & Kernels

GPUs from the silicon up; the compilers and DSLs that target them.

  1. SM Architecture
  2. Thread Hierarchy
  3. Shared Memory
  4. GEMM (Hopper / Blackwell)
  5. Strides & Layout
  6. TMA & cp.async
  7. LLVM IR Tour
  8. Passes & Pipelines
  9. MLIR Overview
  10. Dialects & Lowering
  11. torch.compile
  12. Triton
  13. CuTe & CUTLASS 4
  14. ThunderKittens & TileLang
  15. Operator Fusion
  16. JAX & Pallas
  17. IREE & ExecuTorch
  18. Hardware Landscape 2026

The capstone — a Triton kernel that beats cuBLAS at small-N GEMM — is the artifact. The build guide on that page walks you through it step by step.
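At small N the win comes from tile-size and accumulation choices, and the tile structure a GEMM kernel follows can be sketched in plain Python: each `(i0, j0)` pair below plays the role of one Triton program instance, and the inner `k0` loop mirrors staging K-blocks through shared memory. This is an illustration of the loop structure only, not the capstone kernel:

```python
def tiled_gemm(A, B, BM=2, BN=2, BK=2):
    """C = A @ B computed tile by tile, one (BM x BN) output tile at a time."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    assert len(B) == K
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, BM):          # one (i0, j0) tile ~ one GPU program
        for j0 in range(0, N, BN):
            acc = [[0.0] * BN for _ in range(BM)]  # accumulator in "registers"
            for k0 in range(0, K, BK):  # K-blocks, as if staged via shared memory
                for i in range(min(BM, M - i0)):
                    for k in range(min(BK, K - k0)):
                        a = A[i0 + i][k0 + k]
                        for j in range(min(BN, N - j0)):
                            acc[i][j] += a * B[k0 + k][j0 + j]
            for i in range(min(BM, M - i0)):
                for j in range(min(BN, N - j0)):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C
```

In the real kernel the tile sizes, the layout of `acc` across registers, and how loads overlap with compute are exactly the knobs the build guide tunes.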


III. Edge AI

Quantization, on-device runtimes, NPUs, the browser, and swarm inference. The part of the field that runs without the cloud — including a LAN of phones running a 70B model together.

Foundations

  1. FP8 Inference
  2. INT4 / AWQ / GPTQ
  3. MXFP4 / NVFP4
  4. Rotation Quantization
  5. On-Device Inference
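The foundations above all revolve around low-bit formats. As a reference point for what "INT4" means at its simplest, here is a symmetric per-tensor quantize/dequantize round trip (a minimal sketch, not the AWQ or GPTQ algorithm, which choose scales far more carefully):

```python
def quantize_int4(weights):
    """Symmetric per-tensor INT4: map floats onto integers in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # guard all-zero tensors
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate floats; error is bounded by about scale / 2."""
    return [v * scale for v in q]
```

Everything in the INT4 lessons — grouping, activation-aware scale search, rotation tricks — exists to shrink the reconstruction error this naive version leaves on the table.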

On-Device runtimes (the five production paths)

  1. llama.cpp Internals
  2. ExecuTorch
  3. Core ML & ANE (intro)
  4. TFLite & LiteRT
  5. WebGPU & WebLLM

NPU programming

  1. Qualcomm Hexagon
  2. Apple Neural Engine — deep dive

Edge formats + small-model recipes

  1. GGUF & i-matrix
  2. Distillation for Edge
  3. Speculative Decoding

Multimodal + distributed at the edge

  1. Whisper.cpp & on-device speech
  2. Mobile VLMs
  3. EXO & Swarm Inference

The capstone arc here is one of the strongest in Mosaic: three runtimes side-by-side on iOS (On-Device Runtimes), a fully offline voice assistant (Multimodal Edge), and a 4-device LAN running a 70B model (Distributed Edge) — three end-to-end Edge AI artifacts that compose.
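The distributed capstone rests on one planning step: splitting a model's layers across devices in proportion to their memory. A sketch of that partitioning (a hypothetical helper in the spirit of EXO-style swarm sharding, not EXO's actual API):

```python
def partition_layers(n_layers, device_mem_gb):
    """Assign each device a contiguous span of layers proportional to its memory.
    Returns a list of (start, end) half-open layer ranges, one per device."""
    total = sum(device_mem_gb)
    shards, start = [], 0
    for i, mem in enumerate(device_mem_gb):
        if i == len(device_mem_gb) - 1:
            end = n_layers  # last device absorbs any rounding remainder
        else:
            end = start + round(n_layers * mem / total)
        shards.append((start, end))
        start = end
    return shards
```

For an 80-layer model across four phones with 8, 8, 16, and 16 GB free, this yields contiguous spans covering all 80 layers; at inference time, activations for each token hop across the LAN from one span to the next.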


A few notes on how to use this

  • Lessons stand alone. If a prerequisite is missing in your head, the lesson links to it at the top. Read the prereq, come back.
  • Capstones are optional but worth it. Each module has one. They’re sized for a focused weekend and produce a working artifact (a kernel, a benchmark, an app, a server). The build guide is on the module index page.
  • The cheatsheet is the speedrun. Every lesson’s TL;DR aggregates into /cheatsheet — useful for refreshing a concept without re-reading.
  • The map is the index. /map shows every lesson, completed and remaining, organized by track.

If a lesson is unclear or wrong, open an issue. Mosaic is built openly; corrections land within days.