Cheatsheet

Every lesson's TL;DR, grouped by track. Press Cmd/Ctrl+K to search the whole course.

06 · Applied AI · Build & Ship

Audio & Voice →

The 2026 voice-agent stack: STT (Whisper / Distil-Whisper) → LLM (Claude / GPT / open) → TTS (Kokoro / OpenAI / ElevenLabs) → audio out. Latency target: under 1 second total round-trip.
Whisper is OpenAI's open-source STT model — multilingual, extremely robust to noise. Distil-Whisper is a 6× smaller / faster variant with ~95% of Whisper-Large's accuracy.
TTS in 2026: Kokoro (~80M params, near-human quality, runs on a phone) is the open frontier; ElevenLabs / OpenAI Voice / PlayHT lead on hosted quality and voice cloning.
End-to-end voice models (GPT-4o-Audio, Claude voice mode, Moshi) skip the STT/TTS sandwich — audio in, audio out, no text intermediate. Latency drops below 300 ms but architectural maturity is still 2025-fresh.
For most product builders: Whisper + LLM + Kokoro running locally is the cheapest path; hosted APIs are the fastest-to-ship.

Capstone — Ship It →

The Applied AI track culminates here: build, eval, and deploy a working RAG-agent that you can use in real life. The lessons up to this point are the parts; this is the assembly.
Pick a real problem you have. Not a demo. Not a Hello-World. Something you would actually use weekly — your reading inbox, your meeting notes, your coding context, your research notes.
The full stack: ingest pipeline → chunked + indexed (hybrid retrieval) → ReAct agent with MCP tools → structured-output API → minimal UI → deployed somewhere persistent.
Eval is the bar. Without an eval, you have a demo. With an eval, you have a product. Build the eval before you start optimizing.
Most teams fail this not on technical capability but on scoping. Cut features ruthlessly. A working v0 in one weekend beats a half-built v3 in a month.

Multimodal (Vision) →

A vision-language model (VLM) processes images alongside text. Architecture: image encoder (ViT or CNN) → projection layer → transformer decoder. Works exactly like text-only LLMs at the decoder stage.
CLIP (OpenAI, 2021) was the breakthrough — contrastive image-text training producing aligned embeddings. Still the foundation of most vision retrieval; SigLIP (Google, 2023) is the modern improvement.
Frontier VLMs in 2026: Claude Sonnet/Opus 4 with vision, GPT-4o, Gemini 2.x, Llama 3.2-Vision, Qwen2-VL, InternVL2 — all natively multimodal, not text-then-vision adapters.
Document understanding (PDFs, charts, screenshots) is the killer app: extract structured data from unstructured visual content. Used heavily in agents that need to "see" software UIs.
For 2026 production: prefer hosted VLMs for general use; self-host Qwen2-VL or InternVL2 when on-device or compliance matters.

Safety & Injection →

Prompt injection is the canonical LLM security failure: untrusted input (a doc, a webpage, a tool result) contains instructions that hijack the model. Two flavors: direct (user types malicious prompt) and indirect (malicious content arrives via tool / RAG / agent loop).
There is no model-side fix that's bulletproof. Guardrails reduce risk; they don't eliminate it. The 2026 production answer is defense in depth.
Llama Guard / Granite Guardian / WildGuard: small classifier LLMs that screen inputs and outputs. ~95% accuracy on adversarial benchmarks; ship in production at every major lab.
Dual-LLM pattern: a "privileged" LLM that holds secrets / takes actions; a "quarantined" LLM that processes untrusted content. They communicate via structured channels only.
Spotlighting: tag untrusted content (delimiters, encodings) so the model knows it's data, not instructions. Combined with Pydantic structured output, the most practical defense.

Embeddings →

An <Term name="embedding">embedding</Term> is a fixed-size vector that represents text (or images, audio, anything). Similar inputs get similar vectors. Cosine similarity is the standard distance.
Modern dense embeddings: 768–4096 dimensions, contrastively-trained on hard pairs. Top open models (April 2026): Nomic Embed v2, BGE-M3, Stella-1.5B, GTE-Qwen2, all near-frontier OpenAI / Voyage parity.
Matryoshka embeddings can be truncated — a 1024-dim vector still works at 256-dim, just with slightly lower quality. Lets one model serve fast / accurate / storage-cheap variants.
Hybrid retrieval (dense + sparse) beats pure dense or pure BM25 by ~5–10 points NDCG@10 on most retrieval benchmarks. Production pipelines reach for both.
Reranking (a smaller cross-encoder model that scores query/doc pairs) at ~50× the cost per call but ~10× higher precision is the third stage of the 2026 retrieval stack.

Prompt Engineering Foundations →

Zero-shot is the default. Add complexity only when the model fails — every extra technique is more tokens, more cost, more places to fail.
Few-shot locks in format and style. It can also lock in errors if your examples are wrong; a bad few-shot is worse than zero-shot.
Chain-of-thought helps on multi-step problems (math, planning) and hurts on simple classification. Not a free lunch.
Self-consistency = sample N CoT reasonings, take the majority answer. Genuinely better on reasoning tasks; N× cost.
Structured output (JSON mode, Pydantic, Outlines) is how you build real systems. Always prefer schema-validated output over regex.

Structured Output →

Reliable structured output requires three layers: schema definition (Pydantic / TypeBox / Zod) → constrained decoding on the server (XGrammar, Outlines) → validation + retry on the client (Instructor, structured-outputs APIs).
Pydantic is the universal frontend. Define your schema as a Python class; libraries (instructor, openai's response_format, anthropic's tool-use, vllm) consume it directly.
The server doing constrained decoding is what matters most. Without it, you're trusting the model to produce valid JSON ~95% of the time, retrying the rest. With it, output is guaranteed.
For 2026 production: vLLM v1 / SGLang / TensorRT-LLM ship XGrammar; OpenAI / Anthropic APIs enforce schemas natively. Default to `response_format=PydanticClass`; don't roll your own JSON parsing.
Schemas should be simple, flat, well-named. Deeply-nested schemas with many enums confuse small models; LLM-friendly schemas align with how training data was shaped.

Function Calling & Tool Use →

Tool use is structured output where the model picks a tool name and fills typed arguments. The model never executes — your code does, then feeds the result back.
Modern API schemas (Anthropic, OpenAI) are nearly identical: define tools with JSON Schema, the model returns tool_use blocks with name + arguments.
The loop: user message → model returns tool calls → your code runs them → feed tool_result back → model produces the final answer (or another tool call).
Parallel tool calls are now the default: a model can request 3 tools in one turn. Run them in parallel; latency drops by N.
Failure modes: hallucinated tool names (rare on frontier models, common on small ones); coercing string args to numbers (always validate); infinite tool loops (set max_steps).

Advanced RAG — HyDE, GraphRAG, Agentic Retrieval →

Naive RAG fails on multi-hop questions, vague queries, and entity-centric corpora. Symptom: high retrieval recall on facts, low task accuracy.
HyDE generates a hypothetical answer with the LLM, embeds it, then retrieves — bridges the query/document phrasing gap.
Query decomposition breaks one question into 2–4 sub-queries, retrieves for each, then reasons over the union. Best for multi-hop.
GraphRAG (Microsoft, 2024) builds a knowledge graph from the corpus, then summarizes communities — beats vector search on "global" questions a chunk can't answer alone.
Agentic retrieval lets the model decide its own searches in a loop. Highest ceiling, highest cost, hardest to debug.

Model Context Protocol →

MCP (Model Context Protocol) is an open spec — released by Anthropic in late 2024, adopted broadly through 2025–2026 — for how an LLM client (Claude Desktop, Cursor, custom agent) discovers and calls tools / reads files / fetches prompts from an external server.
Three primitives: tools (functions the LLM can invoke), resources (files / URLs the LLM can read), prompts (parameterized templates the user can invoke).
One protocol, many transports: stdio (local subprocess), SSE / HTTP (network). Servers can be written in TypeScript, Python, Go, Rust — official SDKs in all of these.
The 2026 ecosystem: hundreds of public MCP servers (databases, APIs, file systems, GitHub, Slack, Linear). Claude Desktop, Cursor, VS Code, Zed, Continue.dev all consume MCP. Building an MCP server is the new "build an integration."
Compared to ad-hoc tool definitions: MCP decouples the tool implementation from the client. Write once, ship everywhere.

RAG Fundamentals →

RAG = retrieval + generation. Pull the most relevant chunks for a question, stuff them in the prompt, let the model answer.
The pipeline: chunk documents → embed → store → retrieve top-k → re-rank → stuff into prompt → generate.
Hybrid retrieval (dense + BM25 fused via RRF) beats either alone. Dense gets meaning; BM25 gets exact keywords.
A cross-encoder re-ranker is the highest-leverage stage. It sees the query and chunks together and re-scores. Adds 100 ms, can lift Recall@5 by 15+ points.
Chunking is underrated. Recursive character splitting at ~400 tokens with 50-token overlap is the sane default. Semantic chunking is sometimes worth it; rarely worth it from day one.

Build a ReAct Agent →

ReAct (Yao et al., 2022) = Reason + Act. The agent alternates between thinking (free-text reasoning), acting (calling a tool), and observing (reading the tool's output). Same pattern, every modern LLM agent.
The architecture is small: an LLM, a list of tools (functions), a loop. A working ReAct agent is ~80 lines of Python, no LangChain required.
Tool-use APIs (OpenAI function-calling, Anthropic tools) handle the structured-output side of "the model picks a tool and arguments." Your code dispatches and returns the observation.
Termination conditions matter: max iterations, success-by-the-model-saying-done, or a result-format match. Without them you loop forever.
LangGraph, OpenAI Assistants API, Anthropic's messages.create with tools — all are productionizations of the same loop. Build it once from scratch and the framework code becomes legible.

Cost & Latency →

Production LLM serving has five economic levers, in roughly increasing implementation cost: prefix caching → continuous batching → model routing → quantization → hardware. Each is multiplicative; doing all five well gets you 10–30× cheaper than naive serving.
Prefix caching is the cheapest win and most teams leave it on the table. Hit-rate-aware prompt design takes a day and pays off forever.
Continuous batching is on by default in vLLM v1, SGLang, TensorRT-LLM. Old TGI / HF transformers servers leave 5× on the floor — switch.
Model routing (small model first, escalate on uncertainty) is the highest-leverage lever for tasks where ~70% of queries are easy. Saves 3–8× on the dollar; no quality loss.
FP8 / INT4 quantization halves to quarters memory and bandwidth costs. Modern open weights take quantization with negligible quality regression.
Hardware is last because it's a capex decision. Use the calculator below to compare B200 vs MI355X vs H100 for your workload.

Observability →

LLM systems need three layers of observation that map cleanly to web-app analogues: traces (every input, every tool call, every token, every output), online evals (a judge that scores live traffic and produces a quality time-series), and regression detection (alerts when a deploy moved a metric).
Langfuse, Phoenix, and LangSmith are the 2026 production reference stacks. Langfuse and Phoenix are OSS; LangSmith is paid but tightest with LangChain. They all model the same OpenTelemetry-style trace tree: trace → span → tool/llm call → tokens.
Online evals are not unit tests. Unit tests run on a fixed dataset; online evals score real user traffic, surface drift, and feed regressions into your CI loop.
Five metrics to track from day one: latency (TTFT + TPOT), cost ($/request), trace volume, fail rate (parse errors, refusals), and a single quality score from your judge model. Everything else is derivative.
Cost-down without observability = quality-down silently. Every lever from the previous lesson is a regression risk; you cannot ship them safely without the harness on.

On-Device Inference →

llama.cpp is the universal runtime: CPU, Apple Metal, CUDA, Vulkan. GGUF format. If you don't know what to run, run llama.cpp.
MLX (Apple) uses unified memory on M-series chips — fastest path on a Mac. PyTorch-shaped Python API.
ExecuTorch is PyTorch's mobile/edge runtime; produces .pte files for Android (NNAPI / Vulkan) and iOS (Core ML / MPS).
A 4-bit Q4_K_M quantized 8B model fits in ~5 GB and runs at 5–15 tokens/sec on a modern phone. Genuinely usable.
The unlock is K-quants (mixed precision per row) and Apple/ARM kernels that fuse dequant + matmul.

vLLM & SGLang →

vLLM = the default inference server. PagedAttention, continuous batching, broad model support, well-documented. Easy to deploy.
SGLang = the throughput-focused alternative. RadixAttention for shared-prefix workloads, structured-output via constrained decoding (XGrammar), often 1.5–3× faster on agent / chat workloads where prompts share long preambles.
TGI (HuggingFace) = the convenient choice if you're already in the HF ecosystem; ships less aggressive perf, more polish.
TensorRT-LLM = peak NVIDIA performance; pay in build complexity and Hopper-only optimizations.
Pick by workload: broad inference → vLLM. Many shared-prefix / agentic requests → SGLang. NVIDIA-native max throughput → TensorRT-LLM. Internal tooling within HF → TGI.

05 · ML Compilers & Hardware

LLVM IR Tour →

LLVM IR is a strongly-typed, SSA, RISC-like virtual instruction set. The "language" between your source compiler and the actual machine. Once you can read it, you can read what every modern compiler is actually doing.
It has three forms: the .ll text format (human-readable), the .bc bitcode binary, and the in-memory C++ Module API. They are all the same IR — round-trip lossless.
Three building blocks: Module → Function → BasicBlock → Instruction. SSA means each value is defined exactly once. PHI nodes pick up control-flow merges.
The type system is small but rigid: i1, i8, i32, i64, float, double, half, bfloat, plus pointers, vectors, structs, arrays. No bare ints — every integer carries its width.
Almost every AI compiler eventually lowers to LLVM IR. MLIR's llvm dialect is literally this IR. Triton emits LLVM IR. JAX/XLA emits LLVM IR through StableHLO. Knowing LLVM is the universal floor.

Dialects & Lowering →

Lowering = a pass that rewrites IR from one dialect to a lower-level dialect. The semantics stay the same; the representation changes.
Most lowerings are written as rewrite patterns: "match this op pattern in dialect A; replace with this set of ops in dialect B." MLIR's RewritePatternSet + dialect-conversion driver compose them.
The full pipeline of an AI compiler is a stack of lowerings: linalg → scf → vector → gpu → nvgpu → llvm. Each step keeps just enough structure for the next pass to optimize against.
Type conversion is half the work: tensor becomes memref becomes llvm.ptr. Every op needs a corresponding rewrite for its operand types.
Bufferization is the most important specific lowering in AI compilers — turning value-typed tensors into memory-typed memrefs, which is when allocation decisions get made.

MLIR Overview →

MLIR = Multi-Level IR. One framework for representing programs at many levels of abstraction simultaneously: linear-algebra ops at the top, GPU ops in the middle, LLVM IR at the bottom. Every level is just a different dialect.
Every modern AI compiler — Triton, IREE, JAX/XLA, ExecuTorch, OpenXLA, ONNX-MLIR, Modular's MAX — is built on MLIR. PyTorch's torch.compile lowers through it (via Inductor → Triton → MLIR → LLVM).
The dialect is the unit of vocabulary. linalg, tensor, arith, scf (structured control flow), memref, gpu, nvgpu, vector, llvm are the canonical built-in dialects. Custom domains (TPU, NPU, your hardware) ship their own.
Lowering is the central operation: rewrite IR from a higher-level dialect to a lower one, repeatedly, until you're at llvm and can hand off to LLVM. Every AI compiler is essentially a stack of lowerings.
MLIR is younger than LLVM (Google, 2019) but moves much faster. The API churns — code from 2022 often doesn't compile against current main. Pin your version.

Passes & Pipelines →

An LLVM pass is a function that reads or rewrites IR. Two flavors: analysis passes (compute information about the IR, like dominator trees or alias info) and transformation passes (rewrite IR, like inlining or dead-code elimination).
Passes compose into a pipeline. -O0 is empty; -O2 is ~150 passes; -O3 adds aggressive vectorization. Each pass assumes the previous ones ran.
LLVM 14+ uses the New Pass Manager (NPM). Old legacy::PassManager is deprecated; new code uses PassBuilder + ModulePassManager. The migration matters because tutorials older than 2022 are usually wrong.
Analysis passes are cached and invalidated: if a transformation changes the IR, dependent analyses get re-run lazily. This is what makes pipelines fast.
The single most useful flag for understanding a pipeline: opt --print-after-all — dumps IR after every pass. Scary the first time; indispensable forever after.

CuTe & CUTLASS 4 →

CUTLASS = NVIDIA's open-source C++ template library for GEMM-class kernels. The reference for how to write a fast tensor-core kernel. Every cuBLAS kernel from Hopper onward shares its design language with CUTLASS.
CuTe = the layout sublanguage inside CUTLASS 3.x+. A small, expressive type system for tile shapes and layouts — the maps between logical tile coordinates and physical memory.
The mental shift: everything is a layout. A SMEM tile, a register fragment, a TMA descriptor, a swizzle pattern — all are CuTe Layout values composed by tile, partition, coalesce, etc.
CUTLASS 4 (2024–2025) is the Hopper/Blackwell native version. Adds first-class TMA, WGMMA, warp specialization, persistent kernels, FP8/FP4. Older CUTLASS 2.x is now legacy.
Hand-written CUTLASS still beats Triton by 5–10% on edge shapes and is the only way to access certain NVIDIA-specific instructions immediately at launch. The cost is real C++ template depth.

Hardware Landscape 2026 →

NVIDIA Blackwell (B200, GB200) is the 2025–2026 flagship: 5th-gen tensor cores, FP4/FP6 native, ~2.5× H100 throughput, 192 GB HBM. The default for new frontier work, despite the price.
AMD MI355X matches Blackwell's FP8 throughput, ships 288 GB HBM (50% more than B200), at lower per-hour pricing. Software is still catching up but ROCm 6.x + Triton-AMD make it viable for serious work.
Google TPU v6 (Trillium) — TPU v5p's successor. Locked to GCP and JAX, but extraordinary perf/$ on TPU-shaped workloads. Where Gemini gets trained.
Cerebras WSE-3 / Groq LPU — non-GPU silicon for inference and specific training niches. Cerebras: enormous on-chip memory, dataflow execution, TF-shaped programming model. Groq: deterministic, ultra-low-latency inference for routing-heavy LLM serving.
AWS Trainium 2 — Amazon's training chip. Software (Neuron SDK) is the friction; pricing is the draw. Used at Amazon and increasingly by frontier labs as a complement.
Apple M4 Max / Qualcomm Snapdragon X / Ampere AmpereOne — the non-data-center tier where edge AI lives. Different rules; covered in the Edge AI thread.

ThunderKittens & TileLang →

ThunderKittens (TK) is a Stanford (Hazy Research) C++ embedded DSL for tile-shaped GPU kernels. The thesis: a 16×16 tile of bf16 is the only primitive your kernel really needs. Build everything from that.
TileLang (Microsoft Research) is the Python-syntax cousin: a tile-first kernel language designed for AI workloads, with first-class TMA / WGMMA / warp specialization.
Both pitch the same idea: Triton is great but its tile algebra is partial — TK and TileLang make tiles explicit primitives with full hardware acceleration baked in. Result: kernels that are shorter than Triton and faster.
ThunderKittens shipped the fastest-known FlashAttention-2 forward (April 2024) in ~80 lines of C++. The maintained version supports H100, B200, and AMD MI300X.
These DSLs are 2024–2026 frontier work. They're production-used in HazyResearch's stack, in some labs' kernel teams, and in OSS work like FlexAttention. Worth knowing because it's where things are going, not because they replaced Triton in 2026.

Triton →

Triton is a Python-syntax kernel DSL that compiles via MLIR (TritonGPU dialect → llvm) to PTX. You write what looks like NumPy on tiles; the compiler emits register-tiled, SMEM-staged, tensor-core-using kernels.
The mental shift: you program the block, not the thread. A Triton "program" is one CTA's worth of work. Inside it you operate on whole tiles (vectors, matrices); the compiler picks the threads-per-warp distribution.
`@triton.autotune` picks the best (BLOCK_M, BLOCK_N, BLOCK_K, num_warps, num_stages) per-shape on first run. This is the feature that makes Triton competitive with hand-tuned CUTLASS without the maintenance.
Triton is the daily-driver kernel language of OpenAI, the Triton-using parts of PyTorch (torch.compile lowers to it), and most performance-critical OSS work in 2025–2026. Hand-written CUTLASS still wins by 5–10% on edge cases.
The 2024 frontier: Triton 3.x adds Hopper TMA and warp specialization. Triton on Blackwell (5th-gen tensor cores, FP4) lands incrementally through 2025–2026.

IREE & ExecuTorch →

IREE (Intermediate Representation Execution Environment) is Google's open-source compiler + runtime. Compiles models (any frontend that emits StableHLO/MLIR) to a portable artifact that runs on CPU, GPU, NPU, edge silicon. Strongest production story for AMD, Apple, embedded.
ExecuTorch is PyTorch's mobile runtime. Takes a torch.export()-d model graph, lowers through an MLIR-based path, produces a .pte (PyTorch ExecuTorch) artifact you ship in an app bundle.
Both compile ahead of time. There's no Python at runtime; the deployable artifact is a tiny binary.
Both are MLIR-native — they compose with the dialects you saw in MLIR Overview, and accept custom dialects from hardware vendors. This is how Apple, Qualcomm, and AMD ship their NPU support.
For 2026 deployment: ExecuTorch is the path for PyTorch → mobile (Meta, OEM partners). IREE is the path for cross-framework, cross-hardware deployment, especially edge.

JAX & Pallas →

JAX uses a different compiler stack than PyTorch. jit-decorated functions trace to StableHLO (a stable subset of HLO, the XLA op set), then XLA compiles to GPU/TPU.
XLA is the long-standing graph compiler (Google, ~2017+). It does fusion, layout assignment, code-gen for GPU/TPU/CPU. It's the production compiler that runs Gemini, every Google production model, and JAX-based research everywhere.
Pallas is JAX's kernel DSL — Triton-like, but inside JAX. You write a JAX function, decorate it with pallas.pallas_call, and it lowers to a hand-written kernel for GPU (via Triton) or TPU (via Mosaic, a TPU-specific kernel emitter).
The single most distinctive thing about Pallas: the same kernel source can target both GPU and TPU. The dtype/layout primitives are the same; the backend chooses how to lower.
For 2026 production: JAX is the dominant choice for TPU work, neck-and-neck with PyTorch on GPU. Pallas is the kernel escape hatch — used heavily inside Google for Gemini training, increasingly used externally for high-perf JAX kernels.

Operator Fusion →

- A pointwise op like add, mul, relu, sigmoid reads its inputs from HBM and writes its outputs to HBM. Doing this for 5 chained ops means 10 HBM trips and ~5× the bandwidth a single-pass implementation would need. - Operator fusion = generate one kernel that does the whole chain, keeping intermediates in registers. 2–10× speedup on bandwidth-bound chains, often more on long ones. - Inductor (PyTorch's torch.compile), XLA (JAX/TF), TVM, IREE all do this automatically. They differ in how aggressive they are and what they fuse. - Three fusion classes: 1. Pointwise + pointwise — trivial; always done. 2. Reduction + pointwise (epilogue fusion) — a softmax fused with the multiply that follows it; the canonical attention optimization. 3. Matmul + pointwise (output-tile fusion) — a GEMM with bias-add or activation in the same kernel; what CUTLASS calls "epilogue." - The fusion boundary is a memory-format change or a non-fusable op (random sample, sort, complex control flow). Compilers cluster fusable ops between these boundaries.

torch.compile (Inductor) →

torch.compile(model) swaps PyTorch's eager execution for graph capture (Dynamo) + autograd compilation (AOTAutograd) + kernel codegen (Inductor).
Dynamo captures Python-level graphs by hooking into bytecode evaluation, falling back gracefully when it can't trace.
AOTAutograd lifts the Python autograd into a static forward+backward joint graph that the compiler can optimize together.
Inductor lowers the graph to Triton kernels for GPU and C++ + OpenMP for CPU. Fuses pointwise ops aggressively.
Real-world wins: 1.3–2× speedup on training, 1.5–4× on inference, mostly from kernel fusion and CUDA Graph capture. The default in PyTorch 2.5+.

07 · Edge AI

WebGPU & WebLLM →

WebGPU is the W3C-standard browser GPU API. Stable in Chrome 113+, Edge 113+, Safari 18+; Firefox is behind a flag. It compiles WGSL (a Rust-flavored shader language) to whatever the device's GPU stack speaks: Metal on Apple, D3D12 on Windows, Vulkan on everything else.
WebLLM (@mlc-ai/web-llm) compiles transformers (Llama, Phi, Mistral, Gemma) to WGSL via the MLC compiler stack. The browser tab becomes the runtime. ~750 MB weights for 1B Q4F16; ~4 GB for 7B Q4F16; cached in IndexedDB.
transformers.js is the alternative — runs ONNX models via ONNX Runtime Web (WebGPU + WASM). Lower performance than WebLLM for LLMs, but a richer model zoo (BERT, CLIP, Whisper, segmentation).
WebNN is the upcoming W3C neural network API (not GPU shaders, NN ops). Currently shipping in Chrome 130+ on Windows + macOS. Targets the platform NPU (ANE on macOS, DirectML on Windows). Faster + more power-efficient than WebGPU for LLMs but the model coverage is still small.
The killer property: a static-hosted website on GitHub Pages can deliver a 1B LLM that runs entirely client-side. No server, no API key, no rate limit, no privacy concerns.

Small LLMs & Distillation →

Small LLMs (≤4B params) are not just "shrunk frontier models" — they're produced via deliberate recipes that maximize signal per parameter. The 2024–2026 standard: train longer, distill from a frontier teacher, post-train carefully.
Distillation = train a small "student" to match a larger "teacher's" outputs (logits, hidden states, or chosen-vs-rejected preferences). Most modern small LLMs are distilled from a same-family large model — DeepSeek-V3-distill, Llama-3.3-distill, etc.
Long pretraining matters more than data quality: MiniCPM-3 (2.4B) trained on 5T tokens beats some 7B models trained on 1.5T. The "scaling laws say small needs less" intuition is wrong for small-but-deep training.
Post-training is the gap closer: SFT + DPO (or GRPO) with carefully-curated chat data closes most of the small-vs-large gap on the workloads people actually care about.
TinyLlama (1.1B), MiniCPM-3 (2.4B), Qwen2.5-1.5B/3B, Phi-3.5-mini, Llama-3.2-1B/3B are the 2026 reference small models. Each represents a slightly different recipe; understanding them is understanding the edge-LLM design space.

Speculative Decoding →

Speculative decoding makes an LLM generate tokens 2–4× faster without changing the output. The trick: a small draft model proposes K tokens; the target model verifies all K in one parallel forward pass; the longest accepted prefix is the new context. Tokens are bit-identical to running the target alone.
The math: instead of one forward per token at the slow model, you do one forward per K candidates — and the target model is bandwidth-bound on a phone, so verifying K tokens costs almost the same as decoding 1.
Three families: classic draft-target (Llama-3.2-1B → Llama-3.2-7B), self-speculative (Medusa, EAGLE — the model has additional heads that draft from itself), and lookahead (no draft model; an n-gram cache predicts).
On a phone the typical speedup is 1.8–2.5× for matched-family pairs (1B drafting 7B), with acceptance rate 60–80% on chat workloads. Higher on code (where output is predictable), lower on creative writing.
llama.cpp ships speculative decoding as a first-class API (llama_speculative_*); it's a few lines of code on top of an existing setup. Outputs are bit-identical to non-speculative.

EXO & Swarm Inference →

Swarm inference is pipeline-parallel inference applied at the device-fleet level: shard a model's layers across N devices on a LAN, send activations between them, run a model none of them could run alone.
EXO (exo-explore/exo) is the open-source reference: Python framework, mDNS discovery, dynamic partitioning across iOS / Android / macOS / Linux. Uses tinygrad and mlx as backends. Released early 2024, ~12K GitHub stars.
Petals (bigscience/petals) is the BitTorrent-of-LLMs cousin — a public swarm where strangers share unused compute. Slower and less private than EXO but useful for genuinely large models (Llama-3.1-405B).
The math: a 4-device LAN running pipeline-parallel inference on a 70B model averages 3–5 tok/s — readable but not snappy. The bottleneck is per-hop network latency, not compute.
Wifi-6 or wired ethernet, ~12–30 ms per hop. Wifi-5 (.11ac) hits 50–100 ms tail latency under contention and tanks tok/s.

GGUF & i-matrix →

GGUF carries quantized weights plus the recipe that quantized them. The K-quant variants (Q4_K_M, Q5_K_M, etc.) are baked into the format spec.
K-quants store per-super-block (256 weights) statistics plus per-sub-block (16/32 weights) refinements. More precision than naive INT4, no calibration data needed for "OK" results.
The i-matrix (importance matrix) is a calibration-driven addition. Compute activation statistics on a calibration dataset; weight the quantization-error metric by importance per channel; feed back into the K-quant chooser. Bumps quality of Q4_K_M / Q3_K from "OK" to "indistinguishable from FP16."
The i-matrix is optional but essentially free to compute (~5 minutes on a calibration set). Most modern GGUF releases on Hugging Face ship i-matrix-quantized variants.
Reading per-quant size and quality numbers fluently — Q4_K_M ≈ 4.5 bits, Q5_K_M ≈ 5.5, Q6_K ≈ 6.6, Q8_0 ≈ 8.5 — is the price of admission for any local-LLM conversation in 2026.

Mobile VLMs →

A VLM (vision-language model) is an LLM with a vision encoder and a learned projector glued in front. Image goes through CLIP/SigLIP → projector → tokens → LLM. Output is text. Fits in any chat-shaped runtime.
The 2026 mobile zoo: MiniCPM-V-2.6 (8B, ~5 GB at Q4_K_M, the practical pick), Phi-3.5-Vision (4.2B, ~3 GB, faster but less accurate), LLaVA-Mobile / MobileVLM (1–3B, ~2 GB), PaliGemma (3B, Google's open VLM, hard to beat for OCR).
The projector is a 2- or 3-layer MLP that maps CLIP/SigLIP features (1152-dim) to LLM token-embedding space (4096-dim for Llama-3-8B). It's tiny — ~10 MB. Most of the size is still the LLM.
Image preprocessing is the silent failure mode. Each VLM uses a specific resolution and aspect-ratio scheme (Phi-3.5-Vision: dynamic 336×336 tiling; MiniCPM-V: 448×448 with adaptive aspect ratios). Use the wrong preprocessor and accuracy collapses without erroring.
Mobile VLMs ship in ggml/llama.cpp via separate mmproj files (the projector + image encoder packaged separately from the LLM weights). Both files are required at load time.

Whisper.cpp & On-Device Speech →

Whisper.cpp is Georgi Gerganov's port of OpenAI Whisper to the same ggml runtime that powers llama.cpp. Quantized models (Q5_1 / Q8_0) hit real-time-or-better on a phone CPU; Metal / Vulkan / CUDA backends 2–4× that on a GPU.
The runtime exposes a tiny C API (whisper_init_from_file, whisper_full, whisper_full_get_segment_text) and bindings for every language. ~1 MB binary, ~244 MB for small.en Q5_1.
Voice activity detection (VAD) is the unsexy piece that makes streaming actually work — Silero-VAD or webrtcvad runs in under 1% CPU and detects speech-vs-silence in 30 ms windows. Without VAD, you transcribe silence and waste battery.
Chunking is the rest of the streaming story: feed Whisper 5–30 s windows with overlap, deduplicate the overlap, emit partial text as the latest chunk arrives. The Whisper.cpp stream example codifies the pattern.
The right model size: small.en (244 MB) for English-only on phone, base (74 MB) if you're memory-constrained, medium (~770 MB) on a laptop. large-v3 only on a desktop GPU.

Apple Neural Engine →

The Apple Neural Engine (ANE) is a fixed-function NPU on every iPhone since A11 (2017) and every M-series Mac. 16-core in A17/M3 at ~35 TOPS (INT8); M4 bumps the same 16-core ANE to ~38 TOPS. Lives next to the CPU and GPU on the same SoC die. (When you see "100+ TOPS" in Apple marketing, that's the full SoC aggregate across CPU + GPU + ANE; the ANE silicon itself is ~38 TOPS.)
Apple does not document the ANE op set publicly. What runs on ANE is determined empirically — Core ML's compiler decides, you observe via Instruments.
ANE prefers: convs, matmuls, FP16 attention, INT8 quantized weight-only, fixed-shape inputs, and tensors with fewer than 16K elements per dimension. It rejects: dynamic shapes, exotic activations (gelu-fast variants), some attention layouts, large embedding tables.
Models compiled with compute_units=ALL automatically partition between ANE / GPU / CPU. Use compute_units=CPU_AND_NE to force "ANE or fail" — the right setting for benchmarking.
The ANE profiler (Xcode → Instruments → Core ML template) is the only ground truth. It shows per-op compute-unit assignment and timing. Anything else is guessing.

Qualcomm Hexagon →

Hexagon is Qualcomm's DSP / NPU lineage. Every Snapdragon (and many Qualcomm IoT chips) ships with a Hexagon Tensor Processor (HTP). On Snapdragon 8 Gen 3 / 8 Gen 4: ~45 TOPS at INT8.
The programming model is QNN — Qualcomm Neural Network SDK. Models lower to a Hexagon-specific binary (.bin) executed by the QNN runtime on Android.
INT8 is the bread and butter; INT4 is supported on the latest generations. FP16 works but is generally slower than INT8 on this NPU. Quantization-aware training (QAT) is more important here than on Apple silicon.
Frameworks plug in as delegates: <Term name="executorch">ExecuTorch</Term> ships a Hexagon delegate; ONNX Runtime + QNN execution provider is widely used; TFLite has Hexagon support since 2019.
Snapdragon-shipped phones cover ~80% of high-end Android in 2026. Knowing Hexagon is the price of "ship LLMs to Android" outside the iPhone bubble.

Core ML & ANE →

Core ML is Apple's on-device ML framework. Models are .mlpackage (the modern format, replaces .mlmodel). Run on CPU, GPU (via Metal), or Apple Neural Engine (ANE) — Apple's NPU. The framework dispatches per-op based on what the chosen compute unit can do.
The Apple Neural Engine is the silicon that Apple's iOS Intelligence runs on. ~35 TOPS at INT8 on A17/M3, ~38 TOPS on M4. (The "100+ TOPS" you see in Apple marketing is the full-SoC aggregate across CPU + GPU + ANE, not the ANE alone.) Privacy-preserving (on-device, never the cloud). Heavily optimized for transformer inference in M3/M4 generation.
`coremltools` is the conversion toolkit: PyTorch / TensorFlow / ONNX → .mlpackage. The 2024+ version supports torch.export-shaped graphs and palettized (LUT-quantized) weights.
MLX is Apple's research-grade framework — like PyTorch but Apple-native, lazy-evaluated, runs on Apple silicon's unified memory. Used for training and prototyping; deploy via Core ML for production.
For 2026 iOS apps: Core ML is the path when you need ANE; ExecuTorch (with Core ML delegate) is the path when authoring fluency wins; llama.cpp is the path when binary size and ANE-independence matter.

ExecuTorch →

ExecuTorch is PyTorch's mobile runtime — the answer to "I have a PyTorch model, how do I run it on iOS / Android with no Python?"
The pipeline: torch.export() → EXIR (Edge IR) → quantize via PT2E → partition to backends (XNNPACK CPU, Core ML ANE, Vulkan GPU, Hexagon NPU) → serialize as .pte.
The .pte file is everything the device needs: serialized graph + per-backend kernels + weights. Few-hundred-KB runtime + your model size = total app footprint.
Vendor backends plug in as <Term name="mlir">MLIR</Term>-shaped delegates — Apple ships a Core ML delegate, Qualcomm ships a Hexagon delegate, MediaTek ships a Genio delegate. The partitioner decides which subgraph runs where.
For 2026 production: ExecuTorch is the default for PyTorch → mobile at Meta and many OEM partners (Samsung, Asus). It's where new mobile-AI work in the PyTorch ecosystem starts.

llama.cpp Internals →

llama.cpp is a pure-C++ inference runtime (started by ggerganov, 2023). No PyTorch. No Python. ~50K lines of code; runs Llama-class models on every reasonable platform from a Raspberry Pi to a server GPU.
<Term name="gguf">GGUF</Term> is the file format: weights + tokenizer + chat template + metadata in a single mmappable file. Quantized formats (Q4_K_M, Q5_K_M, Q8_0) are baked in. The format is the standard for local LLMs in 2026 — every consumer-facing tool reads GGUF.
<Term name="mmap">mmap</Term>-first design: the model file is mmap()'d, not read. The OS pages weights in on demand. Cold-start to first token in seconds, not minutes, even for a 70B model on disk.
Backends compose: a single binary runs the same model on CPU (AVX-512, NEON), Apple Metal, Vulkan, CUDA, ROCm. The framework dispatches per-tensor based on the available backend.
<Term name="k-quants">K-quants</Term> are the format's killer feature: better quality than naive INT4 at the same bit budget, baked into the binary, no calibration data needed. Q4_K_M is the universal default.

TFLite & LiteRT →

TFLite is the runtime that ships in every Android device with Google Play Services — billions of installs. As of 2024 it's been renamed to LiteRT ("Lite Runtime") and the Python/Android packages migrated to litert namespaces. Same code, new branding; existing TFLite code still works.
The runtime is tiny (~1 MB binary), reads .tflite flatbuffer files, and exposes a uniform Tensor I/O API across C++, Java/Kotlin, Swift, Python, and JS.
Performance comes from delegates — pluggable backends that move ops off the CPU. The four that matter: XNNPACK (CPU SIMD), GPU delegate (OpenCL/OpenGL/Metal), NNAPI (Android's NPU/DSP routing layer), Hexagon delegate (Qualcomm direct).
Quantization is the same INT8 / INT4 story as elsewhere; the TFLite-specific bit is representative-dataset post-training quantization (PTQ-PCM) and the on-device-friendly selective op fallback when a delegate doesn't support an op.
Use TFLite/LiteRT when: you're shipping to Android, you want NNAPI's automatic NPU routing, your model is a CNN/Transformer that fits ONNX-style ops cleanly. Use llama.cpp instead for chat-style LLMs (the GGUF + KV-cache stack is purpose-built for that).

01 · Systems Foundations

Branch Prediction →

Modern CPUs are deeply pipelined — 15–20 instructions in flight simultaneously. When the CPU hits a branch (if, for, while), it doesn't know which path will execute, so it guesses and starts <Term name="speculative execution">speculatively running</Term>.
If the guess is right: ~free, the speculation commits.
If the guess is wrong: pipeline gets flushed, ~10–20 cycles wasted. A misprediction costs 5–10× a regular instruction.
Modern predictors are really good — branches that follow patterns (always taken, predictable loops, simple conditions) hit ~99% accuracy.
Branches that depend on data with no pattern (sorting random values, hash lookups, RNG branches) hit ~50%. A 1B-iteration loop with 50% mispredict can spend half its time in pipeline stalls.

Cache Lines →

A cache line is the unit of memory transfer between RAM and CPU caches: typically 64 bytes on x86 / ARM, occasionally 128. You never read 1 byte from RAM; you always read 64.
Sequential access uses every byte of each cache line you fetch → bandwidth-efficient. Strided access wastes 87% of the bandwidth (using 8 bytes of every 64-byte fetch).
L1 (32–64 KB), L2 (256 KB – 2 MB), L3 (32 MB – 100+ MB), DRAM (TB) — each level is ~5–10× slower and ~10× larger than the one above. Knowing this hierarchy is the entire CPU performance story.
Cache-friendly = arrays > pointer chasing. Linked lists, hash tables with chaining, tree traversal all suffer from <Term name="cache miss">cache misses</Term>; flat arrays and contiguous structs win on modern hardware.
ML systems engineering applies the same idea on GPU: HBM cache lines are 32–128 bytes; the same locality discipline that wins on CPU wins on GPU coalescing.

NUMA & Topology →

A modern server is multiple sockets (CPUs), each with its own memory controller. Memory attached to one socket is local to it; memory on the other socket is remote — accessing it costs ~2× more latency and ~50% bandwidth.
This is <Term name="numa">NUMA</Term> (Non-Uniform Memory Access). Linux exposes it via /sys/devices/system/node/. Tools: numactl, lstopo, numastat. Pinning processes to NUMA nodes can be a 1.5–3× win on memory-bound workloads.
The same idea scales to GPUs: a multi-GPU node has a topology (<Term name="nvlink">NVLink</Term> mesh, <Term name="nvswitch">NVSwitch</Term>, PCIe). GPU 0 and GPU 1 might be 600 GB/s apart (NVLink); GPU 0 and GPU 7 might be 200 GB/s (one NVSwitch hop).
Across nodes, <Term name="infiniband">InfiniBand</Term> / Ethernet <Term name="rdma">RDMA</Term> add another tier: 25–400 Gb/s between nodes. The topology of a frontier training cluster matters as much as raw GPU count.
<Term name="cxl">CXL</Term> (Compute Express Link, 2019+) is the emerging fabric for cache-coherent shared memory across CPUs and accelerators. Production rollout 2024–2026; reshaping how multi-host AI systems are built.

Profiling Tools →

You can't optimize what you can't measure. Profiling is the discipline of turning "this is slow" into "this specific thing is using N% of cycles for reason X."
CPU profilers: perf (Linux), Instruments (macOS), VTune (Intel). Sample-based — interrupt periodically, record the call stack, aggregate.
GPU profilers: NVIDIA Nsight Systems (nsys) for whole-program timeline; Nsight Compute (ncu) for per-kernel deep dive. AMD has rocm-smi + rocprof.
<Term name="ebpf">eBPF</Term> tools — bpftrace, BCC, Pixie — let you instrument the running kernel without modifying source. Modern Linux observability.
PyTorch `torch.profiler` is the AI-specific tool: layer-level traces, kernel attribution, memory snapshots, integration with Chrome's perfetto for visualization.
Always profile before optimizing. "I thought it was the matmul; turned out it was a Python list comprehension" is the most common debugging story.

PyTorch Caching Allocator →

`cudaMalloc` and `cudaFree` are slow — ~50 μs and ~10 μs respectively. Calling them on every forward / backward step would dominate small-model training.
The PyTorch caching allocator is a userspace pool: it grabs large chunks from cudaMalloc upfront, hands out sub-allocations, and reuses them as tensors come and go. `torch.empty(N)` typically costs 0–1 μs in steady state.
The same pattern shows up everywhere production C++ goes fast: jemalloc, tcmalloc, <Term name="arena">arena allocators</Term>, slab allocators. "Don't call malloc on the hot path" is the universal performance discipline.
The PyTorch allocator splits, merges, recycles blocks; tracks per-stream allocations to avoid hazards on async work; offers an "expandable segments" mode (2024+) that grows as needed.
Read `torch.cuda.memory_summary()` to see what the allocator is doing. Fragmentation, peak allocation, and reserved-vs-allocated are the metrics that matter.

Move Semantics →

A copy of a vector<float> of size N allocates a fresh heap buffer and memcpys N floats. A move swaps internal pointers and leaves the source empty. Same result; ~10000× faster.
C++11 added rvalue references (`T&&`) and std::move to express "I'm done with this; you can take its guts." Modern C++ depends on them everywhere — every standard container, every smart pointer.
The rule of zero is the modern C++ design ideal: write classes that don't need a destructor, copy constructor, or move constructor — let the compiler synthesize them from member types. PyTorch's tensor classes follow this almost completely.
Return-value optimization (<Term name="rvo">RVO</Term>) has made many naive returns "free" for years; move semantics fill in the gaps RVO can't reach (e.g., returning one of two locals).
For ML systems code: heap allocations on a hot path are the enemy. Move semantics + small-buffer optimization + arena allocators are the toolkit for keeping tensors out of malloc.

Smart Pointers & RAII →

RAII = Resource Acquisition Is Initialization. The C++ pattern where a resource (memory, file, lock, mutex) is owned by an object whose <Term name="destructor">destructor</Term> releases it. The single most important C++ idea.
`std::unique_ptr<T>` owns a heap allocation. Non-copyable, movable. Destructor calls delete. Zero overhead vs raw pointer. Default for "this is mine."
`std::shared_ptr<T>` is reference-counted ownership — destructor decrements the refcount; when it hits 0, delete. ~24 bytes overhead per pointer + atomic refcount operations on copy / destroy. Default for "shared between threads / scopes."
`std::weak_ptr<T>` is a non-owning observer for shared_ptr. Used to break cycles.
ML systems code uses unique_ptr for owned buffers, shared_ptr for tensor storage (PyTorch's intrusive_ptr is a custom variant), and almost never raw new / delete.

Stack vs Heap →

The stack is fast, automatic, and tiny — like a stack of plates that the function call itself manages.
The heap is large, flexible, and slow to allocate from — you ask for memory, you (or RAII) give it back.
A stack allocation is ~1 ns; a heap allocation is ~50–500 ns. The 100× gap is the lesson.
ML runtimes obey one rule on the hot path: don't touch the allocator. Pre-allocate everything at startup and reuse.

Async Runtimes →

Async lets one thread juggle thousands of concurrent operations by suspending when blocked (waiting for I/O, a future, a timer) and resuming when ready. <Term name="coroutine">Coroutines</Term> are the language-level mechanism.
A runtime schedules suspended coroutines onto threads. Python's asyncio, Rust's tokio, C++ coroutines + executors, JavaScript's <Term name="event loop">event loop</Term> — all the same model with different syntax.
Async wins when most of your time is spent waiting — network I/O, database queries, LLM API calls, file reads. For pure CPU work, threads (or processes) are still the right choice.
<Term name="cuda stream">CUDA streams</Term> are GPU async: a stream is an ordered queue of GPU operations; multiple streams run concurrently on the same GPU. Every modern PyTorch / CUDA program implicitly uses streams; explicit stream control unlocks overlap-comm-with-compute optimizations.
Modern serving stacks (vLLM, FastAPI, SGLang) are entirely async on the orchestration side. Hundreds of concurrent connections handled by one process; the heavy compute is dispatched to threads/GPUs that run in the background.

SIMD →

<Term name="simd">SIMD</Term> = Single Instruction, Multiple Data. One CPU instruction operates on a whole vector of values. <Term name="avx">AVX-512</Term> does 16 FP32 ops per cycle; ARM SVE2 is similar; Apple <Term name="amx">AMX</Term> adds matrix-shaped ops.
It's a third level of parallelism below threads and processes — same core, multiple lanes per instruction.
Compilers auto-vectorize simple loops with -O3 -march=native. They struggle with: pointer aliasing, conditional branches, non-trivial reductions. Anything they don't vectorize, you write <Term name="intrinsic">intrinsics</Term> for.
*Pyodide / numpy / `__builtin_` intrinsics** are the layered API. Numpy's array operations dispatch to SIMD-optimized C; Eigen / xsimd / highway / std::experimental::simd give portable C++ access.
For ML on CPU: llama.cpp's <Term name="neon">NEON</Term> / AVX-512 paths are the production reference. SIMD is what makes "decode a 7B Q4_K_M on a phone" runnable.

Threading →

A thread is a kernel-scheduled unit of execution sharing the address space of a process. Creating one costs ~10 μs; switching between two costs ~1–10 μs depending on cache thrashing.
Don't fan out to one thread per task. Use a thread pool sized to the core count; queue tasks. The thread-per-task model dies under any real workload.
Synchronization primitives: <Term name="atomic variable">atomics</Term> (cheapest, lock-free), <Term name="mutex">mutexes</Term> (general-purpose), condvars (wait/signal), semaphores (counters), barriers (group sync). Pick by the data structure, not the syntax.
<Term name="lock contention">Lock contention</Term> is the silent killer. Two threads fighting over one mutex serialize each other; you've added overhead with no parallelism. The fix is usually: shard the data, use lock-free structures, or relax the consistency requirement.
The Python <Term name="gil">GIL</Term> prevents true CPU parallelism in pure Python — but releases for C extensions (NumPy, PyTorch, etc.). For Python-heavy work use multiprocessing; for C-heavy work threading is fine.

04 · LLM Architecture

FlashAttention-3 →

Standard attention writes the $N \times N$ matrix to HBM, then reads it back. That's $O(N^2)$ memory traffic and is the bottleneck — not the FLOPs.
FlashAttention-1/2 tile the computation, never materialize the full $N \times N$ matrix in HBM, and use the online softmax trick to fuse softmax with matmul.
FlashAttention-3 (2024) rewrites for Hopper: TMA async copies, warp-specialized producer/consumer, FP8 with block-scaled accumulation. ~75% of H100 peak for fp16/bf16, and FP8 throughput nearly doubles that.
Net effect on inference: same model, same numbers — 2–4× faster attention layer, longer contexts feasible, lower memory pressure.

GQA, MQA & MLA →

MHA has one K/V head per Q head — every token's KV cache scales with $H$ heads. Memory bottleneck.
MQA (Shazeer, 2019): one K/V head shared across all Q heads. Cache shrinks $H\times$. Quality drops noticeably at scale.
GQA (Ainslie et al., 2023): groups of Q heads share K/V. Default in Llama-3, Mistral, Qwen-2.5 — cache shrinks 4–8×, quality matches MHA.
MLA (DeepSeek-V2/V3, 2024): K/V projected into a low-rank latent space, decompressed only when needed. Cache shrinks another 5–10× vs GQA, with quality matching or exceeding MHA.
The tradeoff is always cache size vs quality. As of April 2026, GQA is the floor and MLA is the frontier — DeepSeek-V3 (Dec 2024) and Kimi-style follow-ups have proven MLA at 600B+ scale, and other open-frontier labs are evaluating it for their next-gen long-context models.

Multi-Head Attention →

An attention head is a triplet (Q, K, V) of projections from d_model → d_head. MHA runs `H` of them in parallel and concatenates the outputs.
The whole thing is one big matmul, one reshape, one scaled-dot-product, one concat, one matmul. There is no separate code path per head — heads are an axis, not a module.
The scale factor 1/√d_k exists so the softmax doesn't saturate as d_head grows; without it, training a wide head diverges.
Per-token cost is O(N · d_model²) for the projections and O(N² · d_model) for attention. The N² term is what you spend the rest of this track learning to tame (GQA, KV cache, FlashAttention, paged, chunked prefill, MLA).

RoPE + YaRN / LongRoPE →

RoPE (Rotary Positional Embedding, Su et al., 2021) encodes position by rotating query and key vectors in 2D pairs by an angle proportional to position. The dot product $\langle q_m, k_n \rangle$ then depends only on the relative position $m - n$.
It's the de facto standard in 2024–2026: Llama family, Mistral, Qwen, DeepSeek, Gemma all use it.
The base wavelength $\theta$ controls how far the model can extrapolate. Standard $\theta = 10{,}000$ caps at ~4–8K tokens. Beyond that, attention scores collapse.
NTK-aware / YaRN / LongRoPE are the three rescaling techniques that extend a model's context to 32K → 128K → 1M tokens with light continued pretraining.
For inference, PI (Position Interpolation), NTK-by-parts, and YaRN are the three you'll see in real configs. LongRoPE (2024) is the current frontier for 2M+ token contexts.

Chunked Prefill →

Pre-2024 servers ran prefill and decode as separate, alternating phases. A 16K-token prefill stalled every decoder behind it for hundreds of milliseconds.
Chunked prefill slices each prefill into fixed-size chunks (e.g., 2048 tokens) and packs them into the same forward pass as ongoing decodes. The GPU runs one big batched matmul; everyone makes progress.
vLLM v1 was rebuilt around this primitive. It is the default scheduling mode — there is no longer a "prefill phase" or a "decode phase," only "the next batch."
Net effect: <Term name="tpot">TPOT</Term> becomes stable (no more decode stutter when a long prompt arrives), <Term name="ttft">TTFT</Term> degrades gracefully (long prompts pay slightly more by being chunked), throughput holds within ~5% of pure-prefill.
Disaggregation (disaggregated serving) is the split-into-two-pools answer; chunked prefill is the one-pool-done-right answer. They're complements, not competitors — small fleets use chunked; big fleets disaggregate and chunk inside each pool.

Sampling →

The model outputs logits — one number per vocab token. Sampling is everything that turns those logits into a single chosen token.
Temperature rescales logits before softmax. T < 1 sharpens; T > 1 flattens; T = 0 is greedy argmax.
Top-k keeps only the k highest-probability tokens, renormalizes, samples. Crude but cheap.
Top-p (nucleus) keeps the smallest set whose cumulative probability ≥ p, renormalizes, samples. The default for Llama / Mistral / GPT for years.
Min-p keeps tokens whose probability is ≥ p_min × max_prob. Adapts to the distribution's confidence — wider when the model is uncertain, narrower when it's sharp. The 2024–2026 frontier; default in DeepSeek-R1, Qwen-3, and llama.cpp.
Order of operations matters. Logit penalties → temperature → top-k → top-p / min-p → sample. Doing it in the wrong order changes the result.

Speculative Decoding →

A small draft model generates $K$ candidate tokens. The big target model verifies all $K$ in a single forward pass. Accepted tokens become output; rejected ones trigger a fallback sample.
The math: a single $K$-token target forward is much cheaper than $K$ sequential forwards. As long as the draft is right most of the time, you get 2–3× decode speedup with zero distribution change.
EAGLE-3 (2024–2025) is the current frontier — auto-regressive draft heads attached to the target, ~80% acceptance rate, ~3× speedup on common workloads.
Medusa uses parallel multi-token prediction heads on the target itself. Simpler than draft-model methods; ~2× speedup.
Lookahead / PLD (Prompt Lookup Decoding) needs no draft model — speculates from n-grams in the prompt. Free at inference time; works best on summarization / repetitive output.

Structured Output →

"Respond in JSON" prompts still produce malformed output 5–15% of the time on current frontier models. That's a production failure, not a glitch.
Constrained decoding compiles your schema/grammar into a finite-state machine, then at every decode step masks logits for tokens that would violate the grammar. Output is guaranteed parseable.
XGrammar (2024) is the current speed champion: a context-free grammar engine with a token-mask cache that makes the overhead under 5% of decode time. Default in vLLM v1, SGLang, TensorRT-LLM.
Outlines (2023) pioneered the FSM-mask approach. Slower than XGrammar in some cases; still widely used for its Python-level ergonomics.
llama.cpp's GBNF is the same idea for the local-LLM world. Same speed regime, same correctness guarantee.
Don't use it on reasoning traces. Constrained decoding hurts free-form thought; apply it only at the final-answer extraction step.

Disaggregated Serving →

Prefill is compute-bound (long parallel matmul on the prompt). Decode is memory-bandwidth-bound (one token at a time, re-reading the whole KV cache). Running them on the same GPU forces a compromise both jobs hate.
Disaggregated serving = two GPU pools. The prefill pool runs prompts to completion, then transfers KV to a decode pool that streams tokens to the user. Each pool is sized and tuned for its job.
Pioneered as a research idea in DistServe (OSDI 2024) and shipped at scale in Mooncake (Moonshot AI, 2024) and vLLM-disagg / SGLang-disagg (2024–2025). By 2026 it's standard in any frontier-model serving stack.
The KV transfer between pools is the crux. NVLink intra-node, RDMA / NIXL inter-node. Round-trip cost is usually 5–30 ms — hidden behind the first decode step.
Net win: 3–4× more decode throughput at the same SLO, or 2× tighter TTFT at the same throughput. Big wins for long contexts and SLOs that bound TTFT and TPOT (time-per-output-token) separately.

KV Cache Basics →

Without a KV cache, every token would re-attend over the entire prompt. With it, each new token does one extra row of attention math.
The cache is 2 × n_layers × n_kv_heads × head_dim × seq_len × dtype_bytes. For Llama 3.1 70B at 32K context that's ~10 GB — often more than the weights.
The cache lives on the GPU and grows linearly with each generated token. Long contexts OOM on the cache long before they OOM on weights.
Every modern serving optimization (PagedAttention, prefix caching, MLA, KV quantization) is fundamentally about managing this one buffer.

PagedAttention (vLLM) →

The naive KV cache reserves a contiguous slab equal to the max context for each sequence. Most sequences are shorter — 60–80% of GPU memory is wasted on internal fragmentation.
PagedAttention (Kwon et al., SOSP 2023, the vLLM paper) treats the cache like an OS treats virtual memory: small fixed-size blocks + per-sequence block tables + a block allocator.
Result: 2–4× higher throughput than HuggingFace TGI on the same hardware. vLLM's defaults match the paper's design.
vLLM v1 (late 2024) refactored the engine around chunked prefill + PagedAttention as the unified primitive. SGLang's RadixAttention is the alternative paradigm — same paged storage, plus prefix-tree sharing.
This is the single most impactful kernel-level change in LLM serving since FlashAttention. Required reading.

Prefix & RadixAttention →

Most production traffic isn't unique. Two requests share a system prompt; ten requests share a chat history; a hundred requests share a few-shot template. Prefill those shared tokens once, reuse the KV across requests.
vLLM v0.5+ ships automatic prefix caching (APC): every block whose contents have been seen before is served from a hash table instead of recomputed.
SGLang's RadixAttention generalises APC to a radix tree — every prefix of every conversation is a node, every block is a tree edge, eviction is LRU on the tree. Sharing extends down the entire tree, not just the root prompt.
Real workloads see 3–10× throughput on TTFT (time-to-first-token) when prefix hit rate is high. Chatbots, agents, evals, and few-shot inference are all prefix-heavy by nature.
Cache reuse is correct only when the prefix tokens are byte-identical (after tokenization). Tokenizer drift, hidden whitespace, RNG-seeded prompts → cache miss.

02 · ML Execution & Quantization

GEMM (Hopper / Blackwell) →

GEMM = General Matrix Multiplication. `C = α·A·B + β·C`. ~80% of LLM training and inference flops live here.
A naive matmul reads each input element $O(N)$ times — fully memory-bound. Production kernels do cache blocking at multiple levels (registers, shared memory, L2) so each byte from HBM gets reused $\sim 64–256\times$.
On Hopper / Blackwell, the kernel structure is TMA + WGMMA + warp specialization — async copies overlap with tensor-core matmul, and producer/consumer warps split the work.
CUTLASS 4 (NVIDIA) and Triton 3.x are the two ways most teams write it in 2026. Hand-tuned CUTLASS still wins by 5–10% on edge shapes; Triton wins on developer time.
For LLM inference at small batch (decode), GEMM becomes GEMV — different optimization regime, memory-bound, where INT4/FP8 weight quant rules.

Shared Memory →

Shared memory (SMEM) is per-SM, programmer-managed, on-chip scratch. ~228 KB on H100, 256 KB on B200. Latency: ~30 cycles. Bandwidth: ~10× HBM.
Every fast kernel uses SMEM as a staging area: load a tile from HBM into SMEM, all the threads in the block access it many times, then write the result back. Without SMEM tiling, kernels are HBM-bandwidth-bound and run at under 10% of peak.
SMEM is split into 32 banks (one per warp lane). If two threads in a warp hit the same bank with different addresses → bank conflict → serialization. A single 32-way conflict cuts throughput by 32×.
The standard fix is swizzling: a permutation of the SMEM index that hashes consecutive threads to different banks. CUTLASS/CUTE do this for you; understanding the principle is what lets you debug a slow kernel.
On Hopper+, the TMA (Tensor Memory Accelerator) is a hardware async-copy engine that loads tiles into SMEM in the background while the rest of the warp computes. This is the foundation of every modern AI kernel.

SM Architecture →

A modern GPU is a fleet of Streaming Multiprocessors (SMs). Each SM is its own little processor: 4 warp schedulers, 64–128 CUDA cores, 4 tensor cores, a register file, and shared memory. Everything that runs on a GPU runs on one SM at a time.
The warp is the real unit of execution: 32 threads marching in lockstep through one instruction at a time. Threads in a warp can't truly diverge — they take turns when they branch. A kernel that thinks "thread" is the unit is going to be slow.
Tensor Cores do the heavy lifting in modern AI: 4×8×16 (or larger) FP16/BF16/FP8 matrix-multiply-accumulate per cycle per core. Hopper has 4 per SM; Blackwell has more and faster ones with FP4/FP6 support.
Occupancy is the fraction of an SM's potential warps that are actually resident. High occupancy hides memory latency; low occupancy with high arithmetic intensity is fine. Don't chase occupancy as an end in itself.
The same kernel on H100 (132 SMs) vs B200 (148 SMs) vs MI355X (288 CUs) sees different parallelism budgets. Tile size and CTA count are tuned per-chip.

Thread Hierarchy →

A CUDA kernel launches a grid of thread blocks (CTAs). Each block runs on a single SM. Each block is a set of threads, organized internally into warps of 32.
Hierarchy: grid → block → warp → thread. Each level has its own coordinates (blockIdx, threadIdx, etc.) and its own synchronization primitives.
The block-per-tile pattern is universal in modern AI kernels: each block computes one output tile (e.g., 128×128 of a matmul). The grid is (M/128, N/128). Index math turns block coordinates into tile coordinates.
Synchronization is hierarchical: __syncthreads() syncs a block, __syncwarp() syncs a warp, the CUDA driver syncs the grid. Cooperative groups generalize this on H100+ with distributed shared memory (DSMEM) — adjacent blocks share an SMEM cluster.
Triton hides this hierarchy. A Triton kernel sees one program per tile; the warp/thread layer is implicit and the compiler decides. Same mental model, less typing.

FP8 Inference →

FP8 is an 8-bit floating-point format (sign + exponent + mantissa). Two variants: E4M3 (4 exponent bits, 3 mantissa, range ~ ±448, used for weights/activations) and E5M2 (5 exponent bits, 2 mantissa, range ~ ±57344, used for gradients during training).
Replaced INT8 as the standard quantization format on H100+ because: it preserves exponent flexibility (no per-tensor calibration mess), tensor cores natively support it, and accuracy is meaningfully better at the same bit-width.
Hopper added FP8 tensor cores; Blackwell added FP4 / FP6 alongside. All modern AI compilers (CUTLASS, Triton, vLLM, FlashAttention-3) support FP8 weights + activations + KV cache.
Quality regression from BF16 → FP8 is typically under 0.5 points on standard benchmarks (MMLU, GSM8K). KV-cache FP8 quantization adds ~0 measurable regression for most models.
Per-tensor scaling is the magic — the format itself has limited range, but each tensor (or each block) carries a scale factor so the actual values can span any magnitude. The scale is stored in higher precision (BF16 or FP32).

INT4 / AWQ / GPTQ →

INT4 stores weights as 4-bit integers. Two weights per byte, 4× compression over BF16. The cost: more accuracy work — naive INT4 quantization regresses noticeably; modern recipes (AWQ, GPTQ) close most of the gap.
GPTQ (Frantar et al., 2022) is a post-training <Term name="quantization">quantization</Term> recipe: iteratively quantize weights using second-order info (Hessian-based) to minimize per-layer reconstruction error. Offline; needs a <Term name="calibration data">calibration set</Term>; produces ~0.5–2 pt regression on MMLU.
AWQ (Lin et al., 2024) is activation-aware: it identifies "salient" weights (those that hugely affect activations) and protects them by per-channel scaling. Often outperforms GPTQ at the same bit-width; faster to apply.
INT4 is the format for: 4-bit on-device inference (llama.cpp Q4_K_M, <Term name="gguf">GGUF</Term>), low-cost serving with very large models on a single GPU, edge AI. Not used for training — INT4 gradients are too lossy.
The kernel matters as much as the format. INT4 weights need fast unpack-and-multiply kernels (Marlin, exllama, AWQ-CUTLASS); without them, the "4× smaller" doesn't translate to "much faster."

MXFP4 / NVFP4 →

MXFP4 (Microscaling FP4) is an OCP-standardized 4-bit floating-point format with per-block scale factors. Each 32-element block has its own E8M0 (8-bit exponent) scale; the elements themselves are 4-bit floats (E2M1: 1 sign, 2 exp, 1 mantissa).
NVFP4 is NVIDIA's variant — same idea, slightly different scale factor (E4M3 instead of E8M0), block size 16 instead of 32. Higher accuracy at slightly higher metadata cost.
Blackwell tensor cores natively support these formats. MXFP4/NVFP4 weights × FP16 activations → FP16 accumulator at full tensor-core throughput. No software unpack-and-dequantize.
Quality vs INT4 AWQ/GPTQ: comparable on most benchmarks; better on long-tail outliers because the floating-point representation handles wide dynamic range that INT4 has to scale for.
Why this matters for 2026–2027: if Blackwell tensor cores are 2× faster on FP4 than FP8, and quality holds, FP4 becomes the default inference format. INT4 may become legacy.

Rotation Quantization →

LLM activations have outliers — a few channels with magnitudes 10–100× larger than typical. They wreck quantization (the scale gets pulled too wide; non-outlier values lose precision).
Rotation quantization applies an orthogonal matrix R to the activations and a compensating R⁻¹ to the next layer's weights. The math is unchanged, but the rotated activations have no outlier channels — their magnitudes are spread evenly.
Rotated activations quantize cleanly to INT4 / FP4 / FP8 without outlier-protection tricks like AWQ. QuaRot (Ashkboos et al., 2024) was the first published version; SpinQuant (Liu et al., 2024) learns the rotation; both deliver near-FP16 accuracy at INT4.
The rotation cost: one extra small matmul per attention block at runtime — typically under 2% throughput hit. The accuracy gain at low bit-widths can be 2–5 points on MMLU.
Where it matters most: aggressive 4-bit and below. For FP8, outliers usually don't break things. For 4-bit (MXFP4, INT4) and below, rotation is increasingly default — DeepSeek-V3 uses a related trick during training; production INT4 stacks like vLLM v1 ship rotation as an option.

Contiguous vs Non-Contiguous →

Contiguous means: walking through the tensor in its standard order also walks through memory in order. Cache lines are reused; coalesced GPU loads work.
Non-contiguous = the iteration order doesn't match memory order. Cache misses on every step; GPU loads serialize. Slowdown is 5–20× depending on tensor size and access pattern.
tensor.is_contiguous() checks; tensor.contiguous() returns a new contiguous copy if needed (or the same tensor if it already is).
The most common source of non-contiguity is transpose() followed by an op that walks the wrong axis. Adding .contiguous() after the transpose is the fix; sometimes the right answer is to not transpose and use bmm with the right axes instead.
CUDA kernels that aren't aware of strides typically require contiguous input — they index with base + i*stride assuming stride is implicit. Pass non-contiguous → silent garbage or seg fault.

Strides & Layout →

A tensor is (buffer, shape, strides, dtype, offset). The buffer holds the bytes; shape and strides decide how to walk them.
Stride[i] = how many elements to skip to advance by 1 along axis i. For a contiguous row-major (M, N) tensor: strides are (N, 1). Column-major: (1, M).
Most tensor ops can be done without copying by changing strides: transpose, view, permute, expand, slicing along an axis. That's why T.transpose(0, 1) is O(1) — it just swaps strides.
The cost shows up later: a non-contiguous tensor read in the wrong order can be 10× slower than a contiguous one because it ruins cache locality and breaks coalesced GPU loads.
Almost every "why is my model slow?" mystery in PyTorch traces back to strides. A transpose followed by a view blows up with a stride error; an as_strided on a temporary creates a footgun. Reading strides is reading the cost model.

Build a Tensor Library →

A tensor library is smaller than people think: ~200 lines of Python gets you strided views, broadcasting, and a handful of ops good enough to train MNIST.
The four types you need: Buffer (raw memory), Tensor (buffer + shape + strides + offset), Op (forward + backward), Engine (the autograd graph + topological execution).
Once you've written one yourself, every "magic" in PyTorch / JAX becomes legible: shape errors, stride bugs, retain_graph weirdness, why .detach() exists.
Real production libraries (PyTorch, JAX, tinygrad) add: dispatch to backends (CUDA, MPS, ROCm), op fusion, JIT compilation, distributed semantics. All of this is bolted onto the same core abstraction.
The lesson companion is the module capstone — a 200-line tinygrad clone running MNIST.

TMA & cp.async →

The single biggest perf insight in modern GPU kernels: don't make threads wait for memory. Use a hardware engine to copy bytes from HBM to SMEM in the background while the warp keeps computing.
`cp.async` (Ampere, 2020) was the first such instruction: a thread issues an async copy from HBM to SMEM, then later does a cp.async.wait_group when it actually needs the data. While waiting, the warp can do other work.
TMA — Tensor Memory Accelerator (Hopper, 2022) generalizes this: one thread issues a copy of an entire 2D/3D tile (described by a precomputed TMA descriptor), and a hardware engine performs the entire copy. No 32-thread cooperation needed.
TMA + warp specialization + multi-stage pipeline is the recipe behind FlashAttention-3, every CUTLASS 4 GEMM, all of cuBLAS on Hopper, and most modern Triton kernels: producer warps issue TMA loads, consumer warps do the math, perfect overlap.
Blackwell extends the design with a CTA-cluster TMA: one TMA descriptor can populate SMEM across multiple CTAs in a cluster simultaneously, enabling new kernel topologies.

03 · Training & RLHF

Data Parallel & DDP →

Data parallel (DP) = each GPU holds a full copy of the model weights, processes a different slice of the batch, then averages gradients across GPUs at backward time.
PyTorch DDP is the canonical implementation. Wraps your model in one line; the framework adds AllReduce calls during backward.
The communication primitive is Ring-AllReduce — gradients flow around the GPU ring in two passes (reduce-scatter, all-gather). Cost: 2 × (N-1) × params / N bytes per GPU per step.
Gradient bucketing groups small parameter tensors into fixed-size buckets so the AllReduce traffic happens in a few large transfers instead of many tiny ones. ~10× higher achieved bandwidth.
Comm-compute overlap — start AllReduce on layer N's gradients while still computing backward on layer N-1. This is what gets DDP from "parallel but not scaling" to "70%+ efficient up to ~32 GPUs".

ZeRO & FSDP2 →

DDP replicates the full model on every rank. Out of memory at large scale.
ZeRO (Rajbhandari et al., 2020) progressively shards what's replicated: stages 1 (optimizer state), 2 (+ gradients), 3 (+ parameters). Stage 3 = the whole model lives across ranks; only the active layer is materialized.
FSDP is PyTorch's ZeRO-3 (2022). FSDP2 (2024) replaces module-level sharding with per-parameter sharding via DTensor, composes cleanly with TP/PP, much faster wraps, simpler code.
For a 70B BF16 model on 8 GPUs: DDP doesn't fit at all. FSDP2 fits with room for a 32K-token batch.
Default in production through 2026: FSDP2 (HSDP) — hybrid sharding within a node + replication across nodes — composed with tensor parallel and pipeline parallel for very large runs.

Pipeline & Zero-Bubble →

Pipeline parallel (PP) splits the model across GPUs by layer. GPU 0 holds layers 0–N, GPU 1 holds N–2N, etc. Activations flow forward through the pipeline; gradients flow backward.
The naive version (GPipe, Huang et al., 2019) has a bubble: when GPU N starts forward, GPU 0 is already done and idle. The first stages spend much of the step waiting.
1F1B (one-forward-one-backward) interleaves forward and backward to fill the bubble. Default in Megatron-LM and DeepSpeed since 2020.
Interleaved 1F1B / virtual stages: split each GPU's layers into multiple "virtual stages" so the bubble shrinks further. Megatron-Core's default for >8 stages.
Zero-bubble (Qi et al., 2024) is the 2024 breakthrough — a careful schedule that fully eliminates pipeline bubbles by separating gradient computation from weight updates. Now ships in TorchTitan and Megatron-Core.

Tensor Parallel →

Tensor parallel (TP) splits each individual matmul (and embedding, attention) across multiple GPUs. Each GPU holds a slice of every weight matrix; activations get AllReduced after the slice.
Megatron-LM introduced the canonical TP design (Shoeybi et al., 2019): column-parallel for the first matmul of an MLP, row-parallel for the second, two AllReduces per transformer layer (one in attention, one in MLP).
TP shards weights and compute → memory and FLOPs scale roughly linearly with TP degree. The cost is communication: AllReduce on the activations every layer, twice.
TP works best within a single NVLink domain (one DGX node, ~8 GPUs). Across nodes, comm bandwidth tanks and TP scaling collapses. Production: TP=8 within node, DP / PP across nodes.
The 2026 default for training a 70B+ model: TP=8 + FSDP (DP / sharded) + PP=4. This is what TorchTitan, Megatron-Core, NeMo all assume.

Backprop as a Graph →

Backprop isn't really about derivatives — those are bookkeeping. It's about executing a computational graph backward, with the forward activations as the binding state.
Activations dominate training memory. For a 7B model at 8K context, activations can exceed weights by 5–10× without checkpointing.
PyTorch's autograd records a tape of operations during forward; backward replays it in reverse, freeing tensors as soon as their gradient is computed.
VJP (Vector-Jacobian Product) is the right mental model — .backward() doesn't compute Jacobians, it computes $J^\top v$ for some upstream $v$, one op at a time.
The big systems consequence: gradient checkpointing trades compute for memory by recomputing activations during backward instead of storing them.

FP8 Training (DeepSeek-V3 Recipe) →

FP8 training runs forward and backward passes in 8-bit floats (E4M3 or E5M2), with a higher-precision master copy of weights and a scaling strategy to keep values in dynamic range.
On H100/B200, FP8 GEMM is ~2× the throughput of BF16 on tensor cores. End-to-end training is typically 1.4–1.8× faster wall-clock than BF16, with no quality regression at scale.
DeepSeek-V3 (Dec 2024) is the most-cited public recipe: per-tile/per-block scaling, FP8 forward + backward, FP32 master + FP32 reduce-scatter, "increased-accuracy accumulation" via partial sums in FP32.
Blackwell B200/B300 (2025) added NVFP4 (4-bit float) and <Term name="mxfp8">MXFP8</Term> (block-scaled 8-bit) in tensor cores, opening the door to 4-bit training experiments through 2025–2026.
Below-frontier scale, BF16 + FP8 inference is more common than FP8 training. The training recipe is heavyweight; test on a 1–7B model before committing.

LR Schedules & WSD →

Cosine decay was the default for GPT-3 through Llama 2: the LR follows a half-cosine from peak to ~0 over a fixed number of steps. Works well but requires knowing total training steps upfront.
Warmup-Stable-Decay (WSD) splits training into three phases: a short linear warmup (0.5–5% of steps), a long stable plateau at peak LR (80–90%), and a final cooldown (5–15%). MiniCPM, DeepSeek-V3, and Qwen 2.5 all use variants of WSD.
WSD's killer advantage: you don't need to decide total compute budget before training starts. You can train indefinitely at the plateau, then cool down whenever you want a strong checkpoint.
The cooldown phase does most of the final loss improvement — it suppresses gradient noise and lets the model settle into a sharper minimum. Skipping it costs 0.5–2% quality.

AdamW → Lion → Muon →

SGD-momentum uses one running average of gradients. Cheap (1× param state) but slow on ill-conditioned losses.
Adam / AdamW add a per-parameter scaling via second-moment estimates. Costs 2× param state in optimizer memory but converges robustly. The default for ~all LLM pretraining 2018–2024.
Lion (Chen et al., 2023): one running average, sign-of-update only. Cuts optimizer state to 1× params, often matches AdamW. Used in PaLM follow-ups, occasionally in Llama-class runs.
Muon (Jordan et al., 2024): orthogonalize updates via Newton-Schulz iteration. Frontier for hidden-layer parameters in 2024–2025; Llama-4 reportedly uses Muon-flavored optimizers in 2025.
Sophia (Hessian-aware) was a 2023 candidate; mostly displaced by Lion/Muon in 2025 due to compute cost.

DPO / IPO / KTO →

DPO (Direct Preference Optimization) — Rafailov et al., 2023 — replaces PPO-RLHF with a closed-form classification loss on chosen vs rejected response pairs. No reward model, no RL, no rollouts. Same gradient direction; vastly simpler training.
The DPO loss derives from the same Bradley-Terry preference model RLHF uses, but with the reward function eliminated analytically — the policy is its own implicit reward model.
IPO (Identity Preference Optimization) — Azar et al., 2023 — a regularization fix for DPO that prevents over-optimization when preference data is noisy or near-tied.
KTO (Kahneman-Tversky Optimization) — ContextualAI, 2024 — uses a single thumbs-up / thumbs-down label per response (no pairs needed). Closer to real production feedback signal.
For 2026 production: DPO is the default for paired-preference data; KTO when you only have unary feedback. IPO and other variants come up when DPO drifts.

GRPO & RL Reasoning →

GRPO (Group Relative Policy Optimization) is the RL algorithm DeepSeek used for R1. It removes PPO's value function — uses group-relative advantages instead.
The training signal is verifiable: math problems checked by a parser, code checked by unit tests. No reward model, no human labels.
The model learns to emit long chain-of-thought by RL on this signal alone — DeepSeek-R1-Zero went from 15% → 71% on AIME with no SFT, just GRPO on math.
R1 final uses cold-start SFT → GRPO → SFT-on-best-rollouts → another GRPO pass. The "aha moment" — the model spontaneously starts saying "Wait, let me reconsider" — emerged purely from the RL signal.
OpenAI's o1/o3 family is widely believed to use a similar (proprietary) RL-on-reasoning recipe. As of April 2026, GRPO and its variants are the dominant post-training paradigm for reasoning models.

LoRA, QLoRA & DoRA →

LoRA trains two small low-rank matrices A and B instead of the full weight W. Update is W + (B·A) × scale. Storage and gradient memory drop ~100×.
QLoRA keeps the base model's weights frozen and quantized to NF4 (4-bit), only LoRA adapters are FP16. This is what lets you fine-tune Llama-3-70B on a single 80 GB GPU.
DoRA decomposes weight updates into magnitude + direction. Closes most of the gap between LoRA and full fine-tuning at small additional cost.
Production serving uses multi-LoRA (S-LoRA, vLLM --enable-lora) — one base model, many tenant-specific adapters hot-swapped per request.
Rule of thumb: rank r=16, alpha=32 is the sane default. Bump r to 64 only if you can't fit the task otherwise.

SFT & Instruction Tuning →

Supervised Fine-Tuning (SFT) is the first post-training step: you train on (instruction, response) pairs so the model learns to follow directions instead of just completing text.
Chat templates (ChatML, Llama 3 format) wrap conversations with special tokens that delineate system/user/assistant turns. Using the wrong template at inference = broken model. Always match training and serving templates.
Prompt loss masking zeros the loss on instruction tokens (set labels to -100). The model only learns to generate responses, not to memorize your prompts.
Sample packing concatenates multiple short conversations into one sequence, eliminating padding waste. With proper attention masking, this gives 2–4× training throughput on typical chat datasets.
Quality > quantity. 10K carefully curated examples often beats 1M noisy ones. The LIMA paper (2023) showed 1K high-quality examples suffice for strong instruction-following.