Module capstone — build it
Train your first GPT — and prove Muon beats AdamW
A 30M-param GPT trained on TinyStories from scratch in 4 hours of free Colab. Plot AdamW vs Lion vs Muon and stake a position.
Intermediate · One weekend (~4-6 hours of training) · Free Colab T4
Pretrain a tiny GPT (4 layers, 384-dim, 30M params) on the TinyStories dataset. Train three identical runs with AdamW, Lion, and Muon. Same data, same seed, same compute. Measure loss curves, optimizer-state memory, and final perplexity. Settle the argument.
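The fair-comparison setup above can be sketched as a small harness: one model factory, one seeded data stream, and a dict of optimizer constructors you swap in. This is a minimal sketch, not the capstone code itself; it uses a toy MLP and a synthetic task so it runs in seconds, and it stands in SGD-with-momentum where Lion and Muon would go (both come from external packages, e.g. `lion-pytorch` and the Muon repo, and plug into the same `opt_factory` slot). The state-memory measurement is the real thing, though: it sums every tensor the optimizer keeps per parameter.

```python
import torch
import torch.nn as nn

def make_model(seed: int = 0) -> nn.Module:
    # Stand-in for the 4-layer, 384-dim GPT: a tiny MLP keeps the sketch fast.
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

def optimizer_state_bytes(opt: torch.optim.Optimizer) -> int:
    # Sum the bytes of all tensors the optimizer stores per parameter
    # (AdamW keeps two moments per weight; Lion one; Muon a momentum buffer).
    total = 0
    for state in opt.state.values():
        for v in state.values():
            if torch.is_tensor(v):
                total += v.numel() * v.element_size()
    return total

def run(opt_factory, steps: int = 50, seed: int = 0):
    model = make_model(seed)            # identical init for every run
    opt = opt_factory(model.parameters())
    torch.manual_seed(seed)             # identical data order for every run
    losses = []
    for _ in range(steps):
        x = torch.randn(32, 16)
        loss = nn.functional.mse_loss(model(x), x)  # toy autoencoding task
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses, optimizer_state_bytes(opt)

# AdamW ships with PyTorch; Lion and Muon would be imported from their own
# packages and dropped in here as extra entries with the same signature.
contenders = {
    "adamw": lambda p: torch.optim.AdamW(p, lr=1e-3),
    "sgd-momentum": lambda p: torch.optim.SGD(p, lr=1e-2, momentum=0.9),
}
for name, factory in contenders.items():
    losses, state_bytes = run(factory)
    print(f"{name}: loss {losses[0]:.3f} -> {losses[-1]:.3f}, "
          f"optimizer state {state_bytes} bytes")
```

Because every run shares the same seed for init and data, any gap in the loss curves is attributable to the optimizer alone; the state-bytes number is what you'll compare when arguing about memory overhead (AdamW's two moment buffers cost roughly twice SGD-momentum's one).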
Tools you'll use:
- PyTorch 2.x
- TinyStories dataset
- Muon optimizer (Jordan et al., 2024)
- wandb for tracking