Track 03 · Training & RLHF
How modern LLMs are actually trained, end to end.
The full pipeline behind a frontier model — backprop as a graph problem, the optimizer lineage from AdamW to Muon, FP8 training, the 4D parallelism mesh, SFT, LoRA/QLoRA, DPO, and the GRPO recipe that made DeepSeek-R1 work.
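Backprop really is a graph problem: the forward pass builds a DAG of operations, and the backward pass is one reverse-topological traversal of that DAG, accumulating gradients along each edge. A minimal scalar sketch (a hypothetical toy `Value` class, not any framework's API):

```python
class Value:
    """One node in the computation graph."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents      # edges of the DAG
        self._grad_fn = None         # pushes this node's grad to its parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def grad_fn():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._grad_fn = grad_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def grad_fn():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._grad_fn = grad_fn
        return out

    def backward(self):
        # Topologically sort the DAG, then replay it in reverse:
        # that reverse traversal *is* backprop.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            if v._grad_fn is not None:
                v._grad_fn()

x = Value(3.0)
y = x * x + x       # y = x^2 + x, so dy/dx = 2x + 1
y.backward()
print(x.grad)       # 7.0
```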
- Backprop, optimizers, LR schedules, FP8 training (the AdamW step is sketched right after this list)
- Data / tensor / pipeline parallelism, ZeRO + FSDP2
- SFT, LoRA/QLoRA, DPO, GRPO on reasoning tasks (GRPO's advantage trick is sketched below)
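The optimizer story starts from an update small enough to write out by hand. Here is a from-scratch AdamW step in NumPy (the hyperparameter defaults are illustrative); Muon's departure, by contrast, is to drop the per-coordinate second-moment scaling and orthogonalize the momentum update for matrix-shaped parameters.

```python
import numpy as np

def adamw_step(p, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for parameter tensor p at step t (1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad            # EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)   # Adam step
    p = p - lr * weight_decay * p                 # decoupled weight decay
    return p, m, v
```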
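The GRPO piece is equally compact at its core: sample a group of completions per prompt, score them, and normalize each reward against its own group, so no learned value function is needed. A sketch of the advantage computation (the tensor shape is an assumption):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar reward per sampled completion."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # z-score each completion within its group
```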
- Read a frontier-model tech report (DeepSeek-V3, Llama-3) and follow every word
- Fine-tune a 7B model on your own domain with QLoRA in a Colab notebook (setup sketched below)
- Reason about why a given training stack picks the parallelism config it does
- Implement DPO from scratch (loss sketched below) and explain when to use it over PPO
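For the QLoRA outcome, the setup is mostly configuration. A sketch assuming the Hugging Face transformers + peft + bitsandbytes stack; the model id and LoRA target modules are illustrative choices, not prescribed by this track:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",             # illustrative 7B base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only the LoRA adapters train
model.print_trainable_parameters()
```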
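And the DPO loss itself is a one-function affair once you have the summed log-probs of each chosen/rejected response under the policy and a frozen reference model. A minimal sketch:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs: (batch,) summed sequence log-probs. Returns mean loss."""
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model's preference.
    margin = (policy_chosen_logps - ref_chosen_logps) \
           - (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(beta * margin).mean()
```

No reward model, no rollouts, no value head: with a static preference dataset this is usually the simpler choice, while PPO-style RL earns its complexity when you need online exploration against a reward signal.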