Module capstone — build it
Train your first GPT — and prove Muon beats AdamW
A 30M-param GPT trained on TinyStories from scratch in 4 hours of free Colab. Plot AdamW vs Lion vs Muon and stake a position.
Intermediate · One weekend (~4-6 hours of training) · Free Colab T4
Pretrain a tiny GPT (4 layers, 384-dim, 30M params) on the TinyStories dataset. Train three identical runs with AdamW, Lion, and Muon. Same data, same seed, same compute. Measure loss curves, optimizer-state memory, and final perplexity. Settle the argument.
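The fair-comparison setup above can be sketched as a small harness: one model factory, one seeded data stream, and a dict of optimizer constructors you swap in. This is a minimal sketch, not the capstone code itself; it uses a toy MLP and a synthetic task so it runs in seconds, and it stands in SGD-with-momentum where Lion and Muon would go (both come from external packages, e.g. `lion-pytorch` and the Muon repo, and plug into the same `opt_factory` slot). The state-memory measurement is the real thing, though: it sums every tensor the optimizer keeps per parameter.

```python
import torch
import torch.nn as nn

def make_model(seed: int = 0) -> nn.Module:
    # Stand-in for the 4-layer, 384-dim GPT: a tiny MLP keeps the sketch fast.
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

def optimizer_state_bytes(opt: torch.optim.Optimizer) -> int:
    # Sum the bytes of all tensors the optimizer stores per parameter
    # (AdamW keeps two moments per weight; Lion one; Muon a momentum buffer).
    total = 0
    for state in opt.state.values():
        for v in state.values():
            if torch.is_tensor(v):
                total += v.numel() * v.element_size()
    return total

def run(opt_factory, steps: int = 50, seed: int = 0):
    model = make_model(seed)            # identical init for every run
    opt = opt_factory(model.parameters())
    torch.manual_seed(seed)             # identical data order for every run
    losses = []
    for _ in range(steps):
        x = torch.randn(32, 16)
        loss = nn.functional.mse_loss(model(x), x)  # toy autoencoding task
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses, optimizer_state_bytes(opt)

# AdamW ships with PyTorch; Lion and Muon would be imported from their own
# packages and dropped in here as extra entries with the same signature.
contenders = {
    "adamw": lambda p: torch.optim.AdamW(p, lr=1e-3),
    "sgd-momentum": lambda p: torch.optim.SGD(p, lr=1e-2, momentum=0.9),
}
for name, factory in contenders.items():
    losses, state_bytes = run(factory)
    print(f"{name}: loss {losses[0]:.3f} -> {losses[-1]:.3f}, "
          f"optimizer state {state_bytes} bytes")
```

Because every run shares the same seed for init and data, any gap in the loss curves is attributable to the optimizer alone; the state-bytes number is what you'll compare when arguing about memory overhead (AdamW's two moment buffers cost roughly twice SGD-momentum's one).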
Tools you'll use:
- PyTorch 2.x
- TinyStories dataset
- Muon optimizer (Jordan et al., 2024)
- wandb for tracking