The post-training paradigm is changing every six months. R1 in January 2025, o3 by end of year, agentic RL beginning to ship in 2026. This module is the live edge: what we know, what we suspect, what’s still open. Read it for direction. The specific algorithms here will be stale within 12 months; the structural ideas (verifiability, process reward, self-play, reward hacking) will not.