RL Frontier

The post-training paradigm is changing every six months: R1 in January 2025, o3 by end of year, agentic RL beginning to ship in 2026. This module is the live edge: what we know, what we suspect, what’s still open. Read these lessons for direction. The specific algorithms here will be stale within 12 months; the structural ideas (verifiability, process reward, self-play, reward hacking) will not.

5 lessons · ~74 min total

Module capstone — build it

Train an agentic RL loop end-to-end

The capstone for the whole RL track. Train a small model to use tools with RL — the skill behind every reasoning agent in production by 2027.

Advanced · 4-6 weekends · Single-GPU rental

Pick a verifiable agentic task (e.g. a math problem set where the agent calls a calculator tool, or a code task where it runs unit tests). Define the tool API. Sample multi-step rollouts with tool calls. Compute trajectory-level reward. Run GRPO with multi-turn advantage attribution. Demonstrate the model learned *when to call the tool*, not just *how* to use it. Write up the failure modes — reward hacking attempts, infinite tool-call loops, degenerate strategies.
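The "trajectory-level reward" and "multi-turn advantage attribution" steps above can be sketched as follows. This is a minimal illustration, not the method of verl or any other library. It assumes the simplest attribution scheme: each rollout's reward is normalized against its sampling group (the GRPO-style group-relative advantage), and that single advantage is broadcast to every model-generated token while tool-output tokens are masked to zero. The names `Turn`, `Trajectory`, `group_advantages`, and `token_advantages` are hypothetical, and using the population standard deviation is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Turn:
    role: str          # "model" for generated text, "tool" for tool output
    tokens: list[str]  # tokens belonging to this turn


@dataclass
class Trajectory:
    turns: list[Turn]
    reward: float      # trajectory-level reward, e.g. 1.0 if the final answer verifies


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(traj: Trajectory, advantage: float) -> list[float]:
    """Broadcast one trajectory advantage to model tokens; mask tool tokens to 0."""
    per_token = []
    for turn in traj.turns:
        # The policy did not generate tool output, so it gets no gradient signal.
        weight = advantage if turn.role == "model" else 0.0
        per_token.extend([weight] * len(turn.tokens))
    return per_token


# Usage: four rollouts sampled for the same prompt, two of which were verified.
group = [
    Trajectory([Turn("model", ["call", "calc"]), Turn("tool", ["84"]),
                Turn("model", ["84"])], reward=1.0),
    Trajectory([Turn("model", ["83"])], reward=0.0),
    Trajectory([Turn("model", ["call", "calc"]), Turn("tool", ["error"]),
                Turn("model", ["0"])], reward=0.0),
    Trajectory([Turn("model", ["84"])], reward=1.0),
]
advantages = group_advantages([t.reward for t in group])
per_token = [token_advantages(t, a) for t, a in zip(group, advantages)]
```

With rewards of 1, 0, 0, 1 the two verified rollouts receive an advantage of about +1 and the two failures about -1. If every rollout in a group gets the same reward, all advantages are zero and the group contributes no learning signal, which is one reason to pick tasks the model solves only some of the time.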

Tools you'll use

verl multi-turn support
OpenAI Gym-style wrapper around your tool
Same vLLM rollout stack
Recent agentic-RL papers on arXiv (2502.*)
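As a rough illustration of the Gym-style wrapper listed above, the sketch below wraps a calculator tool in an environment that follows the `reset()` / `step()` convention (observation, reward, terminated, truncated, info) without importing the Gym library itself. Everything here is an assumption made for illustration: the `CalculatorEnv` class, the `calculator` helper, the `(kind, payload)` action format, and the `max_tool_calls` cap, which is one simple guard against the infinite tool-call loops mentioned in the capstone brief.

```python
import ast
import operator

# Only these binary operators are allowed inside tool expressions.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expr: str):
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)


class CalculatorEnv:
    """Hypothetical single-question environment with a calculator tool."""

    def __init__(self, question: str, answer: float, max_tool_calls: int = 4):
        self.question = question
        self.answer = answer
        self.max_tool_calls = max_tool_calls
        self.calls = 0

    def reset(self):
        self.calls = 0
        return self.question, {}

    def step(self, action):
        kind, payload = action  # ("tool", "12*(3+4)") or ("answer", 84)
        if kind == "answer":
            # Verifiable reward: 1.0 only if the final answer matches.
            reward = 1.0 if abs(float(payload) - self.answer) < 1e-6 else 0.0
            return "", reward, True, False, {"tool_calls": self.calls}
        if kind != "tool":
            raise ValueError(f"unknown action kind: {kind}")
        self.calls += 1
        try:
            obs = str(calculator(payload))
        except (ValueError, SyntaxError, ZeroDivisionError) as err:
            obs = f"error: {err}"  # feed tool errors back as observations
        # Truncate the episode once the tool-call budget is used up.
        truncated = self.calls >= self.max_tool_calls
        return obs, 0.0, False, truncated, {"tool_calls": self.calls}
```

Returning the tool-call count in `info` makes it easy to log how often the policy calls the tool on problems it could have answered directly, which is the evidence needed for the "when to call the tool" claim.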