RL Foundations

PPO, GRPO, DPO, RLHF — none of it makes sense without the underlying machinery. This module is the math the rest of the RL track rides on: how to formalize “an agent making decisions”, how to compute a gradient through a sampled trajectory, why we subtract baselines, and what KL constraints buy you. Skip this only if you already know Sutton & Barto cold.

4 lessons, ~54 min total

MDPs & Bellman Equations (12 min)
Policy Gradient & REINFORCE (14 min)
Value Functions, Advantage & GAE (14 min)
Trust Regions, KL & Importance Sampling (14 min)