PPO, GRPO, DPO, RLHF: none of these methods makes sense without the underlying machinery. This module is the math the rest of the RL track rides on: how to formalize “an agent making decisions”, how to compute a gradient through a sampled trajectory, why we subtract baselines, and what KL constraints buy you. Skip it only if you already know Sutton & Barto cold.