Policy Gradient & REINFORCE
The first time someone explains policy gradient you nod along. The second time you realize you have no idea why it works. The third time the log-derivative trick clicks and the rest of RL falls into place. This lesson is that third time. You will leave it able to derive the policy gradient $\nabla_\theta J(\theta)$ from scratch, and able to explain why every modern LLM RL algorithm is “REINFORCE with three improvements”.
TL;DR
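- The goal: maximize $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where $\tau$ is a sampled trajectory and $\theta$ is the policy parameters.
- Problem: the trajectory itself depends on $\theta$ (because $\pi_\theta$ generated it), so the gradient looks intractable.
- Solution: the log-derivative trick. $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, where $p_\theta(\tau)$ is the probability of trajectory $\tau$ under $\pi_\theta$. The gradient lives entirely in $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
- REINFORCE: sample trajectories, compute $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$, take a gradient step. ~10 lines of code.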
- The problem with vanilla REINFORCE: variance is enormous. Subtracting a baseline (the next lesson) is what makes it practical. PPO, GRPO, and every modern method are baseline + clipping variants.
Why this matters
Every RL algorithm trained on LLMs in 2025–2026 — PPO, GRPO, RLOO, Reinforce++ — is structurally REINFORCE with (a) a baseline for variance reduction, (b) some form of importance correction to allow off-policy data, and (c) a KL penalty to stop the policy from drifting too far. If you understand REINFORCE, you understand 80% of the rest of the track.
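To make the three additions concrete, here is a minimal per-sample sketch in plain Python. It is loosely modeled on the PPO clipped objective with an added KL penalty and is not a copy of any library's implementation. The function name `ppo_style_loss`, its parameters, and the default values of `clip_eps` and `kl_coef` are assumptions chosen for illustration.

```python
import math

def ppo_style_loss(logp_new, logp_old, logp_ref, advantage,
                   clip_eps=0.2, kl_coef=0.01):
    """Per-sample loss: clipped, importance-weighted advantage plus a KL penalty.

    (a) The baseline is already inside `advantage` (reward minus baseline).
    """
    # (b) Importance correction: how much more likely the action is now
    #     than under the policy that generated the sample.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # (c) KL penalty: a single-sample estimate of KL(new || ref), which is
    #     only unbiased when the action was sampled from the new policy.
    kl_estimate = logp_new - logp_ref
    return -surrogate + kl_coef * kl_estimate

# On-policy case: ratio is 1 and the KL term is 0, so the loss is -advantage.
print(ppo_style_loss(-1.0, -1.0, -1.0, advantage=2.0))
# Off-policy case: the ratio is e (about 2.72), so the clip caps it at 1.2.
print(ppo_style_loss(0.0, -1.0, -1.0, advantage=1.0))
```

When the new and old policies coincide, the ratio equals 1 and the gradient of `ratio * advantage` with respect to the policy parameters is `advantage` times the gradient of the log-probability, which is the REINFORCE term.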
The concept
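Start with what you want: maximize expected return. The expectation is over trajectories sampled from your policy, where $p_\theta(\tau)$ is the probability of trajectory $\tau$ under $\pi_\theta$:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \int p_\theta(\tau)\, R(\tau)\, d\tau$$
Differentiating:
$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau$$
Here’s the trick. Multiply and divide by $p_\theta(\tau)$:
$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$
Now we can estimate the gradient from samples: collect $N$ trajectories, average $\nabla_\theta \log p_\theta(\tau)\, R(\tau)$ across them, and that’s our estimator.
Because $p_\theta(\tau) = \rho(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$ and only the policy depends on $\theta$:
$$\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
So the REINFORCE estimator is:
$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, G_t^{(i)}$$
where $G_t^{(i)}$ is the (possibly discounted) return from timestep $t$ in trajectory $i$.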
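The estimator can be checked numerically on a problem small enough to have a closed-form gradient. The sketch below assumes a one-step, three-armed bandit with a softmax policy over logits `theta` and a fixed reward per arm. The specific numbers and the helper names (`exact_gradient`, `reinforce_gradient`) are made up for illustration. With enough samples, the score-function average should land close to the analytic gradient.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def exact_gradient(theta, rewards):
    """Analytic dJ/dtheta_k for J = sum_a pi(a) * r(a) under a softmax policy."""
    pi = softmax(theta)
    j = sum(p * r for p, r in zip(pi, rewards))
    return [pi[k] * (rewards[k] - j) for k in range(len(theta))]

def reinforce_gradient(theta, rewards, n_samples, rng):
    """Monte Carlo average of grad log pi(a) * r(a) over sampled actions."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        a = rng.choices(range(len(theta)), weights=pi)[0]
        for k in range(len(theta)):
            # d/dtheta_k log softmax(theta)[a] = 1[k == a] - pi[k]
            score = (1.0 if k == a else 0.0) - pi[k]
            grad[k] += score * rewards[a] / n_samples
    return grad

theta = [0.1, -0.3, 0.5]    # hypothetical policy logits
rewards = [1.0, 0.0, 2.0]   # hypothetical reward for each arm
exact = exact_gradient(theta, rewards)
estimate = reinforce_gradient(theta, rewards, 50_000, random.Random(0))
max_error = max(abs(e - g) for e, g in zip(exact, estimate))
print("exact:   ", exact)
print("estimate:", estimate)
print("max abs error:", max_error)
```

Note that the sampler (`rng.choices`) is never differentiated. Only the log-probability of the action that was actually drawn enters the gradient.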
The variance problem
REINFORCE works but is brutally high-variance. Two improvements close 90% of the gap:
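- Subtract a baseline. Replace $G_t$ with $G_t - b(s_t)$. The estimator stays unbiased if $b$ doesn’t depend on the action. The standard choice is $b(s_t) = V^\pi(s_t)$, the value function. Then $G_t - V^\pi(s_t)$ is an estimate of the advantage $A^\pi(s_t, a_t)$ (next lesson).
- Use return-to-go, not full return. Actions can’t influence past rewards, so the past doesn’t belong in the gradient signal. Use $G_t = \sum_{t'} \gamma^{t'-t}\, r_{t'}$ where $t' \ge t$ only.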
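The return-to-go from the last bullet can be computed in a single backward pass over the rewards. A minimal sketch, assuming rewards arrive as a plain Python list; the helper name `returns_to_go` and the `gamma` parameter (the discount factor, 1.0 meaning undiscounted) are illustrative:

```python
def returns_to_go(rewards, gamma=1.0):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., built right to left."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(returns_to_go([1.0, 0.0, 2.0]))             # [3.0, 2.0, 2.0]
print(returns_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Each entry only sees rewards at or after its own timestep, so the reward of 1.0 earned at step 0 never credits the actions taken at steps 1 and 2.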
With both improvements, REINFORCE becomes REINFORCE-with-baseline, which is structurally identical to actor-critic and to the policy-gradient inner loop of PPO.
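The effect of a baseline is easiest to see empirically. The sketch below assumes a three-armed bandit with a uniform softmax policy and rewards that share a large constant offset (10, 11, 12, chosen only for illustration). It collects per-sample estimates of one gradient component with and without a baseline equal to the expected reward. The two sample means should agree up to noise, while the variance should drop sharply.

```python
import math
import random
import statistics

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_grads(theta, rewards, baseline, n_samples, rng):
    """Per-sample estimates of dJ/dtheta_0: score * (reward - baseline)."""
    pi = softmax(theta)
    out = []
    for _ in range(n_samples):
        a = rng.choices(range(len(theta)), weights=pi)[0]
        score = (1.0 if a == 0 else 0.0) - pi[0]
        out.append(score * (rewards[a] - baseline))
    return out

theta = [0.0, 0.0, 0.0]        # uniform policy
rewards = [10.0, 11.0, 12.0]   # hypothetical rewards with a large shared offset
value = sum(p * r for p, r in zip(softmax(theta), rewards))  # expected reward

plain = sample_grads(theta, rewards, 0.0, 20_000, random.Random(0))
based = sample_grads(theta, rewards, value, 20_000, random.Random(0))

print("mean without / with baseline:", statistics.mean(plain), statistics.mean(based))
print("variance without / with baseline:", statistics.variance(plain), statistics.variance(based))
```

The baseline here is a constant because a bandit has a single state. In a multi-step problem the same role is played by a state-dependent $b(s_t)$.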
Mental model
Sample → score → backprop through the log-prob → repeat. That’s the whole loop.
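The loop can be written out end to end on a toy problem. This is a minimal sketch, assuming a three-armed bandit in which only the last arm pays. The gradient of the softmax log-probability is written by hand in place of autograd, and `train_bandit` and its hyperparameters are hypothetical choices rather than recommended settings.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_bandit(rewards, steps=2000, batch=16, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        pi = softmax(theta)
        # Sample: draw a batch of actions from the current policy.
        actions = rng.choices(range(len(theta)), weights=pi, k=batch)
        grad = [0.0] * len(theta)
        for a in actions:
            # Score: look up the reward of the sampled action.
            r = rewards[a]
            # "Backprop" through the log-prob: the gradient of
            # log softmax(theta)[a] with respect to theta_k is 1[k == a] - pi[k].
            for k in range(len(theta)):
                grad[k] += ((1.0 if k == a else 0.0) - pi[k]) * r / batch
        # Repeat: take one gradient ascent step on expected reward.
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)

final_policy = train_bandit([0.0, 0.0, 1.0])
print(final_policy)  # most of the probability mass should sit on the last arm
```

No baseline is used here, so this is vanilla REINFORCE. It is stable on this problem only because the rewards are deterministic and the action space is tiny.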
Key takeaways
- The log-derivative trick turns an intractable gradient through a sampler into a Monte Carlo expectation. This is the enabling move in RL.
- REINFORCE is the parent algorithm for PPO, GRPO, A2C, and every LLM RL method. The improvements are about variance and stability, not the basic structure.
- Variance is the practical bottleneck. Baselines + return-to-go + clipping + KL constraints are all variance/stability tools.
- It works for any sampler. You don’t need a differentiable environment — the environment can be a black box. This is why RL is the natural fit for LLMs with verifiers.
- One line of code, one big idea: loss = -(logprob * advantage).mean(). Everything downstream is a refinement of this.
Go deeper
- Paper: Williams (1992), Simple Statistical Gradient-Following Algorithms (REINFORCE). The original paper. Surprisingly readable; introduces the log-derivative trick.
- Blog: Lilian Weng, Policy Gradient Algorithms. Best single survey. Read sections 1-2 for foundations, then come back to skim the rest.
- Docs: OpenAI Spinning Up, Part 2: Kinds of RL Algorithms. Engineer-friendly walkthrough of policy gradient. Includes runnable code.
- Video: Andrej Karpathy, Deep Reinforcement Learning: Pong from Pixels. The classic blog/talk that made REINFORCE click for a generation of engineers. 130 lines of NumPy.
- Blog: Karpathy, Deep Reinforcement Learning: Pong from Pixels (blog post). The reading companion to the talk. Save and re-read.
- Video: Karpathy, Deep Dive on RL Training (2025). Modern walkthrough including LLM-specific framing. ~90 minutes.
TL;DR
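- Goal: $\max_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.
- Log-derivative trick: $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$.
- REINFORCE estimator: $\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, G_t^{(i)}$.
- Variance reduction: subtract baseline $b(s_t)$; use return-to-go.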
Why this matters
REINFORCE is the parent of every LLM RL algorithm. PPO, GRPO, RLOO — all are baselined + clipped + KL-constrained REINFORCE.
Concrete walkthrough
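Derivation (memorize this):
$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]$$
With baseline + return-to-go:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\Big]$$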
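The baseline can be subtracted without biasing the estimator because its contribution vanishes in expectation. A short derivation, assuming a discrete action space (the continuous case replaces the sum with an integral) and a baseline $b(s_t)$ that depends only on the state:

```latex
\begin{aligned}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
  \big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
  &= b(s_t) \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t) \\
  &= b(s_t) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t) \\
  &= b(s_t)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
   = b(s_t)\, \nabla_\theta 1 = 0
\end{aligned}
```

The second line is the log-derivative trick run in reverse. The argument fails if $b$ depends on $a_t$, because $b$ could then no longer be pulled outside the sum.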
Python (the whole algorithm):
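import torch
# trajectory: list of (s, a, r) from one rollout; pi.log_prob(a, s) is assumed to return the scalar tensor log pi_theta(a|s)
logprobs = torch.stack([pi.log_prob(a, s) for s, a, _ in trajectory])
rewards = torch.tensor([r for _, _, r in trajectory], dtype=torch.float32)
# Undiscounted return-to-go G_t = r_t + r_{t+1} + ... via a reversed cumulative sum
returns = rewards.flip(0).cumsum(0).flip(0)
loss = -(logprobs * returns).mean()  # gradient descent on this loss is gradient ascent on J
opt.zero_grad(); loss.backward(); opt.step()
That’s the entire algorithm. Every improvement (baseline, GAE, PPO clip, GRPO advantage) is a substitution into returns.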
Key takeaways
- Log-derivative trick is the enabling move.
- loss = -(logprob * advantage).mean() is the whole shape.
- Variance reduction matters more than the algorithm name.
- Trajectories are samples — RL doesn’t need a differentiable environment.
Go deeper
- Paper: Williams (1992), REINFORCE. Original paper.
- Blog: Lilian Weng, Policy Gradient Algorithms. Survey of the family.
- Docs: OpenAI Spinning Up, Part 2. Runnable code reference.