Policy Gradient & REINFORCE

The first time someone explains policy gradient you nod along. The second time you realize you have no idea why it works. The third time the log-derivative trick clicks and the rest of RL falls into place. This lesson is that third time. You will leave it able to derive $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot R]$ from scratch, and able to explain why every modern LLM RL algorithm is “REINFORCE with three improvements”.

TL;DR

  • The goal: maximize $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ where $\tau$ is a sampled trajectory and $\theta$ is the policy parameters.
  • Problem: the trajectory itself depends on $\theta$ (because $\pi_\theta$ generated it), so the gradient looks intractable.
  • Solution: the log-derivative trick. $\nabla_\theta J = \mathbb{E}_\tau[\nabla_\theta \log p_\theta(\tau) \cdot R(\tau)]$. The gradient lives entirely in $\log \pi_\theta$.
  • REINFORCE: sample trajectories, compute $\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$, take a gradient step. ~10 lines of code.
  • The problem with vanilla REINFORCE: variance is enormous. Subtracting a baseline $b(s)$ (the next lesson) is what makes it practical. PPO, GRPO, and every modern method are baseline + clipping variants.

Why this matters

Every RL algorithm used to train LLMs in 2025–2026 (PPO, GRPO, RLOO, REINFORCE++) is structurally REINFORCE with (a) a baseline for variance reduction, (b) some form of importance correction to allow off-policy data, and (c) a KL penalty to stop the policy from drifting too far. If you understand REINFORCE, you understand 80% of the rest of the track.

The concept

Start with what you want: maximize expected return. The expectation is over trajectories sampled from your policy:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_\tau p_\theta(\tau) R(\tau)$$

Differentiating:

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta p_\theta(\tau) R(\tau)$$

Here’s the trick. Multiply and divide by $p_\theta(\tau)$, then use $\nabla_\theta p_\theta(\tau) / p_\theta(\tau) = \nabla_\theta \log p_\theta(\tau)$:

$$= \sum_\tau p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} R(\tau) = \sum_\tau p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) \cdot R(\tau) = \mathbb{E}_\tau[\nabla_\theta \log p_\theta(\tau) \cdot R(\tau)]$$

Now we can estimate the gradient from samples: collect trajectories and average $\nabla_\theta \log p_\theta(\tau) \cdot R(\tau)$ across them. That average is our estimator.
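To make the identity concrete, here is a minimal sketch that checks it numerically on a one-step problem. It assumes a softmax distribution over three outcomes with made-up logits and rewards (none of these numbers come from the lesson), and compares a finite-difference gradient of $J$ with the Monte Carlo average of $\nabla_\theta \log p_\theta \cdot R$.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_i p_theta(i) * R(i), computed exactly."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def finite_difference_grad(theta, rewards, eps=1e-6):
    """Reference gradient of J via central differences."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * eps))
    return grad


def score_function_grad(theta, rewards, n_samples, rng):
    """Monte Carlo average of grad log p_theta(i) * R(i) over sampled outcomes i."""
    p = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for _ in range(n_samples):
        i = rng.choices(range(n), weights=p)[0]
        for j in range(n):
            # For a softmax, d/d theta_j of log p(i) is 1[j == i] - p[j].
            grad[j] += ((1.0 if j == i else 0.0) - p[j]) * rewards[i] / n_samples
    return grad


theta = [0.2, -0.5, 1.0]    # illustrative logits (assumption)
rewards = [1.0, 0.0, 3.0]   # illustrative reward per outcome (assumption)
reference = finite_difference_grad(theta, rewards)
estimate = score_function_grad(theta, rewards, 100_000, random.Random(0))
print("finite differences:", [round(g, 3) for g in reference])
print("score-function MC: ", [round(g, 3) for g in estimate])
```

The two printed vectors should agree up to sampling noise. Note that the sampler is never differentiated; only the log-probability is.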

Because $p_\theta(\tau) = p(s_0) \prod_t \pi_\theta(a_t|s_t) P(s_{t+1}|s_t,a_t)$ and only the policy depends on $\theta$:

$$\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)$$
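Spelled out as a worked expansion with the same symbols as the factorization: taking the log turns the product into a sum, and the initial-state and transition terms have zero gradient because they do not depend on $\theta$.

```latex
\begin{aligned}
\log p_\theta(\tau) &= \log p(s_0) + \sum_t \log \pi_\theta(a_t|s_t) + \sum_t \log P(s_{t+1}|s_t,a_t) \\
\nabla_\theta \log p_\theta(\tau) &= \underbrace{\nabla_\theta \log p(s_0)}_{=\,0} + \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) + \underbrace{\sum_t \nabla_\theta \log P(s_{t+1}|s_t,a_t)}_{=\,0}
\end{aligned}
```

This is why the dynamics $P$ never need to be known or differentiated.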

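So the REINFORCE estimator is:

$$\hat{g} = \frac{1}{N} \sum_{i=1}^N \sum_t \nabla_\theta \log \pi_\theta(a_{t,i}|s_{t,i}) \cdot G_{t,i}$$

where $G_{t,i}$ is the (possibly discounted) return from timestep $t$ in trajectory $i$. Substituting the sum directly would weight every term by the full return $R(\tau_i)$; using $G_{t,i}$ instead is the return-to-go refinement justified in the variance section below.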
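As a small illustration, the return from each timestep can be computed in one backward pass over the rewards. This is a sketch: the function name `returns_to_go` and the discount parameter `gamma` are illustrative choices, with `gamma=1.0` giving the undiscounted sum.

```python
def returns_to_go(rewards, gamma=1.0):
    """Compute G_t = r_t + gamma * G_{t+1} for every t, scanning backward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(returns_to_go([1.0, 0.0, 2.0]))             # [3.0, 2.0, 2.0]
print(returns_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward scan is linear in the trajectory length, whereas re-summing the tail for every $t$ is quadratic.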

The variance problem

REINFORCE works but is brutally high-variance. Two improvements remove most of that variance:

  1. Subtract a baseline. $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot (R - b(s))]$. The estimator stays unbiased as long as $b$ doesn’t depend on the action. The standard choice is $b(s) = V^\pi(s)$, the value function; then $R - b$ is an estimate of the advantage (next lesson).
  2. Use return-to-go, not full return. Actions can’t influence past rewards, so the past doesn’t belong in the gradient signal: $\hat{g}_t \propto \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$, where $G_t = \sum_{k=t}^T r_k$ counts only rewards from step $t$ onward.
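The baseline claim can be checked exactly on a tiny example. The sketch below assumes a three-armed bandit with a uniform softmax policy and made-up rewards, and enumerates all actions to get the exact mean and variance of the single-sample estimator with and without a baseline; `estimator_moments` is an illustrative helper, not part of any library.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_moments(theta, rewards, baseline):
    """Exact per-parameter mean and variance of (one_hot(a) - p) * (R(a) - baseline),
    obtained by enumerating every action a with its probability p[a]."""
    p = softmax(theta)
    n = len(theta)
    mean = [0.0] * n
    second = [0.0] * n
    for a in range(n):
        for j in range(n):
            x = ((1.0 if j == a else 0.0) - p[j]) * (rewards[a] - baseline)
            mean[j] += p[a] * x
            second[j] += p[a] * x * x
    var = [s - m * m for s, m in zip(second, mean)]
    return mean, var


theta = [0.0, 0.0, 0.0]        # uniform policy (assumption)
rewards = [10.0, 11.0, 12.0]   # large shared offset, small differences (assumption)
value = sum(p * r for p, r in zip(softmax(theta), rewards))  # expected reward under the policy

mean_raw, var_raw = estimator_moments(theta, rewards, baseline=0.0)
mean_base, var_base = estimator_moments(theta, rewards, baseline=value)
print("mean without / with baseline:", mean_raw, mean_base)
print("variance without / with baseline:", var_raw, var_base)
```

The means match, because the score has zero expectation and the baseline term averages out, while the variance drops sharply once the shared reward offset is subtracted.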

With both improvements, REINFORCE becomes REINFORCE-with-baseline, which is structurally very close to actor-critic (which additionally bootstraps from the learned value function) and to the policy-gradient inner loop of PPO.

Mental model

Sample → score → backprop through the log-prob → repeat. That’s the whole loop.
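The loop can be sketched end to end on a toy problem. This version assumes a three-armed bandit with Gaussian reward noise, a softmax policy over logits, and a running-average baseline; all names and hyperparameters are illustrative. The gradient of the log-prob is written analytically in place of backprop so the example needs only the standard library.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def train_bandit(arm_means, n_updates=500, batch_size=16, lr=0.2, seed=0):
    """REINFORCE with a running-average baseline on a noisy multi-armed bandit."""
    rng = random.Random(seed)
    n = len(arm_means)
    theta = [0.0] * n
    baseline = 0.0
    for _ in range(n_updates):
        p = softmax(theta)
        grad = [0.0] * n
        batch_rewards = []
        for _ in range(batch_size):
            a = rng.choices(range(n), weights=p)[0]   # sample an action from the policy
            r = rng.gauss(arm_means[a], 0.1)          # score it with a noisy reward
            batch_rewards.append(r)
            for j in range(n):
                # grad of log softmax w.r.t. theta_j is 1[j == a] - p[j]
                grad[j] += ((1.0 if j == a else 0.0) - p[j]) * (r - baseline) / batch_size
        theta = [t + lr * g for t, g in zip(theta, grad)]  # gradient ascent on J
        # Update the baseline only after the gradient, so it never depends on this batch's actions.
        baseline += 0.1 * (sum(batch_rewards) / batch_size - baseline)
    return softmax(theta)


probs = train_bandit([0.0, 0.5, 1.0])
print("final action probabilities:", [round(p, 3) for p in probs])
```

Most of the probability mass should end up on the last arm, which has the highest mean reward. The environment is only ever sampled, never differentiated.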

Key takeaways

  1. The log-derivative trick turns an intractable gradient through a sampler into a Monte Carlo expectation. This is the enabling move in RL.
  2. REINFORCE is the parent algorithm for PPO, GRPO, A2C, and every LLM RL method. The improvements are about variance and stability, not the basic structure.
  3. Variance is the practical bottleneck. Baselines + return-to-go + clipping + KL constraints are all variance/stability tools.
  4. It works for any sampler. You don’t need a differentiable environment — the environment can be a black box. This is why RL is the natural fit for LLMs with verifiers.
  5. One line of code, one big idea. loss = -(logprob * advantage).mean(). Everything downstream is a refinement of this.

TL;DR

  • Goal: $\max_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.
  • Log-derivative trick: $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log p_\theta(\tau) \cdot R]$.
  • REINFORCE estimator: $\hat{g} = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$.
  • Variance reduction: subtract baseline $b(s) \approx V^\pi(s)$; use return-to-go.

Why this matters

REINFORCE is the parent of every LLM RL algorithm. PPO, GRPO, RLOO: all are REINFORCE plus a baseline, usually with clipping and a KL constraint.

Concrete walkthrough

Derivation (memorize this):

$$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\big[\nabla_\theta \log p_\theta(\tau) \cdot R\big] = \mathbb{E}_\tau\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R\Big]$$

With baseline + return-to-go:

$$\hat{g} = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))$$

Python (the whole algorithm):

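    import torch

    # Assumed to exist: trajectory, a list of (s, a, r); pi, whose log_prob(a, s)
    # returns log pi_theta(a|s) as a scalar tensor; opt, a torch optimizer over pi's parameters.
    logprobs = torch.stack([pi.log_prob(a, s) for s, a, _ in trajectory])
    # Undiscounted return-to-go: G_t = sum of rewards from step t onward.
    returns = torch.tensor([sum(r for _, _, r in trajectory[t:]) for t in range(len(trajectory))], dtype=torch.float32)
    loss = -(logprobs * returns).mean()  # minimizing this loss ascends J
    opt.zero_grad()                      # clear gradients left over from the previous step
    loss.backward()
    opt.step()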

That’s the entire algorithm. Baselines, GAE, and GRPO advantages are substitutions into returns; the PPO clip instead replaces the log-prob factor with a clipped probability ratio.
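To illustrate the "substitution" view, here is a hedged sketch of two ways to turn a group of sequence-level rewards (several sampled completions of the same prompt) into the weights that multiply the log-probs: a leave-one-out baseline in the style of RLOO, and group mean/std normalization in the style of GRPO. The function names are illustrative, the population standard deviation is one possible choice (implementations differ), and the rewards and log-probs are made-up numbers.

```python
import math


def rloo_advantages(rewards):
    """Leave-one-out baseline: compare each sample to the mean of the other samples."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style weights: subtract the group mean, divide by the group std (population std here)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def surrogate_loss(logprobs, weights):
    """Plain-float version of loss = -(logprob * advantage).mean()."""
    return -sum(lp * w for lp, w in zip(logprobs, weights)) / len(logprobs)


rewards = [1.0, 2.0, 3.0]       # one scalar reward per sampled completion (assumption)
logprobs = [-1.2, -0.7, -2.3]   # sequence log-probs under the current policy (assumption)

for name, weights in [
    ("raw returns", rewards),
    ("leave-one-out", rloo_advantages(rewards)),
    ("group-normalized", group_normalized_advantages(rewards)),
]:
    print(name, [round(w, 3) for w in weights], round(surrogate_loss(logprobs, weights), 3))
```

Only the weights change between variants; the loss expression is the same in all three cases.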

Key takeaways

  1. Log-derivative trick is the enabling move.
  2. loss = -(logprob * advantage).mean() is the whole shape.
  3. Variance reduction matters more than the algorithm name.
  4. Trajectories are samples — RL doesn’t need a differentiable environment.
