Value Functions, Advantage & GAE
If REINFORCE is “scale gradient by reward”, the rest of policy-gradient RL is “scale gradient by advantage, which we estimate as cleverly as we can”. This lesson covers the three quantities that show up in every paper, $V^\pi(s)$, $Q^\pi(s, a)$, and $A^\pi(s, a)$, and the GAE trick that turns “advantage” from a noun into a tunable estimator with one knob ($\lambda$) controlling bias vs variance.
TL;DR
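- Value $V^\pi(s)$: expected return from state $s$ under policy $\pi$. The baseline you should subtract.
- Q-value $Q^\pi(s, a)$: expected return if you take action $a$ first, then follow $\pi$.
- Advantage $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$: how much better action $a$ is than average at $s$. Mean-zero under $\pi$. This is what policy gradient should multiply by.
- TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$: a 1-step, low-variance, biased estimate of advantage.
- GAE($\lambda$): $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$. Single knob between $\lambda = 0$ (low variance, high bias TD) and $\lambda = 1$ (high variance, unbiased Monte Carlo). Default $\lambda \approx 0.95$.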
Why this matters
Three reasons this lesson pays for itself ten times over:
- PPO is REINFORCE with GAE. If you don’t understand GAE, the PPO paper reads like ritual.
- GRPO replaces GAE with group-relative normalization. Knowing what GAE does makes the GRPO simplification obvious.
- Reward sparsity (one reward at trajectory end) is the common case in LLM RL. Value functions are how you propagate sparse final reward back through the trajectory.
The concept
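Define, with discounted return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid s_t = s\right]$$

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid s_t = s,\ a_t = a\right]$$

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

Substituting advantage into REINFORCE gives the actor-critic estimator:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right]$$

The estimator is still unbiased (because $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, V^\pi(s)\right] = 0$; that is the whole point of subtracting the baseline) but typically has much lower variance than the raw-return version.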
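One way to see why a state-dependent baseline does not bias the gradient is the short derivation below. It is a sketch for a discrete action space, and $b(s)$ is notation introduced here for any baseline that does not depend on the action (for instance $V^\pi(s)$).

$$
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right]
&= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
&= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0
\end{aligned}
$$

The baseline term vanishes in expectation, so subtracting it changes only the variance of the estimator, not its mean.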
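Now how do we estimate $A^\pi(s_t, a_t)$? Two extremes:
- Monte Carlo: $\hat{A}_t = G_t - V(s_t)$. Unbiased but high variance (you’re propagating noise from every future timestep).
- TD(0): $\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Low variance but biased (you’re trusting the imperfect $V$).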
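Generalized Advantage Estimation (GAE) smoothly interpolates:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

At $\lambda = 0$ this collapses to TD(0). At $\lambda = 1$ it collapses to Monte Carlo. In practice $\lambda \approx 0.95$ is standard. Bias-variance dial with a single knob.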
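The GAE sum is usually computed with a backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$. The sketch below is a minimal pure-Python version for a single finished episode. The function names, the made-up critic values, and the choice $\gamma = 1$ are assumptions for illustration, not taken from any particular library.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has len(rewards) + 1 entries; the last one is the value of
    the state after the final step (0.0 for a terminal state).
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


def gae(rewards, values, gamma, lam):
    """A_t = sum_l (gamma * lam)^l * delta_{t+l}, via a backward pass."""
    deltas = td_errors(rewards, values, gamma)
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # A_t = delta_t + gamma * lam * A_{t+1}
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv


def monte_carlo_advantages(rewards, values, gamma):
    """G_t - V(s_t), with G_t the discounted return-to-go."""
    out, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g - values[t]
    return out


# Hypothetical 4-step episode with one sparse reward at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
# Made-up critic outputs; the last entry is the terminal state.
values = [0.5, 0.25, 0.5, 0.75, 0.0]

print(gae(rewards, values, gamma=1.0, lam=0.0))   # matches td_errors(...)
print(gae(rewards, values, gamma=1.0, lam=1.0))   # matches monte_carlo_advantages(...)
print(gae(rewards, values, gamma=1.0, lam=0.95))  # somewhere in between
```

Running the recursion backwards keeps the cost linear in episode length. For truncated (non-terminal) rollouts the last entry of `values` would be the critic's bootstrap estimate rather than 0.0.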
Mental model
Advantage = how much better than average. GAE = how to compute it cheaply from samples + a learned value function.
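To make “better than average” concrete, here is a small worked example for a single state with three actions. The probabilities and Q-values are made-up numbers chosen for illustration, and `advantages` is a hypothetical helper name.

```python
def advantages(policy_probs, q_values):
    """Return (V, A) for one state, given pi(a|s) and Q(s, a) per action."""
    # V(s) is the policy-weighted average of Q(s, a).
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    # A(s, a) = Q(s, a) - V(s): how much better each action is than average.
    return v, [q - v for q in q_values]


# Hypothetical numbers for a single state with three actions.
pi = [0.5, 0.25, 0.25]
q = [1.0, 2.0, 4.0]

v, adv = advantages(pi, q)
# Expected advantage under the policy; this is zero by construction.
mean_adv = sum(p * a for p, a in zip(pi, adv))
print(v, adv, mean_adv)
```

Here the most likely action has negative advantage: it is worse than the policy's own average, so a policy-gradient step would shift probability away from it and toward the third action.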
The critic
Where does $V$ come from? You learn it. Run a second network (or a second head on the same backbone) that takes $s_t$ and predicts the return; train it to regress on the actually-observed returns:

$$L_V(\phi) = \mathbb{E}_t\left[\left(V_\phi(s_t) - G_t\right)^2\right]$$

This is the actor-critic architecture: the actor is $\pi_\theta(a \mid s)$, the critic is $V_\phi(s)$. Standard PPO has both. GRPO famously removes the critic (see the next lesson on group-relative methods).
In LLM RL, the critic is usually a value head bolted to the LM trunk or a separate LM-sized value model. A shared-trunk head adds only a small projection's worth of parameters; a separate value model roughly doubles memory and forward-pass FLOPs. The “doubled memory + instability” complaint about PPO is mostly about this critic.
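The critic's regression targets can be sketched as follows: compute the discounted return-to-go for each step, then take the mean squared error against the critic's predictions. The episode, the discount of 0.5, the predictions, and the helper names are all hypothetical, chosen so the numbers are easy to follow. A real critic would be a network trained by gradient descent on this loss.

```python
def returns_to_go(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    out, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out


def critic_loss(predicted_values, targets):
    """Mean squared error between V(s_t) predictions and returns G_t."""
    n = len(targets)
    return sum((v - g) ** 2 for v, g in zip(predicted_values, targets)) / n


# Sparse reward: a single reward of 1.0 at the end of a 4-step episode.
rewards = [0.0, 0.0, 0.0, 1.0]
targets = returns_to_go(rewards, gamma=0.5)

# Hypothetical critic predictions for the four visited states.
predictions = [0.0, 0.25, 0.5, 1.0]
loss = critic_loss(predictions, targets)
print(targets, loss)
```

Every earlier step gets a nonzero target even though only the last step is rewarded, which is how a trained critic carries a sparse final reward back to earlier states.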
Key takeaways
- Advantage is the right thing to multiply $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ by: the estimator stays unbiased and has lower variance than using the raw return.
- Three ways to estimate advantage: Monte Carlo (unbiased, noisy), TD(0) (biased, smooth), GAE (dial between them with $\lambda$).
- GAE with $\lambda \approx 0.95$ and $\gamma$ at or near 1 is the standard PPO config for LLM RL.
- The critic is a learned $V_\phi(s)$. It costs memory and adds an instability source, which is exactly what GRPO removes.
- In sparse-reward LLM settings, the critic does the work of propagating final reward to earlier tokens.
Go deeper
- Paper: Schulman et al., High-Dimensional Continuous Control Using Generalized Advantage Estimation. The GAE paper. Section 3 is the derivation; read it once. Same Schulman as PPO.
- Paper: Sutton et al., Policy Gradient Methods for Reinforcement Learning with Function Approximation. Original actor-critic. Theorem 1 is the policy gradient theorem you keep seeing cited.
- Book: Sutton & Barto, Reinforcement Learning: An Introduction (Ch 6, 12, 13). Ch 6 (TD learning), 12 (eligibility traces and TD($\lambda$), which is the family GAE belongs to), 13 (policy gradient).
- Blog: Daniel Seita, Notes on the GAE paper. Excellent annotated walkthrough of the derivation.
- Docs: OpenAI Spinning Up, Vanilla Policy Gradient. The pseudocode section shows exactly how GAE plugs into a PG loop.
TL;DR
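- $V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$, $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$, $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.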
- Policy gradient should use advantage as the scaling factor.
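- GAE($\lambda$): $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$, $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
- Standard config: $\gamma$ at or near 1, $\lambda \approx 0.95$.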
Why this matters
GAE is the PPO default. GRPO removes it. Knowing what it does is prerequisite to knowing why each is preferred.
Concrete walkthrough
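Three advantage estimators and where each sits on the bias/variance trade-off:
| Estimator | Formula | Bias | Variance |
|---|---|---|---|
| Monte Carlo | $\hat{A}_t = G_t - V(s_t)$ | Unbiased | High |
| TD(0) | $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ | Biased by error in $V$ | Low |
| GAE($\lambda$) | $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$ | Interpolates | Interpolates |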
Critic loss:

$$L_V(\phi) = \mathbb{E}_t\left[\left(V_\phi(s_t) - G_t\right)^2\right]$$
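Combined actor-critic loss:

$$L(\theta, \phi) = -\mathbb{E}_t\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right] + c_1\, L_V(\phi) - c_2\, \mathbb{E}_t\left[H\left[\pi_\theta(\cdot \mid s_t)\right]\right]$$

with $c_1$ weighting the critic loss and $c_2$ weighting the entropy bonus.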
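As a rough illustration of how the three terms combine, the sketch below evaluates a loss of the form policy term + $c_1$ × value term − $c_2$ × entropy on plain Python floats. The function names, the two-step batch, and the coefficient values 0.5 and 0.01 are assumptions made for this example. In real training the same expression would be built in an autodiff framework, with the advantages treated as constants.

```python
import math


def entropy(probs):
    """Shannon entropy of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_critic_loss(log_probs, advantages, values, returns, action_dists,
                      c1, c2):
    """L = L_policy + c1 * L_value - c2 * H, averaged over timesteps."""
    n = len(log_probs)
    # Policy term: minimizing it raises log pi(a_t|s_t) where A_t > 0.
    policy_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / n
    # Value term: squared error between critic predictions and returns.
    value_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    # Entropy bonus: subtracted, so higher entropy lowers the loss.
    mean_entropy = sum(entropy(d) for d in action_dists) / n
    return policy_loss + c1 * value_loss - c2 * mean_entropy


# Hypothetical two-step batch with a uniform two-action policy.
dists = [[0.5, 0.5], [0.5, 0.5]]
log_probs = [math.log(0.5), math.log(0.5)]
loss = actor_critic_loss(
    log_probs=log_probs,
    advantages=[1.0, -1.0],
    values=[0.5, 0.5],
    returns=[1.0, 0.0],
    action_dists=dists,
    c1=0.5,   # illustrative value-loss weight, not a recommended setting
    c2=0.01,  # illustrative entropy weight, not a recommended setting
)
print(loss)
```

The entropy term enters with a minus sign because the loss is minimized while entropy is something to encourage.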
Key takeaways
- Advantage = Q - V. Mean-zero. Use this in PG.
- GAE($\lambda$) interpolates Monte Carlo and TD(0).
- $\lambda \approx 0.95$ is the PPO default.
- Critic cost is the reason GRPO is popular.
Go deeper
- Paper: Schulman et al., GAE. Section 3.
- Docs: Spinning Up, VPG. Code structure.