Value Functions, Advantage & GAE

If REINFORCE is “scale gradient by reward”, the rest of policy-gradient RL is “scale gradient by advantage, which we estimate as cleverly as we can”. This lesson covers the three quantities that show up in every paper, $V^\pi(s)$, $Q^\pi(s,a)$, and $A^\pi(s,a)$, and the GAE trick that turns “advantage” from a noun into a tunable estimator with one knob ($\lambda$) controlling bias vs variance.

TL;DR

  • Value $V^\pi(s)$: expected return from $s$ under policy $\pi$. The baseline you should subtract.
  • Q-value $Q^\pi(s,a)$: expected return if you take action $a$ first, then follow $\pi$.
  • Advantage $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$: how much better action $a$ is than average at $s$. Mean-zero under $\pi$. This is what policy gradient should multiply $\nabla \log \pi$ by.
  • TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$: a 1-step, low-variance, biased estimate of advantage.
  • GAE($\lambda$): $\hat{A}^{GAE}_t = \sum_{k=0}^\infty (\gamma\lambda)^k \delta_{t+k}$. Single knob between $\lambda = 0$ (low-variance, high-bias TD) and $\lambda = 1$ (high-variance, unbiased Monte Carlo). Default $\lambda \approx 0.95$.

Why this matters

Three reasons this lesson pays for itself ten times over:

  1. PPO is, to a first approximation, REINFORCE with GAE advantages and a clipped objective. If you don’t understand GAE, the PPO paper reads like ritual.
  2. GRPO replaces GAE with group-relative normalization. Knowing what GAE does makes the GRPO simplification obvious.
  3. Reward sparsity (one reward at trajectory end) is the common case in LLM RL. Value functions are how you propagate sparse final reward back through the trajectory.

The concept

Define, with $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$ the discounted return from step $t$:

  • $V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$
  • $Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$
  • $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$
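To make the three definitions concrete, here is a minimal sketch for a single state with three actions. The policy probabilities and Q-values are made-up numbers, not taken from any real task. It uses the identity $V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s,a)]$ and checks that the advantage averages to zero under the policy.

```python
import numpy as np

# Hypothetical numbers for one state s with three actions.
pi = np.array([0.5, 0.3, 0.2])   # pi(a | s), sums to 1
q = np.array([1.0, 2.0, 4.0])    # Q^pi(s, a) for each action

v = float(np.dot(pi, q))           # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
adv = q - v                        # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
mean_adv = float(np.dot(pi, adv))  # E_{a ~ pi}[A^pi(s, a)], zero up to rounding

print("V =", v)
print("A =", adv)
print("E_pi[A] =", mean_adv)
```

With these numbers the value is 1.9, so the first action is worse than average (advantage -0.9) and the third is much better (advantage 2.1).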

Substituting advantage into REINFORCE gives the actor-critic estimator:

$$\hat{g} = \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \hat{A}_t\Big]$$

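With the true advantage $A^\pi$ in place of $\hat{A}_t$, the estimator is still unbiased: subtracting a state-dependent baseline such as $V^\pi(s_t)$ leaves the expected gradient unchanged, because $\mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$. What changes is the variance, which is much lower than with the raw return.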
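The unbiasedness claim can be checked numerically. The sketch below assumes a softmax policy over three actions at one state, with made-up logits and Q-values. It computes the exact expected gradient twice, once weighted by $Q$ and once by $A = Q - V$. It only checks the bias side of the argument; it says nothing about variance.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def expected_gradient(logits, weights):
    """Exact E_{a ~ pi}[grad_logits log pi(a) * weights[a]] for a softmax policy."""
    pi = softmax(logits)
    # For a softmax policy, grad_logits log pi(a) = one_hot(a) - pi.
    score = np.eye(len(pi)) - pi   # row a is the score vector of action a
    return (pi[:, None] * score * weights[:, None]).sum(axis=0)

logits = np.array([0.2, -0.5, 1.0])    # hypothetical policy parameters
q = np.array([1.0, 2.0, 4.0])          # hypothetical Q^pi(s, a)
v = float(np.dot(softmax(logits), q))  # V^pi(s)

grad_q = expected_gradient(logits, q)        # weight by Q
grad_adv = expected_gradient(logits, q - v)  # weight by A = Q - V

print(grad_q)
print(grad_adv)
```

The two gradients agree up to floating-point rounding, since the baseline term is $V$ times the expected score, and the expected score is zero.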

Now how do we estimate $A$? Two extremes:

  • Monte Carlo: $\hat{A}_t = G_t - V(s_t)$. Unbiased but high variance (you’re propagating noise from every future timestep).
  • TD(0): $\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Low variance but biased (you’re trusting the imperfect $V$).
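Here is a small sketch of both extremes on one three-step episode. The rewards and critic values are invented for illustration, and the terminal state is assumed to have value 0.

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 0.0, 1.0])      # r_0, r_1, r_2; the episode ends after r_2
values = np.array([0.5, 0.6, 0.8, 0.0])  # V(s_0..s_3), with V(terminal) = 0

# Discounted return-to-go G_t = r_t + gamma * G_{t+1}, computed backward.
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

adv_mc = returns - values[:-1]                       # Monte Carlo: G_t - V(s_t)
adv_td = rewards + gamma * values[1:] - values[:-1]  # TD(0): delta_t

print("Monte Carlo:", adv_mc)
print("TD(0):      ", adv_td)
```

The two estimates agree at the last step, where nothing is left to bootstrap from, and differ earlier: Monte Carlo uses the actual downstream reward, while TD(0) only sees one reward plus the critic's guess.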

Generalized Advantage Estimation (GAE) smoothly interpolates:

$$\hat{A}^{GAE(\gamma,\lambda)}_t = \sum_{k=0}^\infty (\gamma\lambda)^k \delta_{t+k}$$

At $\lambda = 0$ this collapses to TD(0). At $\lambda = 1$ it collapses to Monte Carlo: the $V$ terms in consecutive $\delta$'s telescope, leaving $G_t - V(s_t)$. In practice $\lambda \in [0.9, 0.97]$ is standard. Bias-variance dial with a single knob.
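A minimal sketch of GAE for a single finite episode, using the backward recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$, which is the same sum truncated at the end of the episode. The function name `gae`, the array layout (one more value than rewards, the last one being the bootstrap value), and the sample numbers are assumptions made for this example.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """GAE for one episode.

    `values` has len(rewards) + 1 entries; the last one is the value of the
    state after the final reward (0.0 if that state is terminal).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD errors delta_t
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running      # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv

rewards = [0.0, 0.0, 1.0]       # made-up sparse reward at the end
values = [0.5, 0.6, 0.8, 0.0]   # made-up critic outputs, terminal value 0

adv_td = gae(rewards, values, gamma=0.9, lam=0.0)    # equals the TD errors
adv_mc = gae(rewards, values, gamma=0.9, lam=1.0)    # equals G_t - V(s_t)
adv_mid = gae(rewards, values, gamma=0.9, lam=0.95)  # in between

print(adv_td, adv_mc, adv_mid)
```

Running the loop backward makes the cost linear in the episode length rather than quadratic, since each step reuses the sum already accumulated for the next step.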

Mental model

Advantage = how much better than average. GAE = how to compute it cheaply from samples + a learned value function.

The critic

Where does $V$ come from? You learn it. Run a second network (or a second head on the same backbone) that takes $s$ and predicts the return; train it to regress on the actually-observed returns:

$$\mathcal{L}_{critic}(\phi) = \mathbb{E}\Big[(V_\phi(s_t) - G_t)^2\Big]$$
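As a rough illustration of that regression, the sketch below fits a linear critic $V_\phi(s) = \phi^\top x(s)$ by gradient descent on the squared error. Everything here is a stand-in: the state features, the synthetic "observed returns", and the learning rate are invented for the example. A real critic would be a neural network trained on returns from rollouts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 64 visited states with 4 features each, plus noisy returns.
features = rng.normal(size=(64, 4))
true_w = np.array([1.0, -2.0, 0.5, 0.0])
returns = features @ true_w + 0.1 * rng.normal(size=64)  # stand-in for observed G_t

def critic_loss(phi):
    """Mean squared error between V_phi(s_t) and G_t."""
    return float(np.mean((features @ phi - returns) ** 2))

phi = np.zeros(4)  # critic parameters
loss_before = critic_loss(phi)

lr = 0.1
for _ in range(200):
    residual = features @ phi - returns               # V_phi(s_t) - G_t
    grad = 2.0 * features.T @ residual / len(returns)  # gradient of the MSE
    phi -= lr * grad

loss_after = critic_loss(phi)
print(loss_before, loss_after)
```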

This is the actor-critic architecture: the actor is $\pi_\theta$, the critic is $V_\phi$. Standard PPO has both. GRPO famously removes the critic (covered in the next lesson on group-relative methods).

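In LLM RL, the critic is typically either a scalar value head bolted to the LM trunk, which adds very few parameters, or a separate value model of roughly the policy's size, which roughly doubles memory and forward-pass FLOPs. The “doubled memory + instability” complaint about PPO is mostly about this critic, especially in the separate-model form.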

Key takeaways

  1. Advantage is the right thing to multiply $\nabla \log \pi$ by: unbiased estimator, low variance.
  2. Three ways to estimate advantage: Monte Carlo (unbiased, noisy), TD(0) (biased, smooth), GAE (dial between them with $\lambda$).
  3. GAE($\lambda \approx 0.95$, $\gamma = 1.0$) is the standard PPO config for LLM RL.
  4. The critic is a learned $V$. It costs memory and adds an instability source, which is exactly what GRPO removes.
  5. In sparse-reward LLM settings, the critic does the work of propagating final reward to earlier tokens.


TL;DR

  • $V^\pi(s)$, $Q^\pi(s,a)$, $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$.
  • Policy gradient should use advantage as the scaling factor.
  • GAE($\lambda$): $\hat{A}_t = \sum_k (\gamma\lambda)^k \delta_{t+k}$, with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
  • Standard config: $\gamma = 1.0$, $\lambda \approx 0.95$.

Why this matters

GAE is the PPO default. GRPO removes it. Knowing what it does is prerequisite to knowing why each is preferred.

Concrete walkthrough

Three advantage estimators and where each sits on the bias/variance trade-off:

| Estimator | Formula | Bias | Variance |
| --- | --- | --- | --- |
| Monte Carlo | $G_t - V(s_t)$ | Unbiased | High |
| TD(0) | $r_t + \gamma V(s_{t+1}) - V(s_t)$ | Biased by error in $V$ | Low |
| GAE($\lambda$) | $\sum_k (\gamma\lambda)^k \delta_{t+k}$ | Interpolates | Interpolates |

Critic loss:

$$\mathcal{L}_{critic} = (V_\phi(s_t) - G_t)^2 \quad \text{or} \quad \big(V_\phi(s_t) - (V_{\phi_{old}}(s_t) + \hat{A}_t)\big)^2$$

Combined actor-critic loss:

$$\mathcal{L} = -\log \pi_\theta(a_t \mid s_t) \cdot \hat{A}_t + c_v \mathcal{L}_{critic} - c_e \mathcal{H}(\pi)$$

with $c_v \approx 0.5$ (value-loss weight) and $c_e \approx 0.01$ (entropy-bonus weight).
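A sketch of how the three terms combine over a small batch, in plain NumPy. The function name `actor_critic_loss`, its argument names, and the sample numbers are assumptions for illustration. In an autodiff framework the advantages would be treated as constants (no gradient flows through them), which NumPy cannot express, so this only evaluates the scalar loss.

```python
import numpy as np

def actor_critic_loss(logp_taken, adv, values, value_targets, entropy,
                      c_v=0.5, c_e=0.01):
    """Batch-averaged actor-critic loss.

    logp_taken:    log pi_theta(a_t | s_t) for the actions actually taken
    adv:           advantage estimates (treated as constants)
    values:        critic predictions V_phi(s_t)
    value_targets: regression targets for the critic (e.g. observed returns)
    entropy:       per-step policy entropy H(pi(. | s_t))
    """
    policy_loss = -np.mean(logp_taken * adv)
    critic_loss = np.mean((values - value_targets) ** 2)
    entropy_bonus = np.mean(entropy)
    return float(policy_loss + c_v * critic_loss - c_e * entropy_bonus)

# Made-up two-step batch.
logp_taken = np.log(np.array([0.5, 0.25]))
adv = np.array([1.0, -1.0])
values = np.array([0.5, 0.5])
value_targets = np.array([1.0, 0.0])
entropy = np.array([1.0, 1.0])

loss = actor_critic_loss(logp_taken, adv, values, value_targets, entropy)
print(loss)
```

Minimizing the first term raises the log-probability of actions with positive advantage and lowers it for negative ones; the entropy term enters with a minus sign because higher entropy is being rewarded.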

Key takeaways

  1. Advantage = Q - V. Mean-zero. Use this in PG.
  2. GAE($\lambda$) interpolates Monte Carlo and TD(0).
  3. $\lambda \approx 0.95$ is the PPO default.
  4. Critic cost is the reason GRPO is popular.
