Skip to content

MDPs & Bellman Equations

Before you can train a model with RL, you have to agree on what “the model” and “the environment” even mean. The answer is a Markov Decision Process — five symbols on a postcard that describe every RL setting from CartPole to GPT-4 post-training. Almost every “advanced” RL idea (advantage, GAE, TD-learning, PPO’s clip) is just a clever way to estimate or stabilize a Bellman equation. Get this lesson cold and the rest of the track stops being magic.

TL;DR

  • A Markov Decision Process is the 5-tuple (S,A,P,R,γ)(\mathcal{S}, \mathcal{A}, P, R, \gamma) — states, actions, transition probabilities, reward function, discount factor.
  • The Markov property: the future depends only on the current state. In LLM RL, “state” = the full token history so far; this is technically Markov because nothing else exists.
  • The return Gt=k=0γkrt+k+1G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} is what we want to maximize. γ[0,1)\gamma \in [0,1) keeps it finite.
  • Value functions: Vπ(s)V^\pi(s) = expected return from ss following π\pi; Qπ(s,a)Q^\pi(s,a) = same but you take action aa first.
  • Bellman equation: Vπ(s)=Eπ[r+γVπ(s)]V^\pi(s) = \mathbb{E}_\pi[r + \gamma V^\pi(s')] — value of a state is its immediate reward plus the discounted value of the next state. Every RL algorithm bootstraps off this.

Why this matters

LLM post-training is a degenerate MDP: a single step per episode (in the simplest framing — the prompt is s0s_0, the completion is the action, the verifier emits the reward, episode ends). But the formalism still gives you the vocabulary — advantage, value baseline, return — that PPO/GRPO/DPO papers assume you already know. Multi-turn agentic RL (the 2026 frontier) drops the degenerate framing entirely and becomes a real sequential MDP. If you don’t have MDPs at your fingertips, the agentic RL papers read like noise.

The concept

An MDP is the mathematical model of an agent acting in a world:

  • At each timestep tt the agent is in state stSs_t \in \mathcal{S}.
  • It picks an action atAa_t \in \mathcal{A} from its policy π(as)\pi(a|s).
  • The environment moves it to state st+1P(st,at)s_{t+1} \sim P(\cdot | s_t, a_t) and gives it reward rt+1=R(st,at,st+1)r_{t+1} = R(s_t, a_t, s_{t+1}).
  • The agent’s job is to find π\pi that maximizes expected discounted return E[G0]\mathbb{E}[G_0].

The Bellman expectation equation is the most important equation in RL:

Vπ(s)=aπ(as)sP(ss,a)[R(s,a,s)+γVπ(s)]V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \big[ R(s,a,s') + \gamma V^\pi(s') \big]

It says: the value of a state under policy π\pi equals the average over all actions π\pi might take, of the immediate reward plus the discounted value of where you land. Every value-based method (Q-learning, DQN, SARSA, TD(λ\lambda)) is some way of estimating VV or QQ from samples. Every policy-gradient method uses a value estimate as a baseline (next lesson). And the KL-constrained methods (PPO, TRPO) constrain changes to π\pi in a way that preserves the Bellman structure.

Mental model

Translating to LLM RL: s0s_0 is the prompt, ata_t is the next token, PP is deterministic concatenation, RR is 0 for every token except the last (where the verifier or reward model emits a score). The episode is a sequence of tokens; the return is the terminal reward.

Key takeaways

  1. Memorize the 5-tuple. (S,A,P,R,γ)(\mathcal{S}, \mathcal{A}, P, R, \gamma). You’ll see it in every paper.
  2. State = full history in LLM RL. The Markov property is satisfied trivially.
  3. The Bellman equation is the foundation. Value functions, advantage, TD-learning, GAE — all are bootstraps off it.
  4. Discount factor isn’t a hyperparameter, it’s a modeling choice. In LLM RL with a terminal reward, γ=1\gamma = 1 is the natural setting; in any infinite-horizon problem, γ<1\gamma < 1 is the only way returns stay finite.
  5. MDPs assume stationarity. In RLHF with a moving policy and a frozen reward model, this is almost but not quite true; importance sampling fixes it.

Go deeper

TL;DR

  • MDP = (S,A,P,R,γ)(\mathcal{S}, \mathcal{A}, P, R, \gamma).
  • Bellman: Vπ(s)=Eπ[r+γVπ(s)]V^\pi(s) = \mathbb{E}_\pi[r + \gamma V^\pi(s')] and Qπ(s,a)=E[r+γVπ(s)]Q^\pi(s,a) = \mathbb{E}[r + \gamma V^\pi(s')].
  • LLM RL is a (usually) terminal-reward MDP — prompt is s0s_0, tokens are actions.
  • Every value-based method estimates VV or QQ; every policy-gradient method uses one as a baseline.

Why this matters

Vocabulary load-bearing for the rest of the track. Papers will not re-derive this.

Concrete walkthrough

SymbolLLM RL meaning
S\mathcal{S}All possible token-history prefixes
A\mathcal{A}Vocabulary (~128K tokens for modern models)
$P(s’s,a)$
R(s,a,s)R(s,a,s')Usually 0 until terminal; verifier/RM emits final score
γ\gammaTypically 1.0 (terminal reward) or 0.99 (per-token shaping)
Vπ(s)V^\pi(s)Expected final reward from token-prefix ss under policy π\pi
Qπ(s,a)Q^\pi(s,a)Expected final reward if next token is aa
Advantage Aπ(s,a)A^\pi(s,a)Qπ(s,a)Vπ(s)Q^\pi(s,a) - V^\pi(s) — how much better aa is than average

Bellman expectation:

Vπ(s)=aπ(as)[R(s,a)+γsP(ss,a)Vπ(s)]V^\pi(s) = \sum_a \pi(a|s) \big[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') \big]

Bellman optimality:

V(s)=maxa[R(s,a)+γsP(ss,a)V(s)]V^*(s) = \max_a \big[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \big]

The optimal policy: π(s)=argmaxaQ(s,a)\pi^*(s) = \arg\max_a Q^*(s,a).

Key takeaways

  1. MDP 5-tuple is non-negotiable vocabulary.
  2. Bellman expectation is the recursion behind every value method.
  3. In LLM RL, γ=1\gamma=1 and terminal reward are the standard simplification.
  4. State = full token history (trivially Markov).
  5. Advantage A=QVA = Q - V is the causal quantity policy gradient should care about.

Go deeper