MDPs & Bellman Equations
Before you can train a model with RL, you have to agree on what “the model” and “the environment” even mean. The answer is a Markov Decision Process — five symbols on a postcard that describe every RL setting from CartPole to GPT-4 post-training. Almost every “advanced” RL idea (advantage, GAE, TD-learning, PPO’s clip) is just a clever way to estimate or stabilize a Bellman equation. Get this lesson cold and the rest of the track stops being magic.
TL;DR
- A Markov Decision Process is the 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: states, actions, transition probabilities, reward function, discount factor.
- The Markov property: the future depends only on the current state. In LLM RL, “state” = the full token history so far; this is technically Markov because nothing else exists.
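- The return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ is what we want to maximize. $\gamma < 1$ keeps it finite.
- Value functions: $V^\pi(s)$ = expected return from state $s$ when following $\pi$; $Q^\pi(s,a)$ = same but you take action $a$ first.
- Bellman equation: $V^\pi(s) = \mathbb{E}_\pi\left[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\right]$. The value of a state is its immediate reward plus the discounted value of the next state. Every RL algorithm bootstraps off this.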
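As a minimal sketch of the return from the TL;DR, the snippet below computes a discounted return for a finite reward sequence. The function name `discounted_return` and the reward values are illustrative assumptions, not part of the lesson.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards."""
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r_1, r_2, r_3
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The backward loop is the same recursion the Bellman equation expresses in expectation: today's return is today's reward plus the discounted return from tomorrow onward.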
Why this matters
LLM post-training is a degenerate MDP: a single step per episode (in the simplest framing, the prompt is $s_0$, the completion is the action, the verifier emits the reward, and the episode ends). But the formalism still gives you the vocabulary (advantage, value baseline, return) that PPO/GRPO/DPO papers assume you already know. Multi-turn agentic RL (the 2026 frontier) drops the degenerate framing entirely and becomes a real sequential MDP. If you don’t have MDPs at your fingertips, the agentic RL papers read like noise.
The concept
An MDP is the mathematical model of an agent acting in a world:
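- At each timestep $t$ the agent is in state $s_t \in \mathcal{S}$.
- It picks an action $a_t \in \mathcal{A}$ from its policy $\pi(a_t \mid s_t)$.
- The environment moves it to state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and gives it reward $r_{t+1} = R(s_t, a_t)$.
- The agent’s job is to find the $\pi$ that maximizes expected discounted return $\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$.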
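The loop in the bullets above can be sketched in a few lines of Python. Everything here is an illustrative assumption: the two-state MDP (`"A"`, `"B"`), its transition table `P`, its reward table `R`, and the uniform random policy are made up for the example.

```python
import random

# Hypothetical 2-state MDP.
# P[s][a] is a list of (probability, next_state); R[s][a] is the reward.
P = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
R = {
    "A": {"stay": 0.0, "go": 0.0},
    "B": {"stay": 1.0, "go": 0.0},
}


def policy(state, rng):
    """pi(a | s): uniform random over the actions available in `state`."""
    return rng.choice(sorted(P[state]))


def step(state, action, rng):
    """Environment: sample s' from P(. | s, a) and return (s', reward)."""
    probs, next_states = zip(*P[state][action])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[state][action]


def rollout(start, horizon, gamma, seed=0):
    """Run the agent-environment loop and return the discounted return."""
    rng = random.Random(seed)
    state, g, discount = start, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state, rng)
        state, reward = step(state, action, rng)
        g += discount * reward
        discount *= gamma
    return g


print(rollout("A", horizon=20, gamma=0.9))
```

Averaging `rollout` over many seeds gives a Monte Carlo estimate of the expected discounted return for this policy, which is the quantity the agent is trying to maximize.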
The Bellman expectation equation is the most important equation in RL:
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right]$$
It says: the value of a state under policy $\pi$ equals the average, over all actions $\pi$ might take, of the immediate reward plus the discounted value of where you land. Every value-based method (Q-learning, DQN, SARSA, TD($\lambda$)) is some way of estimating $V$ or $Q$ from samples. Most policy-gradient methods use a value estimate as a baseline (next lesson). And the KL-constrained methods (PPO, TRPO) limit how far $\pi$ moves per update, so that value and advantage estimates gathered under the old policy stay approximately valid.
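Applied repeatedly as an update rule, the Bellman expectation equation becomes iterative policy evaluation. The sketch below assumes a made-up two-state MDP, a uniform random policy `PI`, and $\gamma = 0.9$; none of these come from the lesson itself.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP.
# P[s][a] is a list of (probability, next_state); R[s][a] is the reward.
P = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
R = {
    "A": {"stay": 0.0, "go": 0.0},
    "B": {"stay": 1.0, "go": 0.0},
}
# Uniform random policy pi(a | s).
PI = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}


def bellman_backup(V, s):
    """Right-hand side of the Bellman expectation equation at state s."""
    return sum(
        PI[s][a] * (R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a]))
        for a in P[s]
    )


def evaluate_policy(tol=1e-10):
    """Apply the backup to every state until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: bellman_backup(V, s) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


V = evaluate_policy()
print(V)
```

At convergence, `V[s]` and `bellman_backup(V, s)` agree for every state, which is what it means for $V^\pi$ to satisfy the equation. Sample-based methods such as TD learning replace the exact sums with estimates from sampled transitions.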
Mental model
Translating to LLM RL: $s_0$ is the prompt, $a_t$ is the next token, $P$ is deterministic concatenation ($s_{t+1} = s_t \oplus a_t$), and $R$ is 0 for every token except the last (where the verifier or reward model emits a score). The episode is a sequence of tokens; the return is the terminal reward.
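A toy version of this mapping, assuming a hypothetical `verifier` that rewards a completion whose final token is `"4"`. The tokenization, the prompt, and the function names are all illustrative, not a real training setup.

```python
def transition(state, token):
    """Deterministic P: the next state is the current prefix plus the new token."""
    return state + (token,)


def verifier(state):
    """Hypothetical terminal reward: 1.0 if the final token is '4', else 0.0."""
    return 1.0 if state[-1] == "4" else 0.0


def run_episode(prompt, completion_tokens):
    """Replay a completion token by token, collecting per-token rewards."""
    state = tuple(prompt)
    rewards = []
    for i, token in enumerate(completion_tokens):
        state = transition(state, token)
        is_last = i == len(completion_tokens) - 1
        # Reward is 0 for every token except the last one.
        rewards.append(verifier(state) if is_last else 0.0)
    return state, rewards


def returns_to_go(rewards, gamma=1.0):
    """Discounted return from each step to the end of the episode."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


state, rewards = run_episode(["2", "+", "2", "="], ["The", "answer", "is", "4"])
print(rewards)                 # [0.0, 0.0, 0.0, 1.0]
print(returns_to_go(rewards))  # [1.0, 1.0, 1.0, 1.0]
```

With $\gamma = 1$ every token position sees the same return, namely the terminal reward, which is why credit assignment across tokens needs a baseline or advantage estimate rather than the raw return.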
Key takeaways
- Memorize the 5-tuple. $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. You’ll see it in every paper.
- State = full history in LLM RL. The Markov property is satisfied trivially.
- The Bellman equation is the foundation. Value functions, advantage, TD-learning, GAE — all are bootstraps off it.
- Discount factor isn’t just a hyperparameter, it’s a modeling choice. In LLM RL with a terminal reward, $\gamma = 1$ is the natural setting; in an infinite-horizon problem with bounded rewards, $\gamma < 1$ is what keeps returns finite.
- MDPs assume stationarity. In RLHF with a frozen reward model the dynamics and rewards are fixed, but the policy that generated the data keeps moving, so samples are always slightly off-policy; importance sampling corrects for that mismatch.
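The claim about $\gamma$ and finite returns follows from a geometric series. A short derivation, under the added assumption that rewards are bounded by some constant $R_{\max}$ (a symbol introduced here for illustration):

```latex
% Assumes bounded rewards: |r_t| \le R_{\max} for all t.
|G_t| = \left| \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \right|
      \le \sum_{k=0}^{\infty} \gamma^k R_{\max}
      = \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1 .
% Example: \gamma = 0.99 gives |G_t| \le 100\, R_{\max}.
```

With $\gamma = 1$ the bound no longer applies, which is fine for LLM episodes because they terminate after finitely many tokens.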
Go deeper
- Book: Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed). Chapters 3 and 4 are the canonical reference. Free PDF. If you read only one book on RL, this is it.
- Video: David Silver, UCL RL Course. Lectures 2 and 3 are the cleanest video explanation of MDPs and the Bellman equation. Watch on 1.25×.
- Docs: OpenAI Spinning Up, Intro to RL. The terse, engineer-friendly version of Sutton & Barto Ch 3. Best 90-minute orientation.
- Blog: Lilian Weng, Policy Gradient Algorithms. The MDP setup at the top of this post is the cleanest one-page summary anywhere online.
- Video: Berkeley CS285, Sergey Levine. Lecture 2 covers MDPs at the rigor level you actually need for research. Heavier than Silver.
TL;DR
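- MDP = $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
- Bellman: $V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s,a)\right]$ and $Q^\pi(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim P}\left[V^\pi(s')\right]$.
- LLM RL is a (usually) terminal-reward MDP: the prompt is $s_0$, tokens are actions.
- Every value-based method estimates $V$ or $Q$; most policy-gradient methods use one as a baseline.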
Why this matters
This vocabulary is load-bearing for the rest of the track. Papers will not re-derive it.
Concrete walkthrough
| Symbol | LLM RL meaning |
|---|---|
| $\mathcal{S}$ | All possible token-history prefixes |
| $\mathcal{A}$ | Vocabulary (~128K tokens for modern models) |
| $P(s' \mid s,a)$ | Deterministic concatenation: $s'$ is $s$ with token $a$ appended |
| $R$ | Usually 0 until terminal; verifier/RM emits final score |
| $\gamma$ | Typically 1.0 (terminal reward) or 0.99 (per-token shaping) |
| $V^\pi(s)$ | Expected final reward from token-prefix $s$ under policy $\pi$ |
| $Q^\pi(s,a)$ | Expected final reward if next token is $a$ |
| Advantage $A^\pi(s,a)$ | $Q^\pi(s,a) - V^\pi(s)$: how much better $a$ is than average |
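To make the last three rows concrete, here is a small worked example at a single token prefix. The candidate tokens, the policy probabilities `pi`, and the values in `q` are invented for illustration.

```python
def advantages(pi, q):
    """Return (V, A) where V = sum_a pi(a) * Q(a) and A(a) = Q(a) - V."""
    v = sum(pi[a] * q[a] for a in pi)
    return v, {a: q[a] - v for a in pi}


# Hypothetical next-token distribution and Q-values at one prefix.
pi = {"4": 0.5, "5": 0.25, "four": 0.25}
q = {"4": 1.0, "5": 0.0, "four": 1.0}

v, adv = advantages(pi, q)
print(v)    # 0.75
print(adv)  # {'4': 0.25, '5': -0.75, 'four': 0.25}
```

The advantages average to zero under the policy (0.5 × 0.25 + 0.25 × (−0.75) + 0.25 × 0.25 = 0), so they measure only how each token compares with what the policy would do on average.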
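Bellman expectation:
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s') \right]$$
Bellman optimality:
$$V^*(s) = \max_{a} \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \right]$$
The optimal policy: $\pi^*(s) = \arg\max_a Q^*(s,a)$, where $Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s')$.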
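Bellman optimality can be applied as an update rule in the same way, which gives value iteration. The sketch below uses a made-up two-state MDP with $\gamma = 0.9$ and then reads off a greedy policy; the MDP and all names are assumptions for illustration.

```python
GAMMA = 0.9

# Hypothetical 2-state MDP.
# P[s][a] is a list of (probability, next_state); R[s][a] is the reward.
P = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
R = {
    "A": {"stay": 0.0, "go": 0.0},
    "B": {"stay": 1.0, "go": 0.0},
}


def q_from_v(V, s, a):
    """Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * V(s')."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def value_iteration(tol=1e-10):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_from_v(V, s, a) for a in P[s]) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


V_star = value_iteration()
# Greedy policy: pick the action with the highest Q in each state.
pi_star = {s: max(P[s], key=lambda a: q_from_v(V_star, s, a)) for s in P}
print(pi_star)  # {'A': 'go', 'B': 'stay'}
```

In this toy problem the only reward comes from staying in `"B"`, so the greedy policy heads to `"B"` and stays there, and `V_star["B"]` approaches $1 / (1 - 0.9) = 10$.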
Key takeaways
- MDP 5-tuple is non-negotiable vocabulary.
- Bellman expectation is the recursion behind every value method.
- In LLM RL, $\gamma = 1$ and terminal reward are the standard simplification.
- State = full token history (trivially Markov).
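- Advantage $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ is the quantity policy gradient should care about: how much a specific action improves on the policy's average behavior in that state.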
Go deeper
- Book: Sutton & Barto, RL: An Introduction (Ch 3-4). The reference.
- Docs: OpenAI Spinning Up, Intro to RL. Best terse engineer-friendly summary.
- Video: David Silver, UCL RL Course (Lec 2-3). Cleanest video explanation.