GRPO & RL Reasoning
DPO killed RLHF for the easy case. For the hard case — reasoning — RL came back, and it came back simpler.
The classical RL recipe (PPO) needed a learned value function: a second neural network the same shape as the policy, trained alongside it, that predicts “expected future reward from this state.” The value function gave you a baseline to subtract from raw rewards (variance reduction), but it doubled GPU memory, doubled forward passes, and was a famous source of training instability — the value loss could diverge from the policy and quietly poison the gradient.
GRPO (DeepSeek, 2024) threw it out. Instead of learning a value baseline, it samples G rollouts for each prompt, scores them with a checker, and normalizes the rewards within the group: (r_i − μ) / σ. The mean of the group is your baseline. The std is your scale. No second network.
Pair that with verifiable rewards — math answers checked by a parser, code checked by unit tests, format checked by regex — and you don’t need a learned reward model either. The training signal is a bash exit code.
The result, in January 2025, was DeepSeek-R1. Open weights. Public recipe. Reasoning capability that matched OpenAI’s then-proprietary o1. The whole field shifted in three months. This lesson is the algorithm, the recipe, and the “aha moment” that made everyone pay attention.
TL;DR
- GRPO (Group Relative Policy Optimization) is the RL algorithm DeepSeek used for R1. It removes PPO’s value function — uses group-relative advantages instead.
- The training signal is verifiable: math problems checked by a parser, code checked by unit tests. No reward model, no human labels.
- The model learns to emit long chain-of-thought by RL on this signal alone — DeepSeek-R1-Zero went from 15% → 71% on AIME with no SFT, just GRPO on math.
- R1 final uses cold-start SFT → GRPO → SFT-on-best-rollouts → another GRPO pass. The “aha moment” — the model spontaneously starts saying “Wait, let me reconsider” — emerged purely from the RL signal.
- OpenAI’s o1/o3 family is widely believed to use a similar (proprietary) RL-on-reasoning recipe. As of April 2026, GRPO and its variants are the dominant post-training paradigm for reasoning models.
Mental model
The algorithm: for each prompt, sample a group of G rollouts, score them with a verifier, normalize the rewards within the group, take a gradient step.
The trick is that the group-relative normalization removes the need for a value function: variance reduction comes from the group, not from a learned baseline.
GRPO step by step
For a prompt q, sample G completions o_1, …, o_G from the current policy π_old and score each with a verifier: r_i = R(q, o_i).

Group-relative advantage:

A_i = (r_i − μ) / σ

where μ and σ are the mean and std of {r_1, …, r_G}.

Loss (per token t in completion i):

L_i,t = −min( ρ_i,t · A_i, clip(ρ_i,t, 1 − ε, 1 + ε) · A_i ) + β · KL(π_θ ‖ π_ref)

where ρ_i,t = π_θ(o_i,t | q, o_i,<t) / π_old(o_i,t | q, o_i,<t) is the PPO-style importance ratio (current policy probability / old policy probability of the same token), and clip is identical to PPO’s clip term.
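In code, that objective is a few lines. A minimal PyTorch sketch of the per-token loss; the tensor names are mine, and the KL term uses the simple unbiased (k3) estimator, so treat this as an illustration rather than any library's exact code:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, mask, eps=0.2, beta=0.04):
    """Toy per-token GRPO loss for one prompt's group of G completions.

    logp_new / logp_old / logp_ref: [G, T] log-probs of the sampled tokens
    under the current, rollout-time, and frozen reference policies.
    rewards: [G] scalar verifier scores. mask: [G, T], 1 for real tokens.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)    # [G]
    adv = adv[:, None]                                           # broadcast over tokens

    # PPO-style importance ratio and clipped surrogate.
    ratio = torch.exp(logp_new - logp_old)                       # [G, T]
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # KL penalty against the frozen reference (unbiased "k3" estimator).
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0                          # >= 0 per token

    per_token = -surrogate + beta * kl
    return (per_token * mask).sum() / mask.sum()                 # mean over real tokens
```

In a real run, logp_old comes from the rollout workers and logp_ref costs the one extra forward pass through the frozen reference model described in the systems section below.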
What’s missing vs PPO:
- No value function V(s). The variance reduction comes from group-relative advantage. This saves you from training a separate critic — which in PPO doubles forward passes and adds instability.
- No GAE. Just instantaneous group-normalized reward.
What’s still there:
- KL penalty against a reference (frozen base) policy — keeps the policy from collapsing.
- Importance ratio + clip — prevents one big update from breaking things.
The R1 recipe (the full pipeline)
DeepSeek-R1 was trained in four phases:
Phase 1: Cold-start SFT
- ~100K human-written long CoT examples on math/code
- Output: a base capable of long-form reasoning chains
Phase 2: RL with GRPO (the heavy lifting)
- Math problems verified by exact-match
- Code problems verified by unit tests
- Reward = 1 (correct) or 0 (wrong) + format bonus for using `<think>...</think>` (a minimal verifier is sketched after this list)
- ~10K-30K RL steps over millions of rollouts
- Result: model emits 1000-10000 token reasoning chains, accuracy soars
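What does that verifier actually look like? A sketch of the math reward plus format bonus; the 0.1 bonus value and the \boxed{} answer convention are illustrative assumptions, not DeepSeek's published choices:

```python
import re

THINK_RE = re.compile(r"<think>.+?</think>", re.DOTALL)

def math_reward(completion: str, gold_answer: str) -> float:
    """Verifiable reward: exact-match correctness plus a small format bonus.

    The 0.1 bonus and the \\boxed{} convention are illustrative, not the
    paper's exact values.
    """
    r = 0.0
    if THINK_RE.search(completion):                    # used <think>...</think>
        r += 0.1
    m = re.search(r"\\boxed\{([^}]*)\}", completion)   # math-style final answer
    if m and m.group(1).strip() == gold_answer.strip():
        r += 1.0                                       # correct answer
    return r
```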
Phase 3: Rejection-sampling SFT
- For each problem, take the best rollouts from the phase-2 policy (selection sketched after this list)
- SFT the base model on these high-quality reasoning traces
- Mixes in non-reasoning data (chat, safety) to keep general capability
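The selection step referenced above, sketched with assumed helpers: `sample_fn` and `reward_fn` are placeholders, not a real API, and keeping the shortest correct trace is just one possible quality heuristic:

```python
def build_sft_dataset(problems, sample_fn, reward_fn, n_samples=16):
    """Rejection sampling: keep only verified-correct rollouts as SFT data.

    sample_fn(prompt, n) -> list of completions from the phase-2 policy.
    reward_fn(completion, gold) -> float. Both are assumed placeholders.
    """
    dataset = []
    for prompt, gold in problems:
        rollouts = sample_fn(prompt, n_samples)
        correct = [c for c in rollouts if reward_fn(c, gold) >= 1.0]
        if correct:
            # Keep the shortest correct trace as a cheap quality proxy
            # (an illustrative choice, not the paper's method).
            dataset.append({"prompt": prompt, "completion": min(correct, key=len)})
    return dataset
```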
Phase 4: Final RL pass
- Another GRPO round on the SFT'd model from phase 3
- Adds a smaller reward signal for being “helpful and harmless”

The result: R1 matches o1 on AIME, MATH, GPQA, and approaches it on coding. Open weights. Reproducible.
The “aha moment”
The most-cited result from the R1 paper: during phase 2 RL, the model spontaneously develops self-reflection patterns — phrases like “Wait, let me reconsider this”, “Actually, I think I made an error here”. These weren’t in any training data. They emerged because they led to higher rewards.
This is the closest the field has come to demonstrating learned reasoning behavior, as opposed to imitated reasoning behavior. It’s a real qualitative phenomenon and it’s the reason every lab pivoted in 2025.
Real-world adoption (April 2026)
| Model | Confirmed / inferred recipe |
|---|---|
| OpenAI o1 (Sep 2024) | RL on long CoT, details proprietary; widely believed to be PPO or a PPO variant + verifiable rewards |
| DeepSeek-R1 (Jan 2025) | GRPO + cold-start SFT, public recipe |
| Anthropic Claude 3.7 (early 2025) | Reasoning mode, recipe undisclosed |
| Gemini 2.5 reasoning (mid 2025) | Reasoning mode, recipe undisclosed |
| Qwen3-Thinking (2025) | Open source GRPO-derivative, very strong on reasoning |
| Llama-4 reasoning variant | GRPO-flavored RL with verifiable rewards |
GRPO and its variants (Dr. GRPO with stronger advantage estimators, Reinforce++ with simpler updates, etc.) are now the open default for reasoning post-training.
What the systems engineer should know
The math is on a postcard. The systems are not. A GRPO step at scale touches:
- Inference cluster — sample G rollouts per prompt, often 4096 tokens each, possibly with a different model checkpoint than the training cluster. vLLM or SGLang on a separate pool.
- Verifier service — math parsers, code sandboxes (often firejail or Bubblewrap), format regexes. Latency is the bottleneck once the rollout cluster is sized right. A minimal test-runner is sketched after this list.
- Trainer — the GRPO loss itself, the KL term against a frozen ref, the importance ratio. Standard PyTorch DDP/FSDP with one extra forward pass for the ref model (often offloaded or quantized to keep memory in check).
- Replay buffer / scheduler — the off-policy gap between rollout-time and gradient-time matters for stability.
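The test-runner half of the verifier service is close to literal “reward = exit code”. A bare-bones sketch: the file layout is invented for the example, and a production service would wrap this call in a real sandbox (firejail, Bubblewrap, or a container), since the code being executed is model-generated:

```python
import os
import subprocess
import sys
import tempfile

def code_reward(solution_code: str, test_code: str, timeout_s: float = 10.0) -> float:
    """Run the model's code against unit tests; the exit code is the reward.

    No sandboxing here; a real verifier service must isolate this process.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "solution_with_tests.py")
        with open(path, "w") as f:
            f.write(solution_code + "\n\n" + test_code)
        try:
            proc = subprocess.run([sys.executable, path],
                                  capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return 0.0                 # infinite loops and hangs score zero
        return 1.0 if proc.returncode == 0 else 0.0
```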
In open implementations like verl (volcengine) and open-r1 (Hugging Face), each of those is a separate process. The Python the user writes is `trainer = GRPOTrainer(...); trainer.train()`. What it orchestrates is a distributed system.
Toy GRPO update
The whole thing fits on a postcard. The engineering of running it at scale (efficient rollout, prefix caching, scheduling) is harder than the math.
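Here is that postcard as a self-contained NumPy toy: the “policy” is a softmax over four candidate answers, the verifier accepts answer 2, and each step samples a group, normalizes rewards within it, and takes a REINFORCE-style gradient step. No clipping or KL, since the toy takes one on-policy update per batch of fresh rollouts; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
G, lr = 8, 0.5
logits = np.zeros(4)        # toy policy: softmax over 4 candidate answers
CORRECT = 2                 # the verifier accepts answer 2

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(50):
    probs = softmax(logits)
    # 1. Sample a group of G rollouts from the current policy.
    actions = rng.choice(4, size=G, p=probs)
    # 2. Score each rollout with the verifier: 1 if correct, else 0.
    rewards = (actions == CORRECT).astype(float)
    # 3. Group-relative advantage: (r - mean) / std.
    std = rewards.std()
    if std == 0:            # whole group right or wrong: no learning signal
        continue
    adv = (rewards - rewards.mean()) / std
    # 4. Policy gradient: d/dlogits of log pi(a) = onehot(a) - probs.
    grad = np.zeros(4)
    for a, A in zip(actions, adv):
        g = -probs.copy()
        g[a] += 1.0
        grad += A * g
    logits += lr * grad / G

print(softmax(logits))      # mass concentrates on the correct answer
```

The std == 0 skip is worth noticing: a group where every rollout gets the same reward carries zero advantage, which is equally true in the full algorithm.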
Key takeaways
- GRPO removes the value function from PPO — group-relative advantage replaces it. Massive engineering simplification.
- Verifiable rewards are the unlock — math, code, and constrained tasks where correctness is checkable. RL on these works without a learned reward model.
- R1’s recipe is public and reproducible. Cold-start SFT → GRPO → rejection-sampling SFT → final GRPO. Several open replications now exist (DeepSeek-R1-Distill, Light-R1, OpenR1).
- The “aha moment” is real. Self-correction patterns emerge from the RL signal alone. This is qualitatively new behavior in 2025.
- GRPO is just the first of a growing family of variants — Dr. GRPO, Reinforce++, GRPO with token-level advantage. Watch the literature; the algorithm is still evolving.
Go deeper
- Paper — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. The paper. Section 2.2 has the full GRPO objective. Required reading.
- Paper — DeepSeekMath: Pushing the Limits of Mathematical Reasoning (introduces GRPO). Where GRPO was first introduced, pre-R1. Foundation paper.
- Paper — Dr. GRPO: A Better Advantage Estimator for GRPO. A variance-reduction improvement on basic GRPO. State of the art for reasoning RL through 2025.
- Blog — Hugging Face, Open R1 Project. A community-led full open replication of R1. Code, recipes, ablations.
- Repo — huggingface/open-r1. Reference open implementation. Read `src/open_r1/grpo.py`.
- Video — Andrej Karpathy, Deep Dive on RL Training. A 90-minute walkthrough of RL post-training including GRPO. Best non-paper explainer.
- Paper — Proximal Policy Optimization (PPO). The predecessor. Read once to understand what GRPO simplifies away.
- Repo — volcengine/verl. A production-grade RL training framework that supports GRPO at scale.
- Blog — Nathan Lambert, The Rise of Reasoning Machines. Excellent context on why GRPO mattered, post-R1.
Why this matters
Pre-2024 LLMs were trained to imitate human-written text. Reasoning came out OK but capped at “what humans usually write down”. RL on verifiable rewards lets the model discover reasoning strategies that humans wouldn’t have written explicitly — because they only had to succeed, not look natural.
The DeepSeek-R1 paper (January 2025) was the first public demonstration that this works at scale. Open weights. Reproducible. The whole field shifted within months. Every frontier lab now has an o1/R1-style recipe.
If you’re in ML systems and you’re not familiar with GRPO at the equation level, you’re 18 months behind in 2026.