Self-Play, Curriculum & RL for Code
Three threads run through modern RL for LLMs: self-play (the model improves by training on its own outputs or against earlier versions of itself); curriculum learning (start easy, then raise the difficulty on a schedule); and RL for code (the cleanest applied success story, with verifiable rewards from unit tests, and reportedly the approach behind coding agents such as Devin and Claude Code). This lesson covers the three patterns and how they intersect.
TL;DR
- Self-play: the policy generates rollouts; the better rollouts become training data for the next iteration. Iterative self-improvement. R1-Zero is essentially self-play with a verifier; AlphaZero-style games are the historical reference.
- Curriculum: schedule the difficulty of training prompts. Start with easy problems where signal is dense; ramp to hard ones where the policy is challenged. Empirically valuable; theoretically poorly understood.
- RL for code: the canonical RLVR application. Code is verifiable (unit tests), debuggable (interpreters give signal), and high-value (paying customers close the feedback loop). The models behind Devin, Claude Code, and GitHub Copilot Workspace are reported to train on this kind of signal.
- Common pattern: cold-start SFT → curriculum-scheduled GRPO+RLVR with self-play sample generation. Three threads converging into one recipe.
Why this matters
These three patterns recur in every active 2026 RL recipe. Self-play is how DeepSeek-R1 generated reasoning data without human writers. Curriculum is how compute-efficient training works in practice. RL for code is the most-shipped applied product of the entire RL-for-LLMs effort and is the simplest case study to learn from.
Self-play
Definition: the policy generates training data; better-quality outputs become next iteration’s targets. The model trains against (or from) its own outputs.
Classical self-play (AlphaGo, AlphaZero): a single policy plays games against itself or against earlier checkpoints; the game outcomes and search statistics become the training targets for the next version. Capability climbs via internal competition.
LLM self-play variants:
- Rejection-sampling SFT: for each prompt, sample G rollouts; keep the best (highest reward); SFT on those. Repeat. This is R1’s Phase 3. Each iteration, the policy generates higher-quality training data because it’s getting better.
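- DPO self-play (SPIN): at each iteration, the previous checkpoint generates a response for each SFT prompt; that response is paired with the human-written SFT response, and a DPO-style loss treats the human response as “chosen” and the model’s own response as “rejected”. Each iteration pushes the policy away from what its previous self would have produced and toward the target data.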
- Multi-agent self-play: two LLMs debate or play games; outcomes train both. Less common in production; active research.
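The DPO-style variant in the list above optimizes a pairwise preference loss. As a minimal sketch (not SPIN's actual implementation), the per-pair loss can be written from sequence log-probabilities under the trainable policy and a frozen reference, which in self-play setups is typically the previous checkpoint. The function name, argument names, and `beta=0.1` are assumptions chosen for illustration.

```python
import math

def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Margin: how much more the policy prefers "chosen" over "rejected"
    # than the frozen reference model does.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)), split by sign so exp() never overflows.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# When policy and reference agree, the loss is log(2); it drops as the
# policy shifts probability toward the chosen response.
print(dpo_pair_loss(-10.0, -10.0, -10.0, -10.0))
print(dpo_pair_loss(-8.0, -12.0, -10.0, -10.0))
```

In a real trainer the log-probabilities would be tensors summed over response tokens and the loss would be averaged over a batch; the scalar form here only shows the shape of the objective.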
The empirical observation: self-play with a verifier (RLVR-style) doesn’t saturate the way classical RL does. R1-Zero’s accuracy curves keep climbing over thousands of RL steps. This is unusual; most RL runs plateau much earlier.
Curriculum learning
Definition: order training prompts from easy to hard so the policy gets dense signal early and challenging signal later.
Why it matters in RL: if every prompt is too hard (reward = 0 always) or too easy (reward = 1 always), the advantage signal is dead. Group-relative methods (GRPO) particularly need a mix — within a group, you want some successes and some failures.
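A small worked example makes the dead-signal point concrete. The sketch below assumes the commonly described GRPO advantage, where each reward in a group is centered on the group mean and divided by the group standard deviation; the use of population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Too hard: every rollout fails, so every advantage is zero (no gradient signal).
print(group_advantages([0, 0, 0, 0]))
# Too easy: every rollout passes, same outcome.
print(group_advantages([1, 1, 1, 1]))
# Mixed group: successes get positive advantage, failures negative.
print(group_advantages([1, 0, 0, 1]))
```

Only the mixed group produces non-zero advantages, which is why a curriculum that keeps prompts near the policy's frontier keeps the update informative.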
Common curriculum recipes:
- Difficulty bucketing: pre-rate problems by difficulty (e.g., AMC math levels, Codeforces ratings). Train on the easiest 30% for the first N steps, the easiest 60% for the next N, and so on.
- Adaptive curriculum: track per-problem success rate during training. Up-weight problems where the policy is on the boundary (success rate near 0.5).
- Self-paced: the model writes a difficulty rating for each problem; problems near the policy’s frontier get oversampled.
Tülu 3 documents a curriculum recipe. OpenMathInstruct-2 uses difficulty bucketing. Most strong open recipes do some curriculum, often informally.
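The adaptive recipe needs a running per-problem success rate. One way to keep it, sketched here under assumptions, is an exponential moving average updated after each group of rollouts. `SuccessTracker`, `sample_batch`, the decay of 0.9, and the prior of 0.5 are all hypothetical names and values, not taken from any of the recipes cited above.

```python
import random
from collections import defaultdict

class SuccessTracker:
    """Tracks an exponential moving average (EMA) of each prompt's pass rate."""

    def __init__(self, decay=0.9, prior=0.5):
        self.decay = decay
        # Unseen prompts start at the prior, so they are treated as frontier problems.
        self.rate = defaultdict(lambda: prior)

    def update(self, prompt_id, rewards):
        """Fold the pass rate of one group of 0/1 rewards into the EMA."""
        group_rate = sum(rewards) / len(rewards)
        self.rate[prompt_id] = (self.decay * self.rate[prompt_id]
                                + (1 - self.decay) * group_rate)

    def weight(self, prompt_id):
        """Sampling weight: 1.0 at a 50% pass rate, 0.5 at 0% or 100%."""
        return 1.0 - abs(0.5 - self.rate[prompt_id])

def sample_batch(tracker, prompt_ids, batch_size, rng=random):
    """Draw prompts with replacement, favoring those near the frontier."""
    weights = [tracker.weight(p) for p in prompt_ids]
    return rng.choices(prompt_ids, weights=weights, k=batch_size)

tracker = SuccessTracker()
tracker.update("two-sum", [1, 1, 1, 1])  # solved by every rollout in the group
print(tracker.rate["two-sum"], tracker.weight("two-sum"))
print(sample_batch(tracker, ["two-sum", "unseen-problem"], batch_size=4))
```

With this weighting, solved and unsolved problems are still sampled at half the rate of frontier problems; a steeper weighting function would concentrate the batch more aggressively.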
RL for code (the cleanest case study)
Code is the perfect domain for RL on LLMs:
- Verifiable: unit tests give clean 0/1 reward.
- Composable: tasks decompose into subproblems naturally.
- High commercial value: paying users provide a feedback loop.
- Multi-step natural: agentic with debugger, REPL, file system.
The models behind the 2024-2026 wave of coding agents (Devin, Claude Code, Cursor agent, Codex, Aider) are widely understood to be trained substantially with RL. The recipes are not fully public, but they share a common structure:
- Cold-start SFT on human-written code (often from GitHub).
- RLVR with unit-test rewards on competitive programming, LeetCode, real-world bugs.
- Multi-turn agentic RL: model runs code, sees output, debugs, repeats.
- Self-play on coding tasks: the model generates problems and solves them; the weaker solutions become rejected examples for training.
Public open recipes worth reading:
- OpenCodeReasoning (2025): full open recipe for code-RL.
- DeepSeek-Coder training paper.
- AceCoder / Code-R1 papers.
Engineering specifics:
- Code sandbox is the binding throughput constraint. Modal, E2B, Firejail are the common production choices.
- Test isolation matters (the model’s code must not be able to reach the internet during evaluation).
- Time budgets per test are critical (timeouts dominate failure modes).
- Multi-turn context grows fast (terminal output, file contents) — RadixAttention helps.
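To make the reward and time-budget points concrete, here is a minimal sketch of a unit-test reward with a per-run timeout, using only the Python standard library. It is not a sandbox: a plain subprocess provides no network or filesystem isolation, so a production setup would run the same logic inside one of the isolation layers listed above. The function name `unit_test_reward` and the 10-second default are assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def unit_test_reward(solution_code: str, test_code: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 if the tests run to completion against the solution, else 0.0."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "run_tests.py"
        # Solution and assert-style tests share one file for simplicity.
        script.write_text(solution_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode, ignores user site and env vars
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Over budget counts as a failure, like any other failed test.
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0

tests = "assert add(2, 3) == 5"
print(unit_test_reward("def add(a, b):\n    return a + b", tests))
print(unit_test_reward("def add(a, b):\n    return a - b", tests))
```

Treating a timeout as an ordinary failure keeps the reward binary; logging timeouts separately is still useful, since they tend to dominate the failure counts.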
How the three patterns intersect
A 2026 production reasoning-and-code recipe (composite, simplified):
    Step 1: Cold-start SFT on curated long-CoT code reasoning
        Curriculum: easy problems first, hard later
    Step 2: GRPO + RLVR on math + code
        Curriculum: difficulty-bucketed
        Self-play: sample G rollouts per prompt, group-normalize
        Verifier: math parser + code sandbox
    Step 3: Rejection-sampling SFT (self-play)
        Take best rollouts from Step 2 policy
        Fine-tune on these
        Mixed with safety + general data
    Step 4: Final GRPO with blended rewards
        Same curriculum
        Self-play maintained via group rollouts

Self-play in rejection-sampling SFT, curriculum in problem ordering, RLVR via the verifier. Three patterns, one recipe.
Mental model
The three loops: curriculum supplies prompts; self-play generates rollouts; RLVR scores them; gradient improves the policy; better policy generates better rollouts.
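The loop can be sketched as one function with the model-specific parts injected as callables. Everything here is a hypothetical skeleton: `rl_iteration`, `sample_fn`, `verify_fn`, and `update_fn` are placeholder names, and a real trainer would batch generation and compute advantages inside the update.

```python
import random

def rl_iteration(prompts, weights, sample_fn, verify_fn, update_fn,
                 batch_size=4, group_size=8, rng=random):
    """One pass of the loop: curriculum -> rollouts -> verifier -> policy update."""
    # Curriculum: draw prompts according to their current sampling weights.
    batch = rng.choices(prompts, weights=weights, k=batch_size)
    scored = []
    for prompt in batch:
        # Self-play: the current policy produces a group of rollouts.
        completions = [sample_fn(prompt) for _ in range(group_size)]
        # RLVR: a verifier assigns each rollout a reward.
        rewards = [verify_fn(prompt, c) for c in completions]
        scored.append((prompt, completions, rewards))
    # Gradient step (or SFT on the best rollouts) happens inside update_fn.
    update_fn(scored)
    return scored

# Stub components, just to show the data flow.
history = []
rl_iteration(
    prompts=["easy", "hard"],
    weights=[0.8, 0.2],
    sample_fn=lambda prompt: prompt + " -> answer",
    verify_fn=lambda prompt, completion: 1.0 if prompt == "easy" else 0.0,
    update_fn=history.append,
    batch_size=2,
    group_size=3,
)
print(len(history[0]))
```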
Key takeaways
- Self-play with a verifier doesn’t saturate — accuracy keeps climbing. The R1-Zero result.
- Curriculum keeps GRPO signal alive — too-hard prompts give 0 advantage everywhere.
- RL for code is the cleanest applied success. Verifiable, composable, commercially valuable.
- Modern recipes intersect all three: curriculum-scheduled, self-play-generating, RLVR-trained.
- The 2026 coding-agent stack (Devin, Claude Code, Cursor) is the canonical case study. Read OpenCodeReasoning for an open replication.
Go deeper
- Paper: Silver et al., AlphaZero. Classical self-play. Foundation paper for the concept.
- Paper: Chen et al., SPIN: Self-Play Fine-Tuning. DPO-flavored self-play for LLMs. Clean recipe.
- Paper: DeepSeek-R1 (rejection-sampling SFT). Phase 3 of R1 is self-play via best-rollout SFT. Required.
- Paper: OpenCodeReasoning. Open reasoning-code RL recipe with full data + hyperparameters. 2025 reference.
- Paper: DeepSeek-Coder. DeepSeek's code-training recipe. Strong open reference.
- Paper: Soviany et al., Curriculum Learning: A Survey. Comprehensive curriculum survey. Pre-LLM but lessons apply.
- Paper: Tülu 3 (curriculum + RLVR). Open recipe with curriculum scheduling.
- Paper: AceCoder. Code-RL with execution feedback.
- Paper: Code-R1. GRPO on code with verifier; one of the strongest open code-RL recipes in 2025.
- Blog: Cognition, Introducing Devin. The Devin announcement; less technical detail but architecture-suggestive.
TL;DR
- Self-play with a verifier doesn’t saturate.
- Curriculum keeps GRPO advantage signal alive.
- Code is the cleanest RL-LLM domain (verifier, composability, commercial value).
- Modern recipes intersect all three.
Why this matters
This is the dominant 2026 recipe structure, and code-agent training is its best-known applied product.
Concrete walkthrough
Self-play loop (rejection-sampling SFT style):
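    # Placeholders: sft_base, dataset, verifier, sft, N_ITER and G come from your training stack.
    policy = sft_base
    for iteration in range(N_ITER):
        rollouts = []
        for prompt in dataset:
            # Sample G candidate completions for this prompt.
            completions = policy.sample(prompt, n=G)
            scores = [verifier(c) for c in completions]
            # Keep only the highest-scoring completion.
            best, _ = max(zip(completions, scores), key=lambda pair: pair[1])
            rollouts.append((prompt, best))
        policy = sft(policy, rollouts)  # fine-tune on best rollouts

Curriculum schedule (difficulty bucketing):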
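    import random

    buckets = sort_by_difficulty(prompts)  # easiest first; placeholder helper

    def get_batch(step):
        # Start with the easiest 30% and widen linearly; the full set is available from step 7000.
        bucket_cutoff = min(0.3 + step / 10000, 1.0)
        available = buckets[:int(len(buckets) * bucket_cutoff)]
        return random.sample(available, batch_size)  # needs batch_size <= len(available)

Adaptive curriculum: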
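    # ema_successes maps prompt_id -> running average pass rate (placeholder).
    success_rate = {prompt_id: ema_successes[prompt_id] for prompt_id in dataset}
    # Weight peaks at 1.0 when the success rate is 0.5 and falls to 0.5 at 0 or 1.
    boundary = {prompt_id: 1 - abs(0.5 - success_rate[prompt_id]) for prompt_id in dataset}
    batch = sample_weighted(dataset, weights=boundary)

Code-RL specifics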
| Concern | Solution |
|---|---|
| Sandbox throughput | Modal / E2B / Firejail pool |
| Test isolation | No network, no FS write outside /tmp |
| Time budget | 10-30s per test |
| Multi-turn context | RadixAttention prefix cache |
| Reward shaping | Test-pass count, partial credit possible |
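The reward-shaping row can be illustrated with a tiny helper that turns per-test outcomes into either a strict 0/1 reward or a pass-fraction reward. This is a sketch; `pass_rate_reward` and its `partial_credit` flag are hypothetical names.

```python
def pass_rate_reward(results, partial_credit=True):
    """Map per-test pass/fail booleans to a scalar reward in [0, 1]."""
    if not results:
        return 0.0  # no tests ran, so there is nothing to reward
    passed = sum(1 for ok in results if ok)
    if partial_credit:
        return passed / len(results)
    # Strict mode: reward only when every test passes.
    return 1.0 if passed == len(results) else 0.0

outcomes = [True, True, False, True]
print(pass_rate_reward(outcomes))                        # 0.75
print(pass_rate_reward(outcomes, partial_credit=False))  # 0.0
```

Partial credit gives denser signal on hard problems, at the cost of rewarding solutions that pass the easy tests while failing the ones that matter; which trade-off is right depends on the test suite.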
Key takeaways
- Self-play + verifier = no saturation (R1-Zero).
- Curriculum keeps signal alive in GRPO.
- Code is the canonical RLVR domain.
- Modern recipes combine all three.
Go deeper
- Paper: DeepSeek-R1
- Paper: OpenCodeReasoning
- Paper: SPIN