
LR Schedules & WSD

In a managed-runtime training framework, the learning rate is just a number: you set it once and the trainer handles the rest. PyTorch makes it richer: an LRScheduler object whose .step() you call after every optimizer step to set the LR for the next one. That schedule is one of the cheapest, highest-leverage choices in the whole training stack: it moves final loss by a few percent without changing anything else, and the wrong choice can quietly burn 10–30% of your compute budget.
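To make the mechanics concrete, here is a minimal sketch of that loop. The toy model, dummy loss, and the bare 2000-step warmup lambda are illustrative stand-ins, not taken from any of the runs discussed below:

import torch

model = torch.nn.Linear(512, 512)                      # toy stand-in for a real network
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)   # lr here acts as the peak LR
# LambdaLR multiplies the base lr by whatever the callable returns for the current step;
# this lambda is just a 2000-step linear warmup, to show the plumbing.
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda step: min(1.0, step / 2000))

for step in range(10_000):
    opt.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()    # dummy loss
    loss.backward()
    opt.step()
    sched.step()    # advance the schedule after every optimizer step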

For five years the answer was “cosine decay” — start at zero, ramp up over a short warmup, then follow a half-cosine curve down to near-zero over the planned step count. GPT-3 used it. Llama 2 used it. The problem: cosine demands a token budget up front. Want to train 20% longer because loss is still dropping? Restart the schedule or hack it; either way you’re wasting compute.

Warmup-Stable-Decay (WSD) broke that constraint in 2024. Train at peak LR for as long as you want; trigger a short cooldown when you decide to stop. DeepSeek-V3 ran 14.3 trillion tokens at constant LR before deciding it was time to settle. MiniCPM, Qwen 2.5, and most 2025 frontier runs use the same pattern. This lesson covers the three schedules you'll actually see and how to pick between them.

TL;DR

  • Cosine decay was the default for GPT-3 through Llama 2: the LR follows a half-cosine from peak to ~0 over a fixed number of steps. Works well but requires knowing total training steps upfront.
  • Warmup-Stable-Decay (WSD) splits training into three phases: a short linear warmup (0.5–5% of steps), a long stable plateau at peak LR (80–90%), and a final cooldown (5–15%). MiniCPM, DeepSeek-V3, and Qwen 2.5 all use variants of WSD.
  • WSD’s killer advantage: you don’t need to decide total compute budget before training starts. You can train indefinitely at the plateau, then cool down whenever you want a strong checkpoint.
  • The cooldown phase does most of the final loss improvement — it suppresses gradient noise and lets the model settle into a sharper minimum. Skipping it costs 0.5–2% quality.

Why this matters

Cosine decay has a hidden cost: you commit to a step count before training begins. If you want to train 20% longer (because loss is still dropping), you either restart with a new schedule or hack an extension — both waste compute. MiniCPM showed in 2024 that WSD eliminates this problem entirely. DeepSeek-V3 (14.8T tokens, 2048 H800s) ran its main phase at constant LR and only decayed at the very end, after deciding exactly when to stop based on live loss curves.

For practitioners: choosing the wrong schedule — or the wrong warmup duration — can waste 10–30% of your compute budget before you even notice. This lesson teaches you how to pick.

Mental model

Cosine continuously drops the LR, which means early-to-mid training is already at a reduced rate — potentially under-exploring. WSD keeps the full learning rate during the exploration phase and only cools down when you’re ready to commit.

Concrete walkthrough

The three schedules you’ll actually see

1. Cosine decay (GPT-3, Llama 1/2, Mistral 7B)

import math

def cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    # Linear warmup from 0 to peak_lr over the first warmup_steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Half-cosine from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

Pros: smooth, well-studied, one less hyperparameter (no “when to decay” decision). Cons: must know total_steps at init. Extending training means restarting. Mid-training checkpoints are suboptimal because the LR is already partially decayed.
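To see the mid-training-checkpoint problem concretely, here is the LR at 80% of a toy 100k-step run, using cosine_lr from above (the step counts and peak LR are illustrative):

# Cosine at 80% of a toy 100k-step run: the checkpoint LR is already ~10% of peak.
peak = 3e-4
lr_80 = cosine_lr(step=80_000, total_steps=100_000, warmup_steps=2_000, peak_lr=peak)
print(f"{lr_80:.2e} ({lr_80 / peak:.0%} of peak)")   # ~2.98e-05 (10% of peak)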

2. WSD — Warmup-Stable-Decay (MiniCPM, DeepSeek-V3, Qwen 2.5)

import math

def wsd_lr(step, warmup_steps, stable_steps, decay_steps, peak_lr, min_lr=0.0):
    # Phase 1: linear warmup from 0 to peak_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Phase 2: stable plateau at peak_lr for as long as you keep training.
    elif step < warmup_steps + stable_steps:
        return peak_lr
    # Phase 3: cosine cooldown from peak_lr to min_lr over decay_steps.
    else:
        decay_progress = (step - warmup_steps - stable_steps) / decay_steps
        decay_progress = min(decay_progress, 1.0)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * decay_progress))

The stable phase is the key innovation. You train at peak LR until you’re satisfied with convergence, then trigger decay for the final push. DeepSeek-V3 ran roughly 14.3T of its 14.8T tokens at the stable LR, then decayed over the final ~500B.

3. Linear decay (Llama 3, simple baseline)

Warmup → linear ramp down to min_lr. Simple. Often 90% as good as cosine.
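For comparison, a sketch in the same style as the two functions above. The name linear_lr and its signature are chosen here for illustration; they are not taken from Llama 3's code:

def linear_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    # Linear warmup from 0 to peak_lr, then a straight-line ramp down to min_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr - (peak_lr - min_lr) * min(progress, 1.0)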

What the numbers look like in practice

Model             | Schedule | Warmup      | Stable/Decay             | Peak LR | Tokens
GPT-3 175B        | Cosine   | 375M tokens | Full cosine              | 6e-5    | 300B
Llama 2 70B       | Cosine   | 2000 steps  | Full cosine              | 1.5e-4  | 2T
MiniCPM 2.4B (µP) | WSD      | 2000 steps  | 90% stable / 10% decay   | 1e-2 *  | 1T
DeepSeek-V3 671B  | WSD      | 2000 steps  | ~97% stable / ~3% decay  | 2.2e-4  | 14.8T
Qwen 2.5 72B      | WSD      | 2000 steps  | ~85% stable / ~15% decay | 1.5e-4  | 18T

Notice: DeepSeek-V3 ran at peak LR for 14.3T tokens and only decayed over the final 500B. That’s the WSD philosophy — explore fully, then settle.

* MiniCPM’s “1e-2” peak LR is a µP-parameterization artifact, not a number you’d transfer to a non-µP run. MiniCPM uses Maximal Update Parameterization (Yang et al.), which scales each layer’s effective LR by 1/width — so the raw config knob is 1e-2, but the effective LR per parameter is closer to ~1e-3 (the same order as GPT-3 / Llama). If you copy peak_lr=1e-2 into a vanilla AdamW run, you’ll diverge immediately.
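As a rough illustration of that 1/width scaling (the base and model widths below are placeholders chosen for the arithmetic, not MiniCPM's actual config values):

# Illustration only: hypothetical widths, not MiniCPM's real config.
base_width, model_width = 256, 2304
raw_peak_lr = 1e-2                                     # the µP config knob
effective_lr = raw_peak_lr * base_width / model_width  # µP scales hidden-layer LR by ~1/width
print(f"{effective_lr:.1e}")                           # ~1.1e-03, the same order as GPT-3 / Llama peaks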

Warmup: how much is enough?

Too little warmup → loss spikes, training diverges early. Too much warmup → wasted compute at low LR.

The rule of thumb from Llama 3 and DeepSeek-V3: 2000 steps is almost always sufficient, regardless of model size. For very large batch sizes, extend to ~4000 steps.

warmup_tokens ≈ warmup_steps × batch_size (in sequences) × seq_len ≈ 2000 × 4096 × 4096 ≈ 33B tokens (for a DeepSeek-V3-class run)

Run it in your browser

Python (editable in the original lesson): compare cosine vs WSD vs linear schedules side-by-side.
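The interactive widget isn't reproduced here; as a stand-in, here is a minimal sketch of the same comparison. It assumes cosine_lr and wsd_lr from above plus the illustrative linear_lr, and uses toy step counts and a toy peak LR:

# Uses cosine_lr, wsd_lr, and the illustrative linear_lr defined earlier in this lesson.
TOTAL, WARMUP, PEAK = 10_000, 2_000, 3e-4   # toy run: 10k steps, 2k warmup
STABLE, DECAY = 7_000, 1_000                # WSD split: 2k + 7k + 1k = 10k steps

for frac in (0.05, 0.25, 0.50, 0.80, 0.95, 1.00):
    step = int(frac * TOTAL)
    print(f"step {step:>6} | "
          f"cosine {cosine_lr(step, TOTAL, WARMUP, PEAK):.2e} | "
          f"wsd {wsd_lr(step, WARMUP, STABLE, DECAY, PEAK):.2e} | "
          f"linear {linear_lr(step, TOTAL, WARMUP, PEAK):.2e}")

At 80% of this toy run, cosine has already dropped to roughly 15% of peak while WSD is still sitting at 3e-4, which is exactly the contrast the Quick check below asks about.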

Quick check

You're pretraining a 13B model and realize at 80% of planned compute that loss is still dropping steeply. You used cosine decay. What's the problem?

Key takeaways

  1. WSD is the new default for frontier pretraining. DeepSeek-V3, MiniCPM, and Qwen 2.5 all use it. Cosine decay is legacy for new runs.
  2. The stable phase is the key innovation. It lets you train indefinitely without committing to a step count. Decay when you decide, not when the schedule decides.
  3. Warmup = 2000 steps is almost universally sufficient. Don’t overthink it.
  4. The cooldown matters. Even a short cosine cooldown (5–15% of training) drops loss measurably. Never skip it.
  5. Linear decay is a strong baseline. If you can’t decide between cosine and WSD, linear is 90% as good with zero complexity.

Go deeper