AdamW → Lion → Muon
In a managed-runtime framework, “the optimizer” is one parameter on the trainer. optimizer="adam", ship it. PyTorch hides it the same way: optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4) looks like a single object. What it actually is: two FP32 tensors per model parameter (m and v, the running first and second moments of the gradient), plus an FP32 master copy of the weights, plus a fused CUDA kernel that runs every training step.
For a 70B model, those m, v buffers are 560 GB if you keep them in FP32 — twice the size of the model. Optimizer state is what FSDP and ZeRO-1 shard across data-parallel ranks. It dominates the size of every intermediate checkpoint. It dictates whether you can fit your run on the GPUs you have.
Each named optimizer in this lesson — SGD-momentum, AdamW, Lion, Muon — is a different choice about what state to keep per parameter. The systems tradeoff is real because that state lives on every GPU and travels in every checkpoint. The math tradeoff is also real, but it’s the systems consequence that pushed Lion and Muon out of research and into frontier production runs in 2024–2025.
TL;DR
- SGD-momentum uses one running average of gradients. Cheap (1× param state) but slow on ill-conditioned losses.
- Adam / AdamW add a per-parameter scaling via second-moment estimates. Costs 2× param state in optimizer memory but converges robustly. The default for ~all LLM pretraining 2018–2024.
- Lion (Chen et al., 2023): one running average, sign-of-update only. Cuts optimizer state to 1× params, often matches AdamW. Used in PaLM follow-ups, occasionally in Llama-class runs.
- Muon (Jordan et al., 2024): orthogonalize updates via Newton-Schulz iteration. Frontier for hidden-layer parameters in 2024–2025; Llama-4 reportedly uses Muon-flavored optimizers in 2025.
- Sophia (Hessian-aware) was a 2023 candidate; mostly displaced by Lion/Muon in 2025 due to compute cost.
Mental model — what each adds over plain SGD
Each step trades cost for capability or strips away a redundant component.
SGD with momentum
m_t = β · m_{t-1} + g_t
x_t = x_{t-1} - η · m_t
State per parameter: 1 buffer (m). Cheap. Hyperparameters: β (momentum) and η (learning rate).
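The two lines above, as a runnable sketch (an illustrative numpy helper, not torch.optim.SGD):

```python
import numpy as np

def sgd_momentum_step(x, g, m, lr=0.01, beta=0.9):
    """One SGD-momentum step; m is the single per-parameter state buffer."""
    m = beta * m + g   # running average of gradients
    x = x - lr * m     # step along the smoothed direction
    return x, m

x, m = np.zeros(2), np.zeros(2)
x, m = sgd_momentum_step(x, np.ones(2), m)   # m = 1.0, x = -0.01
x, m = sgd_momentum_step(x, np.ones(2), m)   # m = 1.9, x = -0.029
```

Note how the second step moves almost twice as far: with a constant gradient, the momentum buffer accumulates toward g / (1 − β).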
AdamW (the workhorse)
m_t = β1 · m_{t-1} + (1 - β1) · g_t
v_t = β2 · v_{t-1} + (1 - β2) · g_t²
m̂_t = m_t / (1 - β1^t) # bias correction
v̂_t = v_t / (1 - β2^t)
x_t = x_{t-1} - η · ( m̂_t / (√v̂_t + ε) + λ · x_{t-1} ) # decoupled WD
State per parameter: 2 buffers (m, v). Standard mixed-precision Adam keeps FP32 master weights (4 bytes) + FP32 m (4 bytes) + FP32 v (4 bytes) = 12 bytes / parameter of optimizer-side state, on top of the BF16 model weights themselves.
For 70B params, that’s ~840 GB of optimizer state if master weights, m, and v all stay in FP32. Many production runs keep m and v in BF16 (FP32 master weights remain), giving 4 + 2 + 2 = 8 bytes / parameter ≈ ~560 GB. ZeRO-1 / FSDP shard this across data-parallel ranks.
Why decoupled weight decay (the W in AdamW)? Vanilla Adam adds weight decay by folding λ · x_{t-1} into the gradient, where it gets divided by √v̂_t + ε — making the effective decay strength parameter-specific. AdamW applies decay directly to the weights, decoupled from the gradient scaling. Result: weight decay actually does what you think it does.
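The full update, as a minimal numpy sketch (an illustrative helper, not the fused torch.optim.AdamW kernel). The decay term sits outside the √v̂ scaling, which is the whole point of the W:

```python
import numpy as np

def adamw_step(x, g, m, v, t, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    """One AdamW step on a numpy array. t is the 1-indexed step count."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)   # bias correction
    v_hat = v / (1 - b2**t)
    # decoupled: wd * x is added to the update directly, never divided by √v̂
    x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * x)
    return x, m, v

# with zero gradient, the update is a pure multiplicative shrink by (1 - lr·wd)
x, m, v = adamw_step(np.ones(3), np.zeros(3), np.zeros(3), np.zeros(3), 1)
```

That zero-gradient case is the cleanest way to see decoupling: no matter how the second moments are scaled, the weights shrink by exactly lr · wd per step.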
Lion (the surprising minimalist)
c_t = β1 · m_{t-1} + (1 - β1) · g_t # interpolated direction
x_t = x_{t-1} - η · ( sign(c_t) + λ · x_{t-1} )
m_t = β2 · m_{t-1} + (1 - β2) · g_t # momentum update uses β2, distinct from the β1 used for the direction
State per parameter: 1 buffer (m). Half the optimizer memory of AdamW.
The radical move is sign() — every update component is ±1, scaled by the learning rate. No second moments. No per-parameter scaling. Discovered via symbolic search by Google Brain (Chen et al., 2023). Empirically matches AdamW on many language tasks; needs a smaller learning rate (typically 1/3 to 1/10 of AdamW’s).
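The three update lines, as a numpy sketch (an illustrative helper; lucidrains/lion-pytorch is the drop-in PyTorch version):

```python
import numpy as np

def lion_step(x, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.1):
    """One Lion step: single state buffer m, sign-of-update direction."""
    c = b1 * m + (1 - b1) * g              # interpolated direction (uses b1)
    x = x - lr * (np.sign(c) + wd * x)     # every component moves by ±lr, plus decay
    m = b2 * m + (1 - b2) * g              # stored momentum update (uses b2)
    return x, m

# each coordinate moves by exactly lr regardless of gradient magnitude
x, m = lion_step(np.zeros(3), np.ones(3), np.zeros(3))
```

The sign() means the update magnitude is fixed by the learning rate alone — which is also the intuition for why Lion wants a smaller learning rate than AdamW.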
Muon (the 2024 frontier)
Muon orthogonalizes the momentum-buffered update via a few Newton-Schulz iterations for hidden 2D weight matrices only (embedding and output layers stay AdamW). The intuition: gradient updates for matmul weights have a meaningful “preferred direction”; orthogonalizing the update preserves expressiveness and acts as a kind of preconditioner.
def muon_step(W, m, g, lr, beta=0.95, ns_steps=5):
    """Muon: momentum buffer + Newton-Schulz orthogonalization (Keller Jordan, 2024)."""
    # 1) update momentum buffer in-place with the new gradient
    m.mul_(beta).add_(g)
    # 2) start NS from the normalized momentum-buffered gradient
    X = m / (m.norm() + 1e-7)
    if X.shape[0] > X.shape[1]:  # work on the smaller side
        X = X.T
    # 3) quintic Newton-Schulz iteration. Coefficients (a, b, c) tuned so the
    #    iterate approaches X ≈ U Vᵀ — the orthogonal polar factor of m.
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(ns_steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if m.shape[0] > m.shape[1]:  # undo the transpose from step 2
        X = X.T
    W.add_(X, alpha=-lr)
    return W
The (3.4445, −4.7750, 2.0315) coefficients aren’t free hyperparameters to retune — they were fitted (Bernstein–Newhouse 2024) so that five iterations drive the singular values of the normalized input as close to 1 as possible, i.e. close to the orthogonal polar factor of m. Muon’s whole win lives in this iteration; copying the wrong coefficients silently gives you an inferior optimizer.
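A quick way to see what the iteration does: because the map is an odd polynomial in X, it acts independently on each singular value, pushing all of them toward 1. This numpy re-check is illustrative only (the production path runs the same loop in BF16 on GPU):

```python
import numpy as np

a, b, c = 3.4445, -4.7750, 2.0315  # the Muon coefficients, verbatim

def newton_schulz(X, steps=5):
    """Quintic Newton-Schulz on a rows <= cols matrix: A = X Xᵀ is the
    smaller Gram matrix, so no transpose dance is needed here."""
    X = X / (np.linalg.norm(X) + 1e-7)  # Frobenius normalization, as in muon_step
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

# a 4×8 test matrix with known, badly spread singular values [1, 0.5, 0.1, 0.05]
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((8, 8)))
G = U @ np.diag([1.0, 0.5, 0.1, 0.05]) @ V[:4]

s = np.linalg.svd(newton_schulz(G), compute_uv=False)
# after 5 steps, all four singular values land near 1 (roughly within [0.7, 1.2])
```

A 20:1 spread in singular values collapses to near-isotropy in five matmul-only steps — no SVD ever computed. That is the “cheap preconditioner” framing in practice.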
State: 1 buffer (m), like Lion. Compute: a few extra matmuls per step (the NS iterations) — meaningful but a small fraction of the total.
Empirically: ~30% faster convergence per token on small/medium models, and the gap holds at scale in the public reports through 2024. Used in production by labs through 2025; see Llama-4 family release notes for hybrid AdamW-Muon configs.
Real-world picks (April 2026)
| Use case | Optimizer | Why |
|---|---|---|
| Pretraining, big LLM | AdamW (still the safest default) | Production-validated, no surprises |
| Pretraining, frontier labs experimenting | Muon for hidden, AdamW for embeddings/heads | 30% faster convergence on a curve that compounds |
| Resource-constrained pretraining | Lion | Half the optimizer memory, often equal quality |
| LoRA / fine-tuning | AdamW with paged optimizers (bitsandbytes) | Adapters are tiny anyway; reuse what works |
| RL post-training (PPO, GRPO) | AdamW | RL stability is fragile; don’t change two things at once |
Optimizer memory cost
Notice AdamW’s state is ~50% larger than Lion/Muon’s — that’s the practical motivation for switching, on top of any quality story.
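The same accounting in a few lines of Python (layout labels are mine; byte counts follow the breakdown in the AdamW section, with a shared FP32 master copy in every layout):

```python
# Optimizer-state bytes per parameter: FP32 master weights (4 B) plus
# the state buffers each optimizer keeps, in FP32 (4 B) or BF16 (2 B).
PARAMS = 70e9

layouts = {
    "AdamW (all FP32)":     4 + 4 * 2,  # master + m + v
    "AdamW (BF16 m, v)":    4 + 2 * 2,  # master + bf16 m + bf16 v
    "Lion / Muon (FP32 m)": 4 + 4 * 1,  # master + single momentum buffer
    "Lion / Muon (BF16 m)": 4 + 2 * 1,
}

for name, bytes_per_param in layouts.items():
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name:24s} {bytes_per_param:2d} B/param  ≈ {gb:4.0f} GB at 70B")
```

At 70B parameters this prints 840 / 560 / 560 / 420 GB — the 12-vs-8-byte gap between AdamW and the single-buffer optimizers is exactly the ~50% figure above.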
Key takeaways
- AdamW remains the safe default for pretraining. Production-validated, well-tooled.
- Lion and Muon halve the optimizer state. Real money for big runs; rough quality parity (Lion) or improvement (Muon) on language modeling.
- Muon orthogonalizes hidden-layer updates. Acts as a cheap preconditioner. Frontier choice in 2024–2025.
- Decoupled weight decay (the W in AdamW) is not optional. Plain Adam’s coupled WD has parameter-specific effective strength.
- Hybrid configurations are common — Muon for big matmul weights, AdamW for embeddings and heads where the math is different.
Go deeper
- Paper: Decoupled Weight Decay Regularization (AdamW). The W. Short, foundational.
- Paper: Symbolic Discovery of Optimization Algorithms (Lion). The Lion paper. Symbolic search found this; humans didn't design it.
- Paper: Muon: An Optimizer for Hidden Layers in Neural Networks. The Muon paper. Newton-Schulz orthogonalization plus momentum.
- Paper: Sophia: A Scalable Stochastic Second-order Optimizer. Hessian-aware. Mostly displaced by 2025, but the framing is worth the read.
- Video: Sebastian Raschka — Optimizers for Deep Learning. Deep visual walkthrough of SGD → Adam → AdamW.
- Blog: Keller Jordan — Muon explained. Author-written; the best non-paper Muon explainer.
- Repo: lucidrains/lion-pytorch. Drop-in Lion implementation in PyTorch. ~80 lines.
- Repo: KellerJordan/Muon. Reference Muon implementation.
- Blog: Sebastian Raschka — current optimizer landscape (2025). Most recent literature digest covering 2024–2025 results.
Why this matters
For a 70B AdamW run, the m and v buffers alone are ~560 GB in FP32 — twice the size of the FP32 weights. That’s why FSDP and ZeRO matter, and why the lower-state Lion/Muon variants are economically meaningful. Optimizer state often dominates checkpoint size, and it is exactly what ZeRO-1 / FSDP partition across data-parallel ranks every step.
Also: AdamW is not always the best choice. Lion often equals or beats AdamW on language modeling with half the optimizer memory (at a suitably reduced learning rate). The community switched from “AdamW always” to “test the new ones on your run” sometime in 2024.