
AdamW → Lion → Muon

In a managed-runtime framework, “the optimizer” is one parameter on the trainer. optimizer="adam", ship it. PyTorch hides it the same way: optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4) looks like a single object. What it actually is: two FP32 buffers the size of the model (m and v, the running first and second moments of the gradient), plus (in mixed-precision training) an FP32 master copy of the weights, plus a fused CUDA kernel that runs every training step.
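You can see that hidden state directly from plain PyTorch. A minimal sketch (the layer size is arbitrary; exp_avg and exp_avg_sq are the names torch.optim uses for m and v):

import torch

model = torch.nn.Linear(4096, 4096, bias=False)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

model(torch.randn(8, 4096)).sum().backward()
opt.step()

p = next(model.parameters())
print(opt.state[p].keys())            # includes 'exp_avg' (m) and 'exp_avg_sq' (v)
print(opt.state[p]["exp_avg"].shape)  # same shape as p: one FP32 value per parameter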

For a 70B model, those m, v buffers are 560 GB if you keep them in FP32 — twice the size of the FP32 weights, four times the BF16 copy you actually train in. Optimizer state is what FSDP and ZeRO-1 shard across data-parallel ranks. It dominates the size of every intermediate checkpoint. It dictates whether you can fit your run on the GPUs you have.

Each named optimizer in this lesson — SGD-momentum, AdamW, Lion, Muon — is a different choice about what state to keep per parameter. The systems tradeoff is real because that state lives on every GPU and travels with every checkpoint. The math tradeoff is also real, but it’s the systems consequence that pushed Lion and Muon out of research and into frontier production runs in 2024–2025.

TL;DR

  • SGD-momentum uses one running average of gradients. Cheap (1× param state) but slow on ill-conditioned losses.
  • Adam / AdamW add a per-parameter scaling via second-moment estimates. Costs 2× param state in optimizer memory but converges robustly. The default for ~all LLM pretraining 2018–2024.
  • Lion (Chen et al., 2023): one running average, sign-of-update only. Cuts optimizer state to 1× params (a single momentum buffer), often matches AdamW. Used in PaLM follow-ups, occasionally in Llama-class runs.
  • Muon (Jordan et al., 2024): orthogonalize updates via Newton-Schulz iteration. Frontier for hidden-layer parameters in 2024–2025; Llama-4 reportedly uses Muon-flavored optimizers in 2025.
  • Sophia (Hessian-aware) was a 2023 candidate; mostly displaced by Lion/Muon in 2025 due to compute cost.

Why this matters

For a 70B AdamW run, the m and v buffers alone are roughly 2× the weights at matching precision — ~560 GB in FP32, ~280 GB in BF16. That’s why FSDP + ZeRO matter, and why Lion/Muon’s lower-state variants are economically meaningful. Optimizer state often dominates checkpoint size, and it is exactly the state that ZeRO-1 / FSDP shard across data-parallel ranks.

Also: AdamW is not always the best choice. Lion, run at its smaller learning rate, often equals or beats AdamW on language modeling with half the optimizer memory. The community switched from “AdamW always” to “test the new ones on your run” sometime in 2024.

Mental model — what each adds over plain SGD

Each step trades cost for capability or strips away a redundant component.

Concrete walkthrough

SGD with momentum

m_t = β · m_{t-1} + g_t
x_t = x_{t-1} - η · m_t

State per parameter: 1 buffer (m). Cheap. Hyperparameters: η, β.
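As a sketch of that update in PyTorch tensor ops (the helper name is illustrative, not a torch.optim API):

import torch

def sgd_momentum_step(x, m, g, lr=1e-2, beta=0.9):
    # m_t = β · m_{t-1} + g_t ;  x_t = x_{t-1} - η · m_t  (in-place)
    m.mul_(beta).add_(g)
    x.add_(m, alpha=-lr)
    return x, m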

AdamW (the workhorse)

m_t = β1 · m_{t-1} + (1 - β1) · g_t
v_t = β2 · v_{t-1} + (1 - β2) · g_t²
m̂_t = m_t / (1 - β1^t)                                    # bias correction
v̂_t = v_t / (1 - β2^t)
x_t = x_{t-1} - η · ( m̂_t / (√v̂_t + ε) + λ · x_{t-1} )    # decoupled WD

State per parameter: 2 buffers (m, v). Standard mixed-precision Adam keeps FP32 master weights (4 bytes) + FP32 m (4 bytes) + FP32 v (4 bytes) = 12 bytes / parameter of optimizer-side state, on top of the BF16 model weights themselves.

For 70B params, that’s ~840 GB of optimizer state if m, v stay in FP32. Many production runs keep m, v in BF16 (FP32 master weights remain), giving 4 + 2 + 2 = 8 bytes / parameter ≈ 560 GB. ZeRO-1 / FSDP shard this across data-parallel ranks.

Why decoupled weight decay (the W in AdamW)? Vanilla Adam adds weight decay through g_t += λ · x, which then gets divided by √v_t — making the effective decay strength parameter-specific. AdamW applies decay directly to x_t, decoupled from the gradient scaling. Result: weight decay actually does what you think it does.
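The same update as a PyTorch-style sketch, written to mirror the equations rather than the fused kernel (helper name and default hyperparameters are illustrative):

import torch

def adamw_step(x, m, v, g, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.1):
    # t is the 1-indexed step count, needed for bias correction
    m.mul_(beta1).add_(g, alpha=1 - beta1)            # m_t
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)     # v_t
    m_hat = m / (1 - beta1 ** t)                      # bias correction
    v_hat = v / (1 - beta2 ** t)
    x.mul_(1 - lr * wd)                               # decoupled weight decay on x_{t-1}
    x.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)   # per-parameter scaled step
    return x, m, v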

Lion (the surprising minimalist)

c_t = β1 · m_{t-1} + (1 - β1) · g_t                # interpolated direction
x_t = x_{t-1} - η · ( sign(c_t) + λ · x_{t-1} )
m_t = β2 · m_{t-1} + (1 - β2) · g_t                # momentum update (uses β2, not β1)

State per parameter: 1 buffer (m). Half the optimizer memory of AdamW.

The radical move is sign() — every update component is ±1, scaled by the learning rate. No second moments. No per-parameter scaling. Discovered via symbolic search by Google Brain (Chen et al., 2023). Empirically matches AdamW on many language tasks; needs a smaller learning rate (typically 1/3 to 1/10 of AdamW’s).
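A matching sketch for Lion (helper name illustrative; note that the applied direction uses β1 while the stored momentum uses β2):

import torch

def lion_step(x, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    c = beta1 * m + (1 - beta1) * g          # interpolated direction (β1)
    x.mul_(1 - lr * wd)                      # decoupled weight decay
    x.add_(torch.sign(c), alpha=-lr)         # every component moves by exactly ±lr
    m.mul_(beta2).add_(g, alpha=1 - beta2)   # momentum buffer update (β2)
    return x, m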

Muon (the 2024 frontier)

Muon orthogonalizes the momentum-buffered update via a few Newton-Schulz iterations for hidden 2D weight matrices only (embedding and output layers stay AdamW). The intuition: gradient updates for matmul weights have a meaningful “preferred direction”; orthogonalizing the update preserves expressiveness and acts as a kind of preconditioner.

def muon_step(W, m, g, lr, beta=0.95, ns_steps=5):
    """Muon: momentum buffer + Newton-Schulz orthogonalization (Keller Jordan, 2024)."""
    # 1) update momentum buffer in-place with new gradient
    m.mul_(beta).add_(g)
    # 2) start NS from the normalized momentum-buffered gradient
    X = m / (m.norm() + 1e-7)
    if X.shape[0] > X.shape[1]:  # work on the smaller side
        X = X.T
    # 3) quintic Newton-Schulz iteration. Coefficients (a, b, c) chosen so the
    #    fixed point is X = U Vᵀ — the orthogonal polar factor of m.
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(ns_steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if m.shape[0] > m.shape[1]:
        X = X.T
    W.add_(X, alpha=-lr)
    return W

The (3.4445, −4.7750, 2.0315) coefficients aren’t knobs to tune — they were fit so that a handful of Newton-Schulz iterations drive the singular values of m close to 1, i.e. toward the orthogonal polar factor U Vᵀ (Bernstein–Newhouse 2024). Muon’s whole win lives in this iteration; copying the wrong coefficients silently gives you an inferior optimizer.
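A quick sanity check you can run on the iteration alone (a sketch; the random matrix stands in for a momentum-buffered gradient): after five steps the singular values should all sit near 1, meaning X approximates the orthogonal polar factor of G.

import torch

G = torch.randn(512, 1024)          # rows <= cols, so no transpose is needed
X = G / (G.norm() + 1e-7)
a, b, c = 3.4445, -4.7750, 2.0315
for _ in range(5):
    A = X @ X.T
    X = a * X + (b * A + c * (A @ A)) @ X
sv = torch.linalg.svdvals(X)
print(sv.min().item(), sv.max().item())   # both land near 1, not exactly 1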

State: 1 buffer (m), like Lion. Compute: a few extra matmuls per step (the NS iterations) — a meaningful but small fraction of the total.

Empirically: ~30% faster convergence per token on small/medium models, and the gap holds at scale in the public reports through 2024. Reportedly used in production by labs through 2025; see the Llama-4 family release notes for hybrid AdamW-Muon configs.

Real-world picks (April 2026)

Use case | Optimizer | Why
Pretraining, big LLM | AdamW (still the safest default) | Production-validated, no surprises
Pretraining, frontier labs experimenting | Muon for hidden, AdamW for embeddings/heads | 30% faster convergence on a curve that compounds
Resource-constrained pretraining | Lion | Half the optimizer memory, often equal quality
LoRA / fine-tuning | AdamW with paged optimizers (bitsandbytes) | Adapters are tiny anyway; reuse what works
RL post-training (PPO, GRPO) | AdamW | RL stability is fragile; don’t change two things at once

Run it in your browser — optimizer memory cost

Compare optimizer state across SGD, AdamW, Lion, Muon for several model sizes.
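The in-page snippet isn’t reproduced here; a rough equivalent looks like this (assumptions: FP32 master weights plus FP32 moment buffers, decimal GB, and Muon counted like Lion’s single buffer, ignoring its AdamW-handled embeddings/heads):

GB = 1e9  # decimal gigabytes, matching the figures quoted above

BUFFERS = {"sgd-momentum": 1, "adamw": 2, "lion": 1, "muon": 1}

def optimizer_state_gb(n_params, name):
    # 4 bytes of FP32 master weights + 4 bytes per FP32 moment buffer
    return n_params * (4 + 4 * BUFFERS[name]) / GB

for n_params in (1e9, 7e9, 13e9, 70e9):
    row = "  ".join(f"{name}: {optimizer_state_gb(n_params, name):6.0f} GB"
                    for name in BUFFERS)
    print(f"{n_params / 1e9:>4.0f}B   {row}")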

Notice AdamW’s optimizer-side state (FP32 master + m + v, 12 bytes/param) is ~50% larger than Lion/Muon’s (FP32 master + m, 8 bytes/param) — that’s the practical motivation for switching, on top of any quality story.

Quick check

You're constrained to 80 GB of GPU memory and want to train a 13B model. AdamW is OOMing on optimizer state alone. Which is the most practical fix that doesn't change architectures?

Key takeaways

  1. AdamW remains the safe default for pretraining. Production-validated, well-tooled.
  2. Lion and Muon halve the optimizer’s moment buffers (one buffer instead of AdamW’s two). Real money for big runs; rough quality parity (Lion) or improvement (Muon) on language modeling.
  3. Muon orthogonalizes hidden-layer updates. Acts as a cheap preconditioner. Frontier choice in 2024–2025.
  4. Decoupled weight decay (the W in AdamW) is not optional. Plain Adam’s coupled WD has parameter-specific effective strength.
  5. Hybrid configurations are common — Muon for big matmul weights, AdamW for embeddings and heads where the math is different.

Go deeper