
AdamW → Lion → Muon

In a managed-runtime framework, “the optimizer” is one parameter on the trainer. optimizer="adam", ship it. PyTorch hides it the same way: optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4) looks like a single object. What it actually is: two FP32 buffers the size of the model (m and v, the running first and second moments of the gradient), plus (in mixed-precision training) an FP32 master copy of the weights, plus a fused CUDA kernel that runs every training step.
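You can see that hidden state directly from plain PyTorch. A minimal sketch (the layer size is arbitrary; exp_avg and exp_avg_sq are the names torch.optim uses for m and v):

import torch

model = torch.nn.Linear(4096, 4096, bias=False)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

model(torch.randn(8, 4096)).sum().backward()
opt.step()

p = next(model.parameters())
print(opt.state[p].keys())            # includes 'exp_avg' (m) and 'exp_avg_sq' (v)
print(opt.state[p]["exp_avg"].shape)  # same shape as p: one FP32 value per parameter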

For a 70B model, those m, v buffers are 560 GB if you keep them in FP32 — twice the size of the FP32 weights, four times the BF16 copy you actually train in. Optimizer state is what FSDP and ZeRO-1 shard across data-parallel ranks. It dominates the size of every intermediate checkpoint. It dictates whether you can fit your run on the GPUs you have.

Each named optimizer in this lesson — SGD-momentum, AdamW, Lion, Muon — is a different choice about what state to keep per parameter. The systems tradeoff is real because that state lives on every GPU and travels with every checkpoint. The math tradeoff is also real, but it’s the systems consequence that pushed Lion and Muon out of research and into frontier production runs in 2024–2025.

TL;DR

  • SGD-momentum uses one running average of gradients. Cheap (1× param state) but slow on ill-conditioned losses.
  • Adam / AdamW add a per-parameter scaling via second-moment estimates. Costs 2× param state in optimizer memory but converges robustly. The default for ~all LLM pretraining 2018–2024.
  • Lion (Chen et al., 2023): one running average, sign-of-update only. Cuts optimizer state to 1× params (a single momentum buffer), often matches AdamW. Used in PaLM follow-ups, occasionally in Llama-class runs.
  • Muon (Jordan et al., 2024): orthogonalize updates via Newton-Schulz iteration. Frontier for hidden-layer parameters in 2024–2025; Llama-4 reportedly uses Muon-flavored optimizers in 2025.
  • Sophia (Hessian-aware) was a 2023 candidate; mostly displaced by Lion/Muon in 2025 due to compute cost.

Why this matters

For a 70B AdamW run, the m and v buffers alone are roughly 2× the weights at matching precision — ~560 GB in FP32, ~280 GB in BF16. That’s why FSDP + ZeRO matter, and why Lion/Muon’s lower-state variants are economically meaningful. Optimizer state often dominates checkpoint size, and it is exactly the state that ZeRO-1 / FSDP shard across data-parallel ranks.

Also: AdamW is not always the best choice. Lion, run at its smaller learning rate, often equals or beats AdamW on language modeling with half the optimizer memory. The community switched from “AdamW always” to “test the new ones on your run” sometime in 2024.

Mental model — what each adds over plain SGD

Each step trades cost for capability or strips away a redundant component.

Concrete walkthrough

SGD with momentum

m_t = β · m_{t-1} + g_t
x_t = x_{t-1} - η · m_t

State per parameter: 1 buffer (m). Cheap. Hyperparameters: η, β.
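As a sketch of that update in PyTorch tensor ops (the helper name is illustrative, not a torch.optim API):

import torch

def sgd_momentum_step(x, m, g, lr=1e-2, beta=0.9):
    # m_t = β · m_{t-1} + g_t ;  x_t = x_{t-1} - η · m_t  (in-place)
    m.mul_(beta).add_(g)
    x.add_(m, alpha=-lr)
    return x, m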

AdamW (the workhorse)

m_t = β1 · m_{t-1} + (1 - β1) · g_t
v_t = β2 · v_{t-1} + (1 - β2) · g_t²
m̂_t = m_t / (1 - β1^t)                                    # bias correction
v̂_t = v_t / (1 - β2^t)
x_t = x_{t-1} - η · ( m̂_t / (√v̂_t + ε) + λ · x_{t-1} )    # decoupled WD

State per parameter: 2 buffers (m, v). Standard mixed-precision Adam keeps FP32 master weights (4 bytes) + FP32 m (4 bytes) + FP32 v (4 bytes) = 12 bytes / parameter of optimizer-side state, on top of the BF16 model weights themselves.

For 70B params, that’s ~840 GB of optimizer state if m, v stay in FP32. Many production runs keep m, v in BF16 (FP32 master weights remain), giving 4 + 2 + 2 = 8 bytes / parameter ≈ 560 GB. ZeRO-1 / FSDP shard this across data-parallel ranks.

Why decoupled weight decay (the W in AdamW)? Vanilla Adam adds weight decay through g_t += λ · x, which then gets divided by √v_t — making the effective decay strength parameter-specific. AdamW applies decay directly to x_t, decoupled from the gradient scaling. Result: weight decay actually does what you think it does.
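The same update as a PyTorch-style sketch, written to mirror the equations rather than the fused kernel (helper name and default hyperparameters are illustrative):

import torch

def adamw_step(x, m, v, g, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.1):
    # t is the 1-indexed step count, needed for bias correction
    m.mul_(beta1).add_(g, alpha=1 - beta1)            # m_t
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)     # v_t
    m_hat = m / (1 - beta1 ** t)                      # bias correction
    v_hat = v / (1 - beta2 ** t)
    x.mul_(1 - lr * wd)                               # decoupled weight decay on x_{t-1}
    x.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)   # per-parameter scaled step
    return x, m, v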

Lion (the surprising minimalist)

c_t = β1 · m_{t-1} + (1 - β1) · g_t                # interpolated direction
x_t = x_{t-1} - η · ( sign(c_t) + λ · x_{t-1} )
m_t = β2 · m_{t-1} + (1 - β2) · g_t                # momentum update (uses β2, not β1)

State per parameter: 1 buffer (m). Half the optimizer memory of AdamW.

The radical move is sign() — every update component is ±1, scaled by the learning rate. No second moments. No per-parameter scaling. Discovered via symbolic search by Google Brain (Chen et al., 2023). Empirically matches AdamW on many language tasks; needs a smaller learning rate (typically 1/3 to 1/10 of AdamW’s).
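A matching sketch for Lion (helper name illustrative; note that the applied direction uses β1 while the stored momentum uses β2):

import torch

def lion_step(x, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    c = beta1 * m + (1 - beta1) * g          # interpolated direction (β1)
    x.mul_(1 - lr * wd)                      # decoupled weight decay
    x.add_(torch.sign(c), alpha=-lr)         # every component moves by exactly ±lr
    m.mul_(beta2).add_(g, alpha=1 - beta2)   # momentum buffer update (β2)
    return x, m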

Muon (the 2024 frontier)

Muon orthogonalizes the momentum-buffered update via a few Newton-Schulz iterations for hidden 2D weight matrices only (embedding and output layers stay AdamW). The intuition: gradient updates for matmul weights have a meaningful “preferred direction”; orthogonalizing the update preserves expressiveness and acts as a kind of preconditioner.

def muon_step(W, m, g, lr, beta=0.95, ns_steps=5):
    """Muon: momentum buffer + Newton-Schulz orthogonalization (Keller Jordan, 2024)."""
    # 1) update momentum buffer in-place with new gradient
    m.mul_(beta).add_(g)
    # 2) start NS from the normalized momentum-buffered gradient
    X = m / (m.norm() + 1e-7)
    if X.shape[0] > X.shape[1]:  # work on the smaller side
        X = X.T
    # 3) quintic Newton-Schulz iteration. Coefficients (a, b, c) chosen so the
    #    fixed point is X = U Vᵀ — the orthogonal polar factor of m.
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(ns_steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if m.shape[0] > m.shape[1]:
        X = X.T
    W.add_(X, alpha=-lr)
    return W

The (3.4445, −4.7750, 2.0315) coefficients aren’t knobs to tune — they were fit so that a handful of Newton-Schulz iterations drive the singular values of m close to 1, i.e. toward the orthogonal polar factor U Vᵀ (Bernstein–Newhouse 2024). Muon’s whole win lives in this iteration; copying the wrong coefficients silently gives you an inferior optimizer.
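A quick sanity check you can run on the iteration alone (a sketch; the random matrix stands in for a momentum-buffered gradient): after five steps the singular values should all sit near 1, meaning X approximates the orthogonal polar factor of G.

import torch

G = torch.randn(512, 1024)          # rows <= cols, so no transpose is needed
X = G / (G.norm() + 1e-7)
a, b, c = 3.4445, -4.7750, 2.0315
for _ in range(5):
    A = X @ X.T
    X = a * X + (b * A + c * (A @ A)) @ X
sv = torch.linalg.svdvals(X)
print(sv.min().item(), sv.max().item())   # both land near 1, not exactly 1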

State: 1 buffer (m), like Lion. Compute: a few extra matmuls per step (the NS iterations) — a meaningful but small fraction of the total.

Empirically: ~30% faster convergence per token on small/medium models, and the gap holds at scale in the public reports through 2024. Reportedly used in production by labs through 2025; see the Llama-4 family release notes for hybrid AdamW-Muon configs.

Real-world picks (April 2026)

Use case | Optimizer | Why
Pretraining, big LLM | AdamW (still the safest default) | Production-validated, no surprises
Pretraining, frontier labs experimenting | Muon for hidden, AdamW for embeddings/heads | 30% faster convergence on a curve that compounds
Resource-constrained pretraining | Lion | Half the optimizer memory, often equal quality
LoRA / fine-tuning | AdamW with paged optimizers (bitsandbytes) | Adapters are tiny anyway; reuse what works
RL post-training (PPO, GRPO) | AdamW | RL stability is fragile; don’t change two things at once

Run it in your browser — optimizer memory cost

Compare optimizer state across SGD, AdamW, Lion, Muon for several model sizes.
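The in-page snippet isn’t reproduced here; a rough equivalent looks like this (assumptions: FP32 master weights plus FP32 moment buffers, decimal GB, and Muon counted like Lion’s single buffer, ignoring its AdamW-handled embeddings/heads):

GB = 1e9  # decimal gigabytes, matching the figures quoted above

BUFFERS = {"sgd-momentum": 1, "adamw": 2, "lion": 1, "muon": 1}

def optimizer_state_gb(n_params, name):
    # 4 bytes of FP32 master weights + 4 bytes per FP32 moment buffer
    return n_params * (4 + 4 * BUFFERS[name]) / GB

for n_params in (1e9, 7e9, 13e9, 70e9):
    row = "  ".join(f"{name}: {optimizer_state_gb(n_params, name):6.0f} GB"
                    for name in BUFFERS)
    print(f"{n_params / 1e9:>4.0f}B   {row}")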

Notice AdamW’s optimizer-side state (FP32 master + m + v, 12 bytes/param) is ~50% larger than Lion/Muon’s (FP32 master + m, 8 bytes/param) — that’s the practical motivation for switching, on top of any quality story.

Quick check

You're constrained to 80 GB of GPU memory and want to train a 13B model. AdamW is OOMing on optimizer state alone. Which is the most practical fix that doesn't change architectures?

Key takeaways

  1. AdamW remains the safe default for pretraining. Production-validated, well-tooled.
  2. Lion and Muon halve the optimizer’s moment buffers (one buffer instead of AdamW’s two). Real money for big runs; rough quality parity (Lion) or improvement (Muon) on language modeling.
  3. Muon orthogonalizes hidden-layer updates. Acts as a cheap preconditioner. Frontier choice in 2024–2025.
  4. Decoupled weight decay (the W in AdamW) is not optional. Plain Adam’s coupled WD has parameter-specific effective strength.
  5. Hybrid configurations are common — Muon for big matmul weights, AdamW for embeddings and heads where the math is different.

Go deeper