DPO / IPO / KTO

Prereqs: SFT & Instruction Tune. DPO is what you do after SFT to teach preferences without the RL machinery.

The 2022-era post-training stack looked like a distributed-systems problem. You trained an SFT model, then trained a separate reward-model network on human preference labels, then ran PPO with the reward model in the loop — sample a rollout from the policy, score it, KL-penalize against the SFT reference, gradient step, repeat. Three networks in memory. Online sampling on every step. A whole RL infra team to keep it running.

Then in May 2023 a paper showed that the optimal policy under that whole RLHF setup has a closed form — and if you plug it back into the Bradley-Terry preference model, the reward function cancels out. The policy itself, divided by a frozen reference, is the implicit reward model. The result is a single supervised loss on (prompt, chosen, rejected) triples. No reward model. No rollouts. No RL.

That paper is Direct Preference Optimization: Your Language Model Is Secretly a Reward Model (Rafailov et al., 2023), and within a year almost every open post-training pipeline had switched. This lesson covers the loss, the variants (IPO, KTO), and the failure modes — because as much as DPO is “just supervised learning,” the failure modes are real.

TL;DR

  • DPO (Direct Preference Optimization) — Rafailov et al., 2023 — replaces PPO-RLHF with a closed-form classification loss on chosen vs rejected response pairs. No reward model, no RL, no rollouts. Same gradient direction; vastly simpler training.
  • The DPO loss derives from the same Bradley-Terry preference model RLHF uses, but with the reward function eliminated analytically — the policy is its own implicit reward model.
  • IPO (Identity Preference Optimization) — Azar et al., 2023 — a regularization fix for DPO that prevents over-optimization when preference data is noisy or near-tied.
  • KTO (Kahneman-Tversky Optimization) — ContextualAI, 2024 — uses a single thumbs-up / thumbs-down label per response (no pairs needed). Closer to real production feedback signal.
  • For 2026 production: DPO is the default for paired-preference data; KTO when you only have unary feedback. IPO and other variants come up when DPO drifts.

Why this matters

PPO-RLHF (2022 GPT-3.5 / GPT-4 era) requires rollout collection, a separate reward model, and PPO’s complex training loop with KL penalties. DPO collapses this into a single supervised loss — same idea, ~10× simpler infrastructure. Almost every open-model post-training pipeline in 2023–2026 uses DPO or a variant because the engineering cost of PPO-RLHF was prohibitive for any team without an RL infra group. Knowing DPO is the price of admission for any post-training conversation today.

Mental model

The model trains to make its log-probability ratio of chosen-vs-rejected larger relative to the reference policy. No reward model, no sampling — just two forward passes through the policy per training example (plus two through the frozen reference).

Concrete walkthrough

The DPO loss

Given a prompt x, a chosen response y_w (“winner”), a rejected response y_l (“loser”), and a frozen reference policy π_ref (typically your SFT’d model), the DPO loss is:

\mathcal{L}_\text{DPO} = -\log \sigma\left( \beta \cdot \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} \right] \right)

In words:

  • Compute the log-ratio of the current policy’s probability to the reference policy’s probability, for both the chosen and the rejected response.
  • Take the difference of those two log-ratios and scale it by β.
  • Apply a sigmoid (Bradley-Terry → binary classification).
  • Negative log-likelihood as the loss.

The hyperparameter β controls how much the policy is allowed to deviate from the reference. Typical values: 0.1–0.5. Higher β = stay closer to reference (less risk, less gain); lower β = move further (more risk, more reward).
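For a concrete feel, take a pair where the policy’s chosen-minus-rejected summed log-prob gap exceeds the reference’s gap by 4 nats (made-up numbers):

-\log\sigma(0.1 \cdot 4) = -\log(0.599) \approx 0.51 \qquad\text{vs.}\qquad -\log\sigma(0.5 \cdot 4) = -\log(0.881) \approx 0.13

At β = 0.5 that margin already nearly saturates the loss, so the policy doesn’t need to drift far from the reference to satisfy the data; at β = 0.1 the loss still rewards pushing the margin further, which is where the over-optimization risk discussed below comes from.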

Why this works (intuition)

The DPO paper proves: if you set up the standard RLHF objective (maximize reward subject to KL constraint vs reference) and analytically eliminate the reward function from the optimal policy expression, you get exactly the loss above. The policy itself, divided by the reference, is the implicit reward model.
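The algebra behind that claim, in one line (the paper’s standard derivation, stated here without proof):

\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_\text{ref}(y \mid x)\,\exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right)
\quad\Longleftrightarrow\quad
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)

Plugging that expression for r into the Bradley-Terry probability σ(r(x, y_w) - r(x, y_l)) cancels the intractable β log Z(x) term, since both responses share the same prompt; what remains, with π_θ in place of π*, is exactly the loss above.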

The practical consequence: training the policy with this loss is equivalent to training it with the optimal RLHF reward — but without ever sampling from the policy or training a separate reward model. Everything is offline supervised learning.

The training loop

import torch
from torch.nn.functional import logsigmoid

def dpo_loss(model, ref_model, prompts, chosen, rejected, beta=0.1):
    """Compute DPO loss on a batch of preference pairs."""
    # Two forward passes through the current policy
    chosen_logits = model(prompts + chosen).logits
    rejected_logits = model(prompts + rejected).logits

    # Two forward passes through the frozen reference (no grad)
    with torch.no_grad():
        chosen_logits_ref = ref_model(prompts + chosen).logits
        rejected_logits_ref = ref_model(prompts + rejected).logits

    # Sum log-probs over the response tokens
    # (sum_token_logps: helper that masks prompt tokens and sums per-token log-probs)
    chosen_logps = sum_token_logps(chosen_logits, chosen)
    rejected_logps = sum_token_logps(rejected_logits, rejected)
    chosen_logps_ref = sum_token_logps(chosen_logits_ref, chosen)
    rejected_logps_ref = sum_token_logps(rejected_logits_ref, rejected)

    # The DPO loss
    pi_logratio = chosen_logps - rejected_logps
    ref_logratio = chosen_logps_ref - rejected_logps_ref
    return -logsigmoid(beta * (pi_logratio - ref_logratio)).mean()

# Train loop
for batch in preference_dataloader:
    optimizer.zero_grad()
    loss = dpo_loss(model, ref_model, batch.prompts, batch.chosen, batch.rejected)
    loss.backward()
    optimizer.step()

That’s the entire training loop. Each example needs 4 forward passes (chosen + rejected, current + ref). With smart caching of the ref-model passes (since the ref doesn’t change), you can pre-compute reference log-probs once and store them, halving training compute.
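A sketch of that caching pass, at the same level of pseudocode as the loop above (sum_token_logps and the dataloader are the same placeholders):

import torch

# One-time pass: score every (prompt, chosen, rejected) pair with the frozen
# reference and stash the summed log-probs alongside the dataset.
ref_logps = []
with torch.no_grad():
    for batch in preference_dataloader:
        chosen_ref = sum_token_logps(ref_model(batch.prompts + batch.chosen).logits, batch.chosen)
        rejected_ref = sum_token_logps(ref_model(batch.prompts + batch.rejected).logits, batch.rejected)
        ref_logps.append((chosen_ref.cpu(), rejected_ref.cpu()))

# During training, read the cached values instead of calling ref_model:
#   ref_logratio = cached_chosen_ref - cached_rejected_ref
# This drops the per-step forward passes from four to two and lets you
# free the reference model from GPU memory entirely.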

Hyperparameters that matter

Knob              | Default                    | Notes
β                 | 0.1                        | Higher = closer to ref (less divergence). Tune on eval.
Learning rate     | 1e-7 to 5e-7               | Way smaller than SFT. DPO is delicate.
Number of epochs  | 1–3                        | More is usually worse. DPO over-fits preference data fast.
Batch size        | 32–256 (preference pairs)  | Smaller than SFT batches because each pair is two sequences.
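If you run DPO through TRL rather than hand-rolling the loop, the table maps roughly onto its DPOConfig. Treat this as a sketch: argument names drift between trl versions, and model, ref_model, tokenizer, and preference_dataset are placeholders for your own objects.

from trl import DPOConfig, DPOTrainer

# Rough mapping of the table above onto TRL's DPO trainer.
config = DPOConfig(
    output_dir="dpo-run",
    beta=0.1,                          # the KL-strength knob from the table
    learning_rate=5e-7,                # far below typical SFT learning rates
    num_train_epochs=1,
    per_device_train_batch_size=4,     # reach the global batch via accumulation
    gradient_accumulation_steps=16,
)
trainer = DPOTrainer(
    model=model,                       # your SFT'd policy
    ref_model=ref_model,               # frozen copy of the same checkpoint
    args=config,
    train_dataset=preference_dataset,  # columns: prompt, chosen, rejected
    processing_class=tokenizer,        # named `tokenizer=` in older trl versions
)
trainer.train()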

The most common DPO failure mode: over-optimization. With β too low or too many epochs, the policy drifts so far from the reference that it generates pathological responses that satisfy the preference data but are bad in general. Always evaluate on held-out preference pairs and on general benchmarks (MMLU, etc.); both should improve or stay flat. If MMLU drops, β is too low or training too long.
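One cheap guardrail is to track implicit-reward accuracy on a held-out preference split alongside the general benchmark. A sketch, reusing the same schematic sum_token_logps helper and batch layout as the training loop above:

import torch

@torch.no_grad()
def preference_accuracy(model, ref_model, heldout_loader):
    """Fraction of held-out pairs where the policy's margin over the reference favors `chosen`."""
    correct, total = 0, 0
    for batch in heldout_loader:
        chosen_lp = sum_token_logps(model(batch.prompts + batch.chosen).logits, batch.chosen)
        rejected_lp = sum_token_logps(model(batch.prompts + batch.rejected).logits, batch.rejected)
        chosen_ref = sum_token_logps(ref_model(batch.prompts + batch.chosen).logits, batch.chosen)
        rejected_ref = sum_token_logps(ref_model(batch.prompts + batch.rejected).logits, batch.rejected)
        margin = (chosen_lp - chosen_ref) - (rejected_lp - rejected_ref)
        correct += (margin > 0).sum().item()
        total += margin.numel()
    return correct / total

# Rising preference accuracy with a flat general-benchmark score is healthy;
# rising accuracy with a falling general score is the over-optimization signature.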

IPO — the regularization fix

IPO (Identity Preference Optimization, Azar et al., 2023) swaps DPO’s logistic loss for a squared loss that pulls the policy-vs-reference log-ratio gap toward a fixed target set by its regularization parameter τ, instead of rewarding ever-larger gaps. That built-in cap is what fights the over-optimization tendency; a sketch of the loss follows below.
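A sketch of the per-pair IPO loss as I read Azar et al., reusing the pi_logratio and ref_logratio terms computed in dpo_loss above (τ is IPO’s regularization strength, playing roughly the role β plays in DPO):

def ipo_loss(pi_logratio, ref_logratio, tau=0.1):
    # pi_logratio / ref_logratio: the same (chosen - rejected) summed
    # log-prob differences computed in dpo_loss above.
    h = pi_logratio - ref_logratio
    # Squared loss toward a fixed target: the gap is pulled to 1/(2τ)
    # rather than pushed toward +inf as in the logistic DPO loss.
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()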

In practice, IPO matters most when:

  • Preference data is noisy (annotators disagree).
  • Many “near-tie” pairs (chosen barely beats rejected).
  • The base model is already strong (less room to grow).

For most tasks, DPO works well. Reach for IPO when DPO is over-fitting on your eval.

KTO — the no-pair version

KTO (Kahneman-Tversky Optimization, ContextualAI 2024) uses unary feedback: one response, one label (good/bad). No pairs needed.

The loss derives from prospect theory (loss aversion); it treats the cost of a “bad” response as proportionally larger than the gain of a “good” one. Training-wise, it looks similar to DPO but each example contributes one term, not two:

# Unary preference: (prompt, response, label ∈ {good, bad})
if label == 'good':
    loss = -logsigmoid(beta * (logp - logp_ref) - lambda_good)
else:
    loss = -logsigmoid(lambda_bad - beta * (logp - logp_ref))

KTO is the right choice when your production feedback is unary — thumbs-up/down, click vs no-click, success vs failure. You don’t have to construct artificial pairs from production logs. This makes KTO compelling for product-feedback-driven post-training.
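Data prep is correspondingly flat: each logged interaction becomes one training row. The column names below are illustrative, not a required schema:

# Hypothetical production_log: iterable of (prompt, response, thumbs) feedback events.
kto_rows = [
    {"prompt": p, "completion": r, "label": thumbs == "up"}
    for p, r, thumbs in production_log
]
# No pairing step, no hunt for a matching "rejected" twin, no margin construction.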

What DPO replaced and why

The 2022-era post-training stack:

  1. SFT
  2. Train a reward model (a separate transformer that scores responses).
  3. PPO-RLHF: sample from policy → score with reward model → PPO step → repeat. Includes KL penalty against SFT model.

The 2023+ DPO stack:

  1. SFT
  2. DPO on preference pairs.

That’s it. No reward model, no rollouts, no RL infrastructure. The DPO paper reported results on par with PPO-RLHF on standard benchmarks, and the simplification was so dramatic that within a year of DPO’s release, most open-source post-training pipelines (Llama-3, Mistral, Qwen 2, DeepSeek series) had switched.

PPO-RLHF still has uses — when reward signal is too sparse or non-verifiable for offline preference data — but for “make the model prefer chosen over rejected” workflows, DPO won decisively.

And then GRPO came along

For reasoning workloads (math, code, formal logic), DPO has limits — it works on the response level, not on intermediate reasoning steps. GRPO (DeepSeek-R1’s recipe — see GRPO & RL Reasoning) brings RL back specifically for verifiable-reward reasoning. The 2026 picture:

  • General preferences (helpfulness, style, safety) → DPO.
  • Verifiable rewards (math, code, exact answers) → GRPO.
  • Most production stacks use both: DPO for chat polish, GRPO for reasoning power.

Run it in your browser — DPO loss simulator

Python (editable): compute the DPO loss for synthetic log-probs and see how β and the policy-vs-reference gap interact.
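The interactive widget isn’t reproduced in this text; a minimal standalone equivalent, with purely synthetic margins, looks like this:

import numpy as np

def dpo_loss_from_margin(margin, beta):
    """-log σ(β · margin), where margin = policy log-ratio gap minus reference log-ratio gap."""
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

margins = np.linspace(-4, 8, 7)          # synthetic preference gaps
for beta in (0.1, 0.3, 0.5):
    losses = dpo_loss_from_margin(margins, beta)
    print(f"beta={beta}: " + " ".join(f"{l:.3f}" for l in losses))
# Loss falls as the margin grows, and falls faster for larger beta:
# the dynamic described below.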

The output shape — loss decreasing as the policy’s preference gap grows, faster for higher β — is the entire DPO training dynamic in miniature.

Quick check

Fill in the blank
The 2024 preference-optimization variant that uses unary thumbs-up/thumbs-down feedback instead of paired chosen-vs-rejected data:
Three letters; named after a famous pair of behavioral psychologists (one of them a Nobel laureate).
Quick check
A team running DPO sees their model improving on the preference benchmark but regressing on MMLU by 4 points. Most likely cause:

Key takeaways

  1. DPO replaced PPO-RLHF as the post-training default. No reward model, no rollouts, no RL.
  2. The loss is closed-form: a sigmoid over the difference of policy-vs-reference log-ratios for chosen and rejected.
  3. β is the KL regularizer. Too low → over-optimization (good on preference, bad on general capability). Tune on held-out eval.
  4. IPO regularizes DPO for noisy / near-tied preference data. KTO uses unary feedback instead of pairs.
  5. DPO + GRPO is the 2026 production combo. DPO for general preferences; GRPO for verifiable-reward reasoning.

Go deeper