
The Full RLHF Pipeline

RLHF — Reinforcement Learning from Human Feedback — is the recipe that took GPT-3 (interesting but unusable) and produced ChatGPT (a product). It’s three models in a trench coat: an SFT’d policy, a learned reward model, and a PPO loop that tunes the policy against the reward model with a KL anchor against the SFT. The story of post-training from 2022 to 2025 is essentially “what can we replace each of these three components with?” DPO killed the explicit reward model. GRPO killed the value head. Verifiable rewards killed the human labels. But the shape of the pipeline is still the reference architecture, and you’ll see it in every paper.

TL;DR

  • Three stages: (1) Supervised fine-tuning on high-quality demos; (2) train a reward model from pairwise human preferences; (3) PPO-fine-tune the SFT model against the RM with a KL anchor to the SFT.
  • The reward model is a separate network (often the same architecture, with a classification head on the last token) that outputs a scalar score $r(x, y)$. It is trained with the Bradley-Terry pairwise-preference loss.
  • The PPO loop treats the SFT model as the reference (KL anchor) and as the initialization. The RM stays frozen during PPO.
  • Why three stages: SFT alone teaches the model the kind of response shape we want; RM alone scores responses; PPO is how the model generalizes the RM’s preferences to new prompts the RM was never trained on.
  • Cost: brutal. ~3× a single training run because you’re holding ≥3 models in memory: policy, reference, reward model (and a value head). This is why DPO and GRPO matter.

Why this matters

If you understand the InstructGPT recipe you understand the architecture of every preference-tuned model. DPO is “RLHF without explicit RL”. GRPO is “RLHF without a critic”. Constitutional AI is “RLHF where AI writes the preferences”. Knowing the original lets you read the simplifications as what they are.

The concept

Stage 1: SFT. Take a pretrained base model. Fine-tune on $(\text{prompt}, \text{demonstration})$ pairs with cross-entropy loss. Result: a model that responds to prompts in roughly the right form (chatty, follows instructions superficially). ChatGPT-1 quality if you stop here. The demo dataset is small (~10K-100K examples) and high-quality (paid contractors writing model responses).
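As a minimal sketch of the Stage 1 objective, the snippet below computes token-level cross-entropy over one demonstration. Restricting the loss to demonstration tokens (masking out the prompt) is a common convention assumed here, not something the recipe above specifies; `sft_loss` and the toy probabilities are hypothetical.

```python
import math

def sft_loss(token_probs, is_demo_token):
    """Mean negative log-likelihood over demonstration tokens only.

    token_probs[i]   : probability the model assigned to the i-th target token
    is_demo_token[i] : True for demonstration tokens, False for prompt tokens
    (Masking the prompt is an assumption of this sketch.)
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_demo_token) if keep]
    if not losses:
        raise ValueError("no demonstration tokens to train on")
    return sum(losses) / len(losses)

# Toy example: 2 prompt tokens (masked out) followed by 3 demonstration tokens.
probs = [0.10, 0.20, 0.50, 0.25, 0.80]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
```

Only the last three probabilities contribute, so raising the model's confidence on demonstration tokens is the only way to lower this loss.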

Stage 2: Reward model (RM).

  • Collect pairwise preferences from human labelers: given a prompt and two completions, which is better?
  • Train a model $r_\phi(x, y)$ that scores any completion, using the Bradley-Terry loss:
$$\mathcal{L}_{RM} = -\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

where $y_w$ is the preferred completion and $y_l$ is the rejected one. The loss depends only on the margin between the two scores, not on their absolute values.

  • The RM is usually initialized from the SFT model with the LM head swapped for a regression head.
  • Critical: the RM is the bottleneck on RLHF quality. Garbage labels → garbage RM → garbage PPO.
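The Bradley-Terry loss above can be sketched for a single preference pair in plain Python. This is an illustration under stated assumptions, not a training loop: `bradley_terry_loss` is a hypothetical helper, and the scores are made-up numbers standing in for $r_\phi(x, y_w)$ and $r_\phi(x, y_l)$.

```python
import math

def bradley_terry_loss(score_chosen, score_rejected):
    """-log sigmoid(r_w - r_l), written in a numerically stable form."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

loss_tie = bradley_terry_loss(0.0, 0.0)         # RM cannot tell the pair apart
loss_right = bradley_terry_loss(2.0, -1.0)      # RM ranks the pair correctly
loss_wrong = bradley_terry_loss(-1.0, 2.0)      # RM ranks the pair backwards
loss_shifted = bradley_terry_loss(102.0, 99.0)  # same margin, shifted scores
```

`loss_shifted` equals `loss_right` because only the difference between the two scores enters the loss; adding a constant to both changes nothing.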

Stage 3: PPO.

  • Initialize policy from SFT. Hold a frozen copy as the reference policy $\pi_{ref}$.
  • For each prompt: sample completion from policy, score with RM, compute advantage.
  • PPO loss with KL anchor:
$$\mathcal{L} = -\mathbb{E}[\text{PPO clip term}] + \beta \cdot KL(\pi_\theta \| \pi_{ref})$$
  • $\beta$ is the anchor strength: high $\beta$ keeps the policy near SFT, low $\beta$ lets it roam.
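To make the loss above concrete, here is a per-sample sketch of the clipped surrogate plus the KL anchor. The function name `ppo_kl_loss`, the defaults `clip_eps=0.2` and `beta=0.1`, and the single-sample KL estimate (log-probability difference on the sampled completion) are assumptions of this sketch rather than values taken from the text.

```python
import math

def ppo_kl_loss(logp_new, logp_old, logp_ref, advantage, clip_eps=0.2, beta=0.1):
    """Per-sample loss: -(clipped surrogate) + beta * KL estimate.

    logp_new : log-prob of the sampled completion under the current policy
    logp_old : log-prob under the policy that generated the sample
    logp_ref : log-prob under the frozen SFT reference
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # PPO takes the more pessimistic of the clipped and unclipped terms.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Assumed single-sample estimate of KL(pi_theta || pi_ref).
    kl_estimate = logp_new - logp_ref
    return -surrogate + beta * kl_estimate

# Policy identical to both old policy and reference: no clipping, no KL cost.
at_start = ppo_kl_loss(-5.0, -5.0, -5.0, advantage=1.0)
# Policy has raised the sample's log-prob by 1 nat: ratio is clipped at 1.2
# and the KL anchor starts to charge for the drift.
drifted = ppo_kl_loss(-4.0, -5.0, -5.0, advantage=1.0)
```

Raising `beta` makes the same drift more expensive, which is the "anchor strength" behavior described in the last bullet.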

Memory footprint (the engineering nightmare)

For a 7B policy in BF16, without quantization, in standard implementations:

| Component | Approximate memory |
| --- | --- |
| Policy $\pi_\theta$ | 14 GB (BF16) + optimizer (28 GB Adam) |
| Reference $\pi_{ref}$ (frozen) | 14 GB |
| Reward model $r_\phi$ (frozen) | 14 GB (same size as policy, usually) |
| Critic / value head | ~14 GB (if separate model) |
| Activations during rollout | Variable, can dominate |

That’s ~85 GB for a 7B model before counting activations, which doesn’t fit on a single H100 (80 GB) without ZeRO or quantization. This is why DPO eliminating the RM and GRPO eliminating the critic are huge engineering wins, not just paper aesthetics.
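The arithmetic behind that figure can be reproduced in a few lines. This sketch assumes 2 bytes per parameter for BF16, 1 GB = 1e9 bytes, Adam moment buffers also stored in BF16 (which is what the 28 GB figure implies), and, like the table, no optimizer state for the critic; `weights_gb` is a hypothetical helper.

```python
def weights_gb(n_params_billion, bytes_per_param=2):
    """Weights-only memory in GB, taking 1 GB = 1e9 bytes."""
    return n_params_billion * bytes_per_param

w = weights_gb(7)  # one 7B model in BF16

budget_gb = {
    "policy weights": w,
    "policy Adam states": 2 * w,   # two moment buffers, assumed BF16
    "reference (frozen)": w,
    "reward model (frozen)": w,
    "critic (separate model)": w,
}

total_gb = sum(budget_gb.values())  # activations are not included
fits_on_h100 = total_gb <= 80
```

Keeping Adam states or master weights in FP32, or adding optimizer state for a trainable critic, would push the total higher than this estimate.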

Mental model

Two frozen models (ref + RM) anchor the training; one trainable model (policy) moves; the loss is “do well on the RM while staying close to the reference.”

Key takeaways

  1. InstructGPT (2022) is the reference architecture. Every newer method is a delta from this.
  2. The RM is the bottleneck. Label quality determines ceiling.
  3. Memory cost is brutal: 3-4 models held simultaneously. Much of post-training research since 2022 has been about removing models from this stack.
  4. KL anchor $\beta$ matters more than any other PPO hyperparameter. Too high → no learning. Too low → reward hacking.
  5. Reward hacking is real. Models find ways to game the RM that bypass the spirit of the preferences. Mitigations: better RMs, a stronger KL penalty (larger $\beta$), ensemble RMs, RLAIF.

Go deeper

TL;DR

  • 3 stages: SFT → RM (Bradley-Terry) → PPO with KL anchor to SFT.
  • Holds 3-4 models simultaneously (policy, ref, RM, optional critic).
  • RM quality is the ceiling; KL $\beta$ is the biggest knob.
  • Foundation for every preference-tuned model post-2022.

Why this matters

Reference architecture. Read every newer method as a delta from RLHF.

Concrete walkthrough

Stage 2 — RM loss:

$$\mathcal{L}_{RM} = -\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

Stage 3 — full PPO loss:

$$\mathcal{L}_{PPO} = \mathcal{L}^{CLIP} + c_v \mathcal{L}_{critic} + \beta \cdot KL(\pi_\theta \| \pi_{ref})$$

where $\mathcal{L}^{CLIP} = -\mathbb{E}[\text{PPO clip term}]$ is the negated clipped surrogate and $c_v$ weights the critic (value) loss, with reward signal: $\tilde{r}(x, y) = r_\phi(x, y) - \beta \log(\pi_\theta(y \mid x) / \pi_{ref}(y \mid x))$.

The KL appears in two places in many implementations: as a reward-shaping term (a per-token KL penalty subtracted from the RM score) and as an explicit loss term. Both express the same penalty in expectation, but the per-token version gives cleaner gradients.
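The reward-shaping form can be sketched per token as follows. Placing the RM score on the final token and using the per-token log-ratio as the KL estimate are common conventions assumed here; `shaped_rewards`, `beta=0.1`, and the toy log-probabilities are hypothetical.

```python
def shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards: -beta * log-ratio at every token,
    with the sequence-level RM score added on the last token (assumed convention)."""
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Toy 3-token completion: the policy is more confident than the reference
# on the first and last tokens, so those tokens pay a KL penalty.
rewards = shaped_rewards(
    rm_score=1.0,
    logp_policy=[-1.0, -2.0, -0.5],
    logp_ref=[-1.5, -2.0, -1.0],
)
total = sum(rewards)
```

Summing the per-token rewards recovers the sequence-level quantity $r_\phi(x, y) - \beta \log(\pi_\theta(y) / \pi_{ref}(y))$, since the log-ratio of the full completion is the sum of the per-token log-ratios.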

Memory budget for 7B (BF16)

| Component | GB |
| --- | --- |
| Policy + Adam | ~42 |
| Reference (frozen) | ~14 |
| Reward model (frozen) | ~14 |
| Critic head | ~14 |
| Activations (peak) | 20-40 |
| Total | ~100-120 |

Engineering implications: ZeRO-3 or FSDP-2 for the policy; quantize ref + RM to INT8 (saves ~7 GB each, ~14 GB combined); offload the critic to CPU during rollouts (it’s only used during the update step).
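A rough what-if calculation shows how much these mitigations buy on one device. It uses the component sizes from the table, assumes INT8 halves each frozen model's BF16 footprint (about 14 GB down to about 7 GB), ignores activations, and does not model ZeRO/FSDP sharding; `resident_gb` is a hypothetical helper.

```python
BASELINE_GB = {
    "policy + Adam": 42,
    "reference": 14,
    "reward model": 14,
    "critic": 14,
}

def resident_gb(int8_frozen=False, critic_on_cpu=False):
    """GPU-resident model state in GB, excluding activations."""
    budget = dict(BASELINE_GB)
    if int8_frozen:
        # INT8 stores 1 byte per parameter instead of 2 for BF16.
        budget["reference"] /= 2
        budget["reward model"] /= 2
    if critic_on_cpu:
        # Only holds while the critic is offloaded, i.e. during rollouts.
        budget["critic"] = 0
    return sum(budget.values())

baseline = resident_gb()
quantized = resident_gb(int8_frozen=True)
rollout_phase = resident_gb(int8_frozen=True, critic_on_cpu=True)
```

Under these assumptions the model state drops from 84 GB to 70 GB with quantization and to 56 GB during rollouts, which leaves room for activations on an 80 GB card only in the rollout phase.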

Key takeaways

  1. SFT → RM → PPO.
  2. RM = Bradley-Terry on pairwise prefs.
  3. KL anchor against frozen SFT is the load-bearing regularizer.
  4. ~100 GB for a 7B in BF16: the reason DPO/GRPO exist.

Go deeper