OpenRLHF & NeMo-RL Internals

verl is where the latest research lands. OpenRLHF is where you’ll find the cleanest open codebase: easier to read end-to-end, easier to land a first PR. NeMo-RL is NVIDIA’s stack, with Megatron-grade distributed training and a published 2.5× speedup result. All three matter. This lesson covers the rest of the production RL landscape: what each framework is good at, when to pick which, and where the contribution opportunities are.

TL;DR

OpenRLHF (~150 contributors, very active): the cleanest Ray-based open RL trainer. Built around vLLM rollouts and Hugging Face models. Recommended first RL framework to read.
NeMo-RL (NVIDIA): production-scale with Megatron parallelism (TP + PP + EP). Hardest to learn, most powerful at large scale. Backed by NVIDIA’s published 2.5× speedup result.
TRL (Hugging Face): the easiest on-ramp; not Ray-based; single-node focus. Read TRL first if you’ve never touched RL code. Don’t use it at scale.
Other notables: trlx (CarperAI, mostly archived); Levanter / Marin (Stanford, JAX); AReaL (Tsinghua, async-focused); Hugging Face Open-R1 (recipe-focused, not a framework).
How to pick: if you’ve never touched RL code, read TRL first; otherwise read OpenRLHF first → graduate to verl for current research → reach for NeMo-RL at thousand-GPU scale.

Why this matters

A serious RL-systems engineer should be able to read all three. The patterns overlap (Ray, a single controller, a training backend split from a vLLM rollout engine), but each has its own conventions. Multi-framework fluency makes you portable across companies.

OpenRLHF

The pitch: open-source, production-grade RLHF with the cleanest code in the space. Ray actors, vLLM rollouts, DeepSpeed (ZeRO) trainers. Built by 50+ contributors across academia and industry.

Layout:


OpenRLHF/
├── openrlhf/
│   ├── trainer/
│   │   ├── ray/             # Ray-actor implementations
│   │   │   ├── ppo_actor.py     # PPO trainer Ray actor
│   │   │   ├── vllm_engine.py   # rollout integration
│   │   │   └── launcher.py      # cluster orchestration
│   │   ├── ppo_trainer.py   # single-node PPO
│   │   └── dpo_trainer.py
│   ├── models/              # model + LM head + value head wrappers
│   ├── datasets/
│   └── utils/
├── examples/
│   └── scripts/             # canonical command-line recipes
└── tests/

Reading order for the first 3 days:

examples/scripts/train_ppo_llama_ray.sh: the canonical Ray-based PPO recipe.
openrlhf/trainer/ray/launcher.py: how all the actors get launched.
openrlhf/trainer/ray/ppo_actor.py: the policy actor (DeepSpeed-backed, gradient compute).
openrlhf/trainer/ray/vllm_engine.py: rollout integration.
openrlhf/models/actor.py: policy + value head wrapping.
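
The reading order above traces one pattern: a launcher starts the workers, then a single controller drives rollout, scoring, training, and weight sync in turn. A minimal pure-Python sketch of that loop follows. Every name here (`RolloutWorker`, `PolicyWorker`, `reward_fn`, `run_controller`) is hypothetical and is not OpenRLHF's API; in the real codebase these roles are Ray actors and the rollout worker wraps vLLM.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutWorker:
    """Stand-in for a rollout engine (a vLLM-backed Ray actor in a real system)."""
    weights_version: int = 0

    def generate(self, prompts):
        # Toy "generation": echo the prompt, tagged with the weights version used.
        return [f"{p} -> answer(v{self.weights_version})" for p in prompts]

    def sync_weights(self, version):
        self.weights_version = version


@dataclass
class PolicyWorker:
    """Stand-in for the trainer that computes gradients and updates the policy."""
    version: int = 0
    seen_rewards: list = field(default_factory=list)

    def train_step(self, samples, rewards):
        self.seen_rewards.extend(rewards)
        self.version += 1  # pretend one optimizer step happened
        return self.version


def reward_fn(sample):
    # Hypothetical reward: 1.0 if the sample contains the word "answer".
    return 1.0 if "answer" in sample else 0.0


def run_controller(prompts, steps):
    """Single-controller loop: one driver sequences every worker call."""
    rollout, policy = RolloutWorker(), PolicyWorker()
    for _ in range(steps):
        samples = rollout.generate(prompts)                 # 1. rollout
        rewards = [reward_fn(s) for s in samples]           # 2. score
        new_version = policy.train_step(samples, rewards)   # 3. train
        rollout.sync_weights(new_version)                   # 4. push fresh weights
    return rollout, policy
```

When reading launcher.py and ppo_actor.py, it can help to map each file onto one of these four steps before worrying about the distributed details.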

Contribution opportunities (typical merge time 3-7 days):

A new sampling parameter passthrough.
A new dataset loader in openrlhf/datasets/.
Documentation improvements (readily accepted).
A new verifier / reward function.
A fix for a vLLM integration edge case.
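
To make the "new verifier / reward function" item concrete, here is a small rule-based verifier of the kind that could be proposed as a first PR. It is a sketch under assumptions: the function names and the `(response, reference)` signature are hypothetical, and each framework defines its own reward-function interface, so check the target repo before adapting it.

```python
import re

# Matches integers and simple decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def last_number(text):
    """Return the last numeric literal in `text` as a float, or None."""
    matches = _NUMBER.findall(text)
    return float(matches[-1]) if matches else None


def math_verifier(response, reference, tol=1e-6):
    """Return 1.0 if the response's final number matches the reference, else 0.0."""
    predicted = last_number(response)
    expected = last_number(reference)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) <= tol else 0.0
```

A verifier like this is deliberately strict and cheap; a real contribution would also need tests for the formats the target dataset actually uses.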

The maintainers actively engage with new contributors; #openrlhf on the OpenRLHF Discord is responsive.

NeMo-RL

The pitch: NVIDIA’s production RL stack, built on Megatron-LM for the trainer. Designed for thousand-GPU scale. The published 2.5× speedup (NVIDIA blog, 2025) over previous RL frameworks demonstrates the engineering depth.

Architecture (different from verl/OpenRLHF):

Trainer uses Megatron-LM (TP + PP + EP + DP, not just FSDP). Crucial for 70B+ models.
Rollouts use TensorRT-LLM or vLLM (configurable).
Reward model can be a separate service or a Megatron-served model.
Designed for NVIDIA SuperPOD / NeMo Framework integration; less plug-and-play for general infra.
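
The TP + PP + EP + DP combination is easier to reason about with the GPU arithmetic written out. The helper below is a simplified sketch, not NeMo-RL or Megatron-LM code: it assumes each model replica occupies `tp * pp` GPUs and that expert parallelism reuses data-parallel ranks, so `ep` must divide the data-parallel degree. The function name and these rules are assumptions for illustration; the exact constraints depend on the framework version.

```python
def data_parallel_size(world_size, tp, pp, ep=1):
    """Derive the data-parallel degree left over after TP and PP.

    Assumption (simplified): one model replica spans tp * pp GPUs, and
    expert parallelism is carved out of data-parallel ranks, so ep | dp.
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={model_parallel}"
        )
    dp = world_size // model_parallel
    if dp % ep != 0:
        raise ValueError(f"ep={ep} must divide dp={dp}")
    return dp
```

For example, under these assumptions 256 GPUs with TP=8 and PP=4 leave 8 data-parallel replicas, which is why large tensor and pipeline degrees only pay off on big clusters.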

When to pick: model size > 70B, training cluster > 256 GPUs, NVIDIA hardware exclusively. For smaller models or non-NVIDIA setups, the operational complexity isn’t worth it.

Contribution surface: smaller than verl/OpenRLHF because the codebase is more tightly coupled to NVIDIA frameworks. PRs are reviewed by NVIDIA staff. The bar is higher, and so is the visibility.

TRL (Hugging Face)

The pitch: the easiest RL-training library. Not Ray-based. Designed for single-node use, accessibility, and inclusion in HF tutorials. It is the library behind most “I tried RLHF in a notebook” demos.

When to use:

Learning. Read the PPO and GRPO trainers as plain Python before tackling Ray-based frameworks.
Single-node experiments < 7B.
Quick prototyping.
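
Part of why a PPO trainer is approachable as plain Python is that the core objective is small. As a framework-independent illustration (this is not TRL's code, and the function name and list-based inputs are assumptions), the clipped surrogate loss over per-token log-probabilities can be sketched as:

```python
import math


def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean clipped-surrogate loss over tokens (negated so lower is better)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(token) / pi_old(token)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

Real trainers compute the same quantity on batched tensors with masking and add value-loss and KL terms; those pieces are omitted here.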

When NOT to use: any production setup beyond single-node. TRL’s distributed support is improving but is not the strength of the project.

Key files:

trl/trainer/ppo_trainer.py — the cleanest reference PPO-for-LLM implementation in any open codebase.
trl/trainer/grpo_trainer.py — GRPO implementation.
trl/trainer/dpo_trainer.py — DPO.
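
Before opening the GRPO trainer, it helps to have the central idea in hand: advantages are computed relative to the other completions sampled for the same prompt, with no value network. The sketch below is an illustration rather than TRL's implementation; the function name is hypothetical, and using the population standard deviation with a small `eps` is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of completions sampled from the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal, which is worth keeping in mind when reading how the trainer batches prompts.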

Comparison table

Framework	Best for	Backend	Distributed	Code clarity	PR review pace
TRL	Learning, single-node	FSDP / Accelerate	Limited	Highest	Fast review
OpenRLHF	Production small-medium scale, learning Ray patterns	DeepSpeed + vLLM via Ray	Yes	High	Fast (3-7d)
verl	Current research, GRPO+RLVR, large scale	FSDP/Megatron + vLLM/SGLang	Yes	Medium	Medium (5-14d)
NeMo-RL	Largest scale, NVIDIA stack	Megatron + TRT-LLM/vLLM	Yes	Medium	Slow (NVIDIA review)

Key takeaways

OpenRLHF is the easiest production-grade RL codebase to read. Start here.
TRL is the gateway drug. Read TRL’s PPO trainer before any Ray-based code.
NeMo-RL is for thousand-GPU NVIDIA-only setups. Higher bar, higher prestige.
Multi-framework fluency is the differentiator. Frontier-lab interviews probe “have you read X’s code?”
First PR targets are the same across frameworks: verifier function, sampling param, recipe config, dataset loader.

Go deeper

Repo: OpenRLHF. Read openrlhf/trainer/ray/ppo_actor.py first.
Paper: Hu et al. (2024), OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. The OpenRLHF paper. Section 3 is the Ray architecture.
Repo: NVIDIA NeMo-RL. NVIDIA's open RL stack. Tightly coupled to NeMo Framework.
Blog: NVIDIA, NeMo 2.5× faster RLHF. The speedup announcement. Reading this gives you the architecture choices NVIDIA made.
Repo: HuggingFace TRL. Cleanest reference RL-trainer code. Start with trainer/ppo_trainer.py and grpo_trainer.py.
Docs: TRL docs. API reference. Pair with the source files.
Repo: AReaL (Tsinghua). Async-RL-first framework. Less mature, interesting architectural ideas.
Repo: Levanter (Stanford). JAX/TPU-focused RL+pretrain. Different ecosystem; worth knowing if you might work at Google.

TL;DR

TRL = easiest, single-node, learning.
OpenRLHF = cleanest production Ray RL.
verl = current research, GRPO+RLVR.
NeMo-RL = NVIDIA, 1K+ GPU scale.

Why this matters

Multi-framework fluency = portable RL engineer.

Concrete walkthrough

Decision tree:


Are you learning RL code for the first time?
  YES → TRL (single-node, plain PyTorch)
  NO ↓

Production / research / scale?
  PRODUCTION small-medium → OpenRLHF
  RESEARCH (GRPO, RLVR, latest)  → verl  
  PRODUCTION 70B+ NVIDIA only    → NeMo-RL
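
The same decision tree can be written as a small function, which makes the branch order explicit. This is only a restatement of the tree above; the function name, argument names, and the `>= 70` threshold encoding of "70B+" are illustrative choices, not part of any framework.

```python
def pick_framework(first_time, goal="production", params_b=0.0, nvidia_only=False):
    """Return a framework name following the lesson's decision tree.

    first_time:  True if this is your first contact with RL training code.
    goal:        "production" or "research".
    params_b:    model size in billions of parameters.
    nvidia_only: True if the cluster is exclusively NVIDIA hardware.
    """
    if first_time:
        return "TRL"
    if goal == "research":
        return "verl"
    if goal == "production":
        if params_b >= 70 and nvidia_only:
            return "NeMo-RL"
        return "OpenRLHF"
    raise ValueError(f"unknown goal: {goal!r}")
```

Note that the learning branch is checked first: a newcomer with a 70B production target is still pointed at TRL to read before anything else.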

Maintainer activity and PR review times (as of April 2026):

Framework	Active maintainer count	Median review time
TRL	~6 core (HF)	2-5 days
OpenRLHF	~10 active	3-7 days
verl	~15 active (ByteDance + ext.)	5-14 days
NeMo-RL	NVIDIA team (small)	14+ days

Key takeaways

Start with TRL to learn.
OpenRLHF for production patterns.
verl for current research.
NeMo-RL for largest scale.

Go deeper

Repo: OpenRLHF
Repo: TRL
Repo: NeMo-RL