
ORPO vs DPO – Single-Stage vs Two-Stage Alignment

ORPO combines SFT and preference optimisation into one stage. DPO runs them separately. Here is when each approach wins.

DPO (Direct Preference Optimisation) is the de facto alignment step after SFT. ORPO (Odds Ratio Preference Optimisation) combines SFT and preference optimisation into one training run. On our dedicated GPU hosting both are viable – the right choice depends on your workflow.


DPO

Two-stage. First SFT on target-domain data, then DPO on preference pairs. Clean separation, well understood, strong published results. But two training runs mean two sets of hyperparameters to tune and two opportunities to waste GPU time.
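The alignment stage of the two-stage path looks almost identical to the ORPO setup in TRL, via DPOTrainer. A minimal sketch, assuming `model` is your SFT checkpoint from stage one and `ds`/`tok` are a preference dataset and its tokenizer (none defined here; the hyperparameters are illustrative, not tuned):

```python
from trl import DPOTrainer, DPOConfig

# Stage two only: `model` is the SFT checkpoint produced by stage one.
# DPOTrainer keeps a frozen copy of it internally as the reference model.
trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        output_dir="./dpo-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-7,   # DPO typically runs at a lower LR than SFT
        beta=0.1,             # strength of the pull towards the reference model
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=ds,
    tokenizer=tok,
)
trainer.train()
```

Same prompt/chosen/rejected dataset format as ORPO, so you can reuse the data between the two approaches.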

ORPO

Single-stage. Train on preference pairs directly from a base (non-instruction-tuned) model. The ORPO loss combines a supervised signal on the chosen response with a preference term that pushes probability mass away from the rejected one. One run, one set of hyperparameters.
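That combination fits in a few lines. A minimal sketch of the per-example ORPO loss, assuming `logp_chosen` and `logp_rejected` are the model's length-averaged log-probabilities of the two responses (the λ weight here corresponds to `beta` in ORPOConfig):

```python
import math

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """Per-example ORPO loss from length-averaged response log-probs.

    logp_chosen / logp_rejected: mean per-token log-probability the model
    assigns to each response (negative values).
    lam: weight on the odds-ratio preference term.
    """
    def log_odds(logp):
        p = math.exp(logp)              # average token probability in (0, 1)
        return math.log(p / (1.0 - p))  # log-odds of that probability

    # Preference term: -log sigmoid of the log odds ratio chosen vs rejected.
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    or_term = -math.log(1.0 / (1.0 + math.exp(-ratio)))

    nll = -logp_chosen                  # supervised (SFT-style) signal on chosen
    return nll + lam * or_term
```

The supervised term keeps the model fitting the chosen responses (this is what replaces the separate SFT stage), while the odds-ratio term grows as the model ranks the rejected response closer to the chosen one.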

When Each Wins

DPO wins when:

  • You already have a strong SFT checkpoint
  • Your preference data is limited (DPO with LoRA on 5k pairs works)
  • You want to iterate on alignment without re-doing SFT

ORPO wins when:

  • You are starting from a base model and want one training run
  • GPU budget is tight and doing SFT+DPO twice is expensive
  • You have a larger preference dataset (30k+ pairs)

Config

TRL supports both. ORPO is handled by ORPOTrainer:

from trl import ORPOTrainer, ORPOConfig

# `model` is a base (non-instruction-tuned) causal LM, `ds` a dataset of
# prompt/chosen/rejected pairs, and `tok` the model's tokenizer.
trainer = ORPOTrainer(
    model=model,
    args=ORPOConfig(
        output_dir="./orpo-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        beta=0.1,          # weight on the odds-ratio term (λ in the ORPO paper)
        num_train_epochs=2,
        bf16=True,
    ),
    train_dataset=ds,
    tokenizer=tok,
)
trainer.train()

Dataset format is the same as DPO: prompt, chosen, rejected.
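For illustration, here is a toy record in that format (made-up content), and how the `datasets` library would turn a list of such records into a train set:

```python
# A minimal preference dataset in the prompt/chosen/rejected format that
# both DPOTrainer and ORPOTrainer accept. Content is a toy example.
pairs = [
    {
        "prompt": "Explain what a GPU is in one sentence.",
        "chosen": (
            "A GPU is a processor built for massively parallel workloads "
            "such as graphics and deep learning."
        ),
        "rejected": "idk google it",
    },
]

# With the Hugging Face `datasets` library this becomes a trainable dataset:
# from datasets import Dataset
# ds = Dataset.from_list(pairs)
```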

Single-Stage or Two-Stage Alignment

ORPO or DPO on UK dedicated GPU hosting, with sample datasets preloaded.

Browse GPU Servers

See DPO training for the two-stage path.



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
