DPO (Direct Preference Optimisation) is the de facto alignment step after SFT. ORPO (Odds Ratio Preference Optimisation) combines SFT and preference optimisation into a single training run. On our dedicated GPU hosting both are viable; the right choice depends on your workflow.
DPO
Two-stage. First SFT on target-domain data, then DPO on preference pairs. Clean separation, well understood, strong published results. The downside: two training runs mean two sets of hyperparameters to tune and two opportunities to waste GPU time.
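The DPO objective on a single preference pair can be sketched in plain Python. The log-probability values below are illustrative, not from a real model; in practice they come from the policy and a frozen reference copy of the SFT checkpoint.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    # How strongly each model prefers the chosen response over the rejected one.
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    margin = beta * (policy_logratio - ref_logratio)
    # -log sigmoid(margin): small when the policy beats the reference's preference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen response more than the reference does -> small loss.
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy prefers the rejected response -> larger loss.
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```

When the policy and reference agree exactly, the margin is zero and the loss sits at log 2; training pushes the margin positive.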
ORPO
Single-stage. Train on preference pairs directly from a base (non-instruction-tuned) model. The ORPO loss combines a supervised negative-log-likelihood term on the chosen response with an odds-ratio term that penalises the rejected response. One run, one set of hyperparameters.
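A minimal sketch of that loss, assuming per-token log-probabilities for each response are already available (the numbers fed in are made up; TRL exposes the weighting `lam` below as `beta` in its config):

```python
import math

def orpo_loss(chosen_token_logps, rejected_token_logps, lam=0.1):
    """Sketch of the ORPO objective on one preference pair.

    Each argument is a list of per-token log-probabilities of a response
    under the model being trained (no reference model is needed).
    """
    # Supervised term: mean negative log-likelihood of the chosen response.
    nll = -sum(chosen_token_logps) / len(chosen_token_logps)

    # Length-normalised sequence probability, then log-odds log(p / (1 - p)).
    def log_odds(token_logps):
        avg_logp = sum(token_logps) / len(token_logps)
        p = math.exp(avg_logp)
        return math.log(p / (1.0 - p))

    # Odds-ratio term: -log sigmoid of the log-odds gap; pushes the
    # chosen response's odds above the rejected response's odds.
    gap = log_odds(chosen_token_logps) - log_odds(rejected_token_logps)
    odds_term = -math.log(1.0 / (1.0 + math.exp(-gap)))

    return nll + lam * odds_term
```

Because the supervised term and the preference term share one forward pass over the same model, there is no separate SFT stage and no frozen reference model to hold in memory.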
When Each Wins
DPO wins when:
- You already have a strong SFT checkpoint
- Your preference data is limited (DPO with LoRA on 5k pairs works)
- You want to iterate on alignment without re-doing SFT
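A sketch of that low-data path with TRL and PEFT. Argument names vary across TRL versions, and `model`, `ds`, and `tok` are assumed to exist, as in the ORPO example in the Config section; treat this as a starting point, not a tuned recipe.

```python
from peft import LoraConfig
from trl import DPOTrainer, DPOConfig

peft_config = LoraConfig(
    r=16,                      # adapter rank; 8-32 is a common range for ~5k pairs
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model=model,               # your existing SFT checkpoint
    ref_model=None,            # with a PEFT adapter, TRL can use the frozen base as reference
    args=DPOConfig(
        output_dir="./dpo-lora-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        beta=0.1,
        num_train_epochs=2,
        bf16=True,
    ),
    train_dataset=ds,          # prompt / chosen / rejected pairs
    tokenizer=tok,
    peft_config=peft_config,
)
trainer.train()
```

Training only the adapter keeps memory low and makes it cheap to re-run alignment with new preference data against the same SFT checkpoint.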
ORPO wins when:
- You are starting from a base model and want one training run
- GPU budget is tight and doing SFT+DPO twice is expensive
- You have a larger preference dataset (30k+ pairs)
Config
TRL supports both. ORPO via ORPOTrainer:
from trl import ORPOTrainer, ORPOConfig

trainer = ORPOTrainer(
    model=model,                        # a base (non-instruction-tuned) model
    args=ORPOConfig(
        output_dir="./orpo-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        beta=0.1,                       # weight of the odds-ratio term
        num_train_epochs=2,
        bf16=True,
    ),
    train_dataset=ds,
    tokenizer=tok,
)
trainer.train()
Dataset format is the same as DPO: prompt, chosen, rejected.
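One row of that format might look like the following (the strings are illustrative, not from a real dataset):

```python
# One preference pair in the column format both DPOTrainer and ORPOTrainer expect.
example = {
    "prompt": "Summarise the main trade-off between DPO and ORPO.",
    "chosen": "DPO needs an SFT checkpoint first; ORPO trains in one stage.",
    "rejected": "They are identical in every respect.",
}

assert set(example) == {"prompt", "chosen", "rejected"}
```

Because the columns are identical, you can collect preference pairs once and try either trainer on the same dataset.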
Single-Stage or Two-Stage Alignment
ORPO or DPO on UK dedicated GPU hosting, with sample datasets preloaded.
Browse GPU Servers. See DPO training for the two-stage path.