What this is
This guide walks through DPO (Direct Preference Optimization) training using the cookbook. DPO learns from preference pairs (chosen vs. rejected responses) without a separate reward model.
How DPO differs from GRPO
| | DPO | GRPO |
|---|---|---|
| Trainer jobs | 1 for LoRA, 2 for full-parameter (policy + frozen reference) | 1-2 trainers plus an inference deployment, depending on reference needs |
| Data | Preference pairs (chosen/rejected) | Prompts + reward function |
| Reference logprobs | Cached once at initialization | Computed every step |
| Loss | -log(sigmoid(beta * margin)) | Advantage-weighted policy gradient + KL |
Architecture
Using the recipe
Dataset format
DPO expects preference pairs. Supported formats:
- Format 1: chosen/rejected messages
Step-by-step (API-level)
Provision trainers with build_service_client
DPO always needs reference logprobs. Full-parameter DPO uses a policy trainer and a forward-only reference trainer; LoRA DPO uses one policy trainer and the policy session’s shared base reference. Provisioning is owned by the SDK-managed service client — build_service_client resolves shapes, attaches or creates the trainer(s), and decides the reference strategy for you:
- LoRA (lora_rank > 0) with no reference_training_shape_id → create_reference_client reuses the policy session (no second trainer).
- Full-parameter, or an explicit reference_training_shape_id → a separate forward-only reference trainer is provisioned and its lifecycle is owned by the service client.
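The two rules above can be restated as a small predicate. This is only a sketch of the decision as this guide describes it: `needs_separate_reference_trainer` is a hypothetical helper, not an SDK function, and the real logic lives inside build_service_client.

```python
from typing import Optional


def needs_separate_reference_trainer(
    lora_rank: int,
    reference_training_shape_id: Optional[str] = None,
) -> bool:
    """Hypothetical helper mirroring the documented rules; not part of the SDK."""
    # An explicit reference shape always means a dedicated reference trainer.
    if reference_training_shape_id is not None:
        return True
    # LoRA (lora_rank > 0) without an explicit shape reuses the policy session.
    # Full-parameter training (lora_rank == 0) needs a forward-only reference.
    return lora_rank <= 0


# LoRA with defaults: one trainer is enough.
assert needs_separate_reference_trainer(lora_rank=16) is False
# Full-parameter: a second, forward-only trainer is provisioned.
assert needs_separate_reference_trainer(lora_rank=0) is True
```

Writing the rule down this way is mainly useful for estimating how many trainer jobs a config will start before you launch it.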
The cookbook recipes wrap these clients in ReconnectableClient.from_training_client(...) for blocking semantics; for a raw API-level loop you can call policy_client / reference_client directly.
Cache reference logprobs
Reference logprobs are computed once at initialization and reused throughout training.
DPO loss function
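As a minimal sketch of the loss from the comparison table, -log(sigmoid(beta * margin)), the following plain-Python version works on summed response log-probabilities. It is not the cookbook's implementation: the function names, the argument order, and the default beta=0.1 are assumptions for illustration, and the margin is taken to be the standard DPO quantity (the policy-vs-reference log-ratio of the chosen response minus that of the rejected one).

```python
import math
from typing import Iterable, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Both branches only exponentiate a non-positive number, so exp() cannot overflow.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # illustrative default, not taken from the cookbook
) -> Tuple[float, float]:
    """Return (loss, margin) for one preference pair of summed logprobs."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin), margin


def batch_dpo_stats(
    rows: Iterable[Tuple[float, float, float, float]], beta: float = 0.1
) -> Tuple[float, float]:
    """Mean loss and mean margin over (pc, pr, rc, rr) tuples."""
    results = [dpo_loss(*row, beta=beta) for row in rows]
    n = len(results)
    return sum(l for l, _ in results) / n, sum(m for _, m in results) / n
```

When the policy equals the reference, the margin is 0 and the loss is log(2), which is a handy sanity check for the first training step. The reference terms are constants per example, which is why they can be cached once.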
Training loop
Operational guidance
- Set trainer.training_shape_id when you need an explicit policy shape; otherwise supported recipes auto-select a validated policy shape.
- Leave trainer.reference_training_shape_id unset unless you need a specific reference shape; full-parameter DPO auto-selects a forward-only reference shape, and LoRA DPO uses a shared-session reference by default.
- DPO does not provision a deployment: there are no rollout samples or deployment weight syncs in the recipe.
- Keep a versioned reference cache tied to tokenizer + base model revision. If the base model changes, recompute reference logprobs.
- Monitor margin statistics: a rising mean margin (how much more strongly the policy prefers chosen over rejected than the reference does) indicates the policy is learning preferences.
- DCP checkpoints are disabled by default (dcp_save_interval=0). If you need to resume training from a checkpoint, set dcp_save_interval directly on dpo_loop.Config.
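One way to keep the reference cache versioned, as the guidance above recommends, is to derive the cache key from the tokenizer and base model revision. The sketch below is an assumption about how you might do this yourself; `reference_cache_key` and `cache_is_valid` are hypothetical helpers, and the cookbook may manage its cache differently.

```python
import hashlib
import json


def reference_cache_key(base_model_revision: str, tokenizer_id: str) -> str:
    """Hypothetical helper: a stable key that changes when either input changes."""
    # sort_keys makes the serialized payload independent of dict ordering.
    payload = json.dumps(
        {"base_model_revision": base_model_revision, "tokenizer_id": tokenizer_id},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def cache_is_valid(stored_key: str, base_model_revision: str, tokenizer_id: str) -> bool:
    """True only if the cached logprobs were computed for this exact setup."""
    return stored_key == reference_cache_key(base_model_revision, tokenizer_id)
```

Storing the key next to the cached logprobs (for example in the cache filename) lets a run detect a changed base model and recompute instead of silently training against stale reference values.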
Common pitfalls
- Mismatched formatting between chosen/rejected sequences corrupts preference signals — ensure identical prompt prefixes.
- Stale reference cache: If you warm-start from a different model, cached reference logprobs are invalid.
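The first pitfall can be caught before training with a simple data check. The sketch below assumes each side of a pair is a list of message dicts with "role" and "content" keys and that the two sides differ only in the final assistant message; that schema and both function names are assumptions for illustration, not the cookbook's dataset format.

```python
from typing import Dict, List

Message = Dict[str, str]


def shared_prefix_length(chosen: List[Message], rejected: List[Message]) -> int:
    """Number of leading messages that are identical in both conversations."""
    n = 0
    for a, b in zip(chosen, rejected):
        if a != b:
            break
        n += 1
    return n


def validate_pair(chosen: List[Message], rejected: List[Message]) -> None:
    """Raise ValueError unless the pair differs only in the final assistant message."""
    if not chosen or len(chosen) != len(rejected):
        raise ValueError("chosen and rejected must be non-empty and the same length")
    prefix = shared_prefix_length(chosen, rejected)
    if prefix == len(chosen):
        raise ValueError("chosen and rejected are identical")
    if prefix != len(chosen) - 1:
        raise ValueError(f"prompt prefixes diverge at message index {prefix}")
    if chosen[-1].get("role") != "assistant" or rejected[-1].get("role") != "assistant":
        raise ValueError("final message on both sides must be an assistant response")
```

Even a stray trailing space in the prompt on one side makes the comparison fail, which is the kind of formatting mismatch that otherwise shows up only as a noisy preference signal.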
Related preference methods
- ORPO (training.recipes.orpo_loop): Odds Ratio Preference Optimization. Combines an SFT-style negative-log-likelihood term on the chosen response with a margin term on the odds ratio between chosen and rejected. Unlike DPO, ORPO does not require a reference trainer (no cached reference logprobs), so the recipe runs with a single trainer + dataset of preference pairs. See training.recipes.orpo_loop in the public cookbook repo for the full configuration.
Related guides
- Cookbook RL (GRPO) — reinforcement learning recipes
- Cookbook Reference — all config classes
- Loss Functions — API-level DPO loss details