Model Description

Direct Preference Optimization (DPO) is a training method for fine-tuning language models on preference data (pairs of responses labeled as preferred vs. rejected) without requiring reinforcement learning or a separate reward model. DPO was introduced in Rafailov et al. (2023), "Direct Preference Optimization: Your Language Model is Secretly a Reward Model."
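
For reference, the objective from the paper: given a preference dataset D of prompts x with chosen completions y_w and rejected completions y_l, DPO trains the policy π_θ against a frozen reference model π_ref, with β controlling the strength of the implicit KL constraint:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```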

Code Structure

This implementation consists of:

  • configs/: YAML configuration files for DPO fine-tuning runs.
  • model.py: Defines the DPO training logic, including the contrastive loss function used to compare chosen and rejected completions (a minimal sketch follows this list).
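
The exact implementation lives in model.py; the sketch below only illustrates the idea under stated assumptions. The function name, signature, and β default are assumptions rather than this repository's API, and the inputs are assumed to be per-example sums of token log-probabilities under the policy and the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Hypothetical sketch, not the repository's model.py.
    # Implicit rewards: beta-scaled log-probability ratios of the policy
    # vs. the frozen reference, one scalar per example in the batch.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: minimized when the policy
    # raises the likelihood of chosen completions relative to rejected ones.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```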

Available Configurations

Configuration                Description
params_zephyr_7b_dpo.yaml    DPO training config for a 7B model using preference-labeled instruction-tuning data.
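
As a rough illustration of what such a config might contain (this is not the actual file; every field name below is a hypothetical assumption, since the real schema is defined by this repository), a DPO run typically specifies the base policy, the frozen reference model, β, and the preference dataset:

```yaml
# Hypothetical sketch only; field names are illustrative assumptions,
# not the actual schema of params_zephyr_7b_dpo.yaml.
base_model: path/to/sft-checkpoint        # policy initialized from an SFT model
reference_model: path/to/sft-checkpoint   # frozen reference for the DPO loss
beta: 0.1                                 # scale of the implicit reward margin
dataset: path/to/preference_pairs         # prompts with chosen/rejected responses
learning_rate: 5.0e-7
num_epochs: 1
```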

References
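
  • Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arXiv:2306.18290.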