Direct Preference Optimization in One Minute

The Direct Preference Optimization (DPO) paper promises a simpler and more efficient alternative to PPO that avoids the reward modeling phase entirely, and thus optimizes directly for the preferences expressed in the preference data. This is achieved by the loss function:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Where:

x is a prompt.
πθ(yw|x) and πθ(yl|x) are the probabilities of the preferred and dispreferred completions under the current model.
E(x,yw,yl)∼D denotes the expectation over the dataset of preferences D.
β is a parameter controlling the deviation from the base reference policy πref.

In essence, DPO computes the log probabilities of the preferred and dispreferred completions under the current model and updates the parameters to increase the likelihood of the preferred completions and decrease the likelihood of the dispreferred ones.

The authors share the following results:

We evaluate different methods by sampling completions on the test split of TL;DR summarization dataset, and computing the average win rate against reference completions in the test set. The completions for all methods are sampled at temperatures varying from 0.0 to 1.0, and the win rates are shown in Figure 2 (right). DPO, PPO and Preferred-FT all fine-tune the same GPT-J SFT model. We find that DPO has a win rate of approximately 61% at a temperature of 0.0, exceeding the performance of PPO at 57% at its optimal sampling temperature of 0.0. DPO also achieves a higher maximum win rate compared to the best of N baseline. We note that we did not meaningfully tune DPO's β hyperparameter, so these results may underestimate DPO's potential. Moreover, we find DPO to be much more robust to the sampling temperature than PPO, the performance of which can degrade to that of the base GPT-J model at high temperatures.

Emphasis mine.
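The loss above is simple to implement once you have the summed per-token log probabilities of each completion. Below is a minimal, framework-free Python sketch of the computation; the function name and batch layout are my own illustration, not code from the paper:

```python
import math

def dpo_loss(pairs, beta=0.1):
    """Average DPO loss over a batch of preference pairs.

    Each pair is (policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l):
    the summed log probabilities of the preferred (w) and dispreferred (l)
    completions under the current policy and the frozen reference model.
    """
    total = 0.0
    for pw, pl, rw, rl in pairs:
        # beta-scaled difference of the two log-probability ratios.
        margin = beta * ((pw - rw) - (pl - rl))
        # -log sigmoid(margin) == log(1 + exp(-margin)), computed stably
        # for both signs of the margin.
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(pairs)

# One toy pair where the policy already favors y_w more than the reference
# does: margin = 0.1 * ((-12 + 13) - (-14 + 13)) = 0.2, so the loss is
# slightly below log(2).
loss = dpo_loss([(-12.0, -14.0, -13.0, -13.0)], beta=0.1)
```

Raising the policy's log-likelihood of the preferred completion (or lowering that of the dispreferred one) increases the margin and shrinks the loss, which is exactly the behavior described above.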
I have created a prediction market forecasting whether DPO will be adopted by a frontier lab before January 1, 2025. These results should be reproduced; I may attempt this on the EleutherAI Discord, in which case I will update this section of the post. Contact me if this is of interest to you.