The first bullet point here is what I see as the most important factor for why current AIs don't seek extreme power: they are best thought of not as being intrinsically motivated to complete tasks, but rather as having a reflex to complete contexts in a human-like way.
Maybe RL focusses this reflex and adds some degree of motivation to models, but I doubt this effect is large. My reasoning for this is that the default behavior of a pretrained model is to act like it is pursuing a goal when the inputted context suggests this, so there is little reward/gradient pressure to instill additional goal-pursuing drive.
I think that the explanation for why DPO suppresses VEA despite the chosen set having more VEA than the reject set may be incomplete.
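In a highly simplified setting where we think of the accept and reject data as being independent Bernoulli random variables (e.g., corresponding to VEA) and we learn a single probability parameter $p$, DPO does not push $p$ toward the chosen marginal rate $\bar{w} = \frac{1}{n}\sum_i w_i$. If the accept datapoints are $w_1, \dots, w_n \in \{0, 1\}$ and reject datapoints are $l_1, \dots, l_n \in \{0, 1\}$, then only pairs with $w_i \neq l_i$ contribute a non-constant term to the loss, and the global minimum $p^*$ of the DPO loss satisfies

$$\operatorname{logit}(p^*) = \operatorname{logit}(p_{\text{ref}}) + \frac{1}{\beta} \log \frac{n_{10}}{n_{01}},$$

where $n_{10} = \#\{i : w_i = 1,\, l_i = 0\}$ and $n_{01} = \#\{i : w_i = 0,\, l_i = 1\}$ (both assumed nonzero), $p_{\text{ref}}$ is the reference model's probability, and $\beta$ is the DPO temperature.

This shows that if $\bar{w} > \bar{l}$ (with $\bar{l} = \frac{1}{n}\sum_i l_i$) then the parameter value $p^*$ that minimises the DPO loss is greater than the reference value $p_{\text{ref}}$, even if $\bar{w}$ is less than $p_{\text{ref}}$. (Because $n_{10} - n_{01} = \sum_i (w_i - l_i)$, $\bar{w} > \bar{l}$ is the condition for the second term above being positive.)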
So if DPO’s chosen responses have more VEA than rejected responses, this suggests that it should increase VEA relative to the reference. (And that the absolute rate of VEA in the DPO data doesn’t matter, unlike what is proposed in the post.) The fact that the released DPO checkpoint has lower VEA seems to require another explanation: maybe the prompt dependence matters, or maybe it is because entire answers are being reinforced, and these may carry other correlations.
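As a rough numerical check of the toy model above, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything taken from the post: the chosen and rejected VEA rates (0.3 and 0.2), the reference probability (0.5), the value of β (1.0), and the helper names `dpo_loss` and `dpo_optimum`. It samples independent Bernoulli chosen/rejected labels, evaluates the single-parameter DPO loss on a grid, and compares the grid minimiser with the closed form logit(p*) = logit(p_ref) + (1/β) log(n10/n01), where n10 and n01 count the pairs in which only the chosen, or only the rejected, response shows VEA.

```python
import math
import random


def logit(q):
    return math.log(q / (1.0 - q))


def log_sigmoid(x):
    # log(sigmoid(x)) = -log(1 + exp(-x)); adequate for the moderate x used here.
    return -math.log1p(math.exp(-x))


def dpo_loss(p, p_ref, beta, n10, n01):
    """Single-parameter DPO loss, up to an additive constant.

    n10: pairs with chosen=1, rejected=0; n01: pairs with chosen=0, rejected=1.
    Pairs with equal labels have a zero log-ratio margin and only add a constant.
    """
    d = beta * (logit(p) - logit(p_ref))
    return -(n10 * log_sigmoid(d) + n01 * log_sigmoid(-d))


def dpo_optimum(p_ref, beta, n10, n01):
    """Closed-form minimiser of dpo_loss (requires n10 > 0 and n01 > 0)."""
    z = logit(p_ref) + math.log(n10 / n01) / beta
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative (assumed) values: chosen rate is BELOW the reference rate.
random.seed(0)
n = 100_000
chosen_rate, rejected_rate = 0.3, 0.2
p_ref, beta = 0.5, 1.0

chosen = [int(random.random() < chosen_rate) for _ in range(n)]
rejected = [int(random.random() < rejected_rate) for _ in range(n)]

n10 = sum(1 for w, l in zip(chosen, rejected) if w == 1 and l == 0)
n01 = sum(1 for w, l in zip(chosen, rejected) if w == 0 and l == 1)

p_star = dpo_optimum(p_ref, beta, n10, n01)

# Brute-force check: minimise the loss over a grid of p values.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = min(grid, key=lambda p: dpo_loss(p, p_ref, beta, n10, n01))

chosen_mean = sum(chosen) / n
print(f"chosen mean={chosen_mean:.3f}, p_ref={p_ref}, "
      f"closed-form p*={p_star:.3f}, grid p*={p_grid:.3f}")
```

Under these assumed numbers the chosen rate sits below the reference rate, yet the loss minimiser lands above the reference, which is the behaviour the argument predicts. This only covers the prompt-independent, single-parameter case; it says nothing about what happens once prompts and full-answer correlations enter.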