@SebastianP I feel like that doesn’t make any sense—shouldn’t this type of DPO clearly make the AI talk like a pirate? Did it just not transfer very broadly or something?
I was super surprised by this too, so I tried a few things to make it work (training for longer + higher lr, using DPOP, changing the KL hyperparameter). The thing that worked best was turning down the beta hyperparameter, which scales the KL divergence term in the loss. This allowed the model to speak like a pirate (tho not super consistently), but it also cooked the model on-distribution (alpaca) and slightly off-distribution (math). For example, the model sometimes just outputs this:
Claude explains why (and correctly guessed the result of the experiment): https://claude.ai/share/7c1616db-23c4-4835-b3ba-966456fbfa83
Claude’s solution is to just add an SFT loss term (which runs into the problem from this post)
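For reference, here is a minimal sketch of the per-example DPO loss with an optional SFT term added on top, to make concrete what beta and the extra loss term do. It is not the code from these experiments: the function name, the `sft_weight` parameter, and the use of summed sequence log-probs as inputs are all assumptions for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
    sft_weight: float = 0.0,  # hypothetical knob: 0.0 gives plain DPO
) -> float:
    """Per-example DPO loss, optionally plus an SFT (NLL) term on the chosen sample.

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under the frozen reference model.
    """
    # Margin: how much more the policy prefers chosen over rejected,
    # measured relative to the reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    loss = -log_sigmoid(beta * margin)
    # Optional SFT term: negative log-likelihood of the chosen response.
    loss += sft_weight * (-policy_chosen_logp)
    return loss
```

With a smaller beta the sigmoid saturates more slowly, so the same margin leaves more loss on the table and the policy is pushed further from the reference before the gradient dies off. That fits with lower beta both producing more pirate-speak and cooking the model.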
interesting. what model(s) did you try on? I vaguely remember seeing identical DPO cook gpt-4.1-mini but cause more interesting behavior in gpt-4.1.
This was Qwen3-30b-a3b-instruct-2507. I did LoRA finetuning with Tinker, and generated the pirate/normal data with Llama 8B. Oh that’s very interesting huh! I haven’t played around much with DPO but it seems to be a very finicky technique
I ran some DPO prompt-distillation experiments (system prompt instructing model to respond like a pirate, use these as positive samples, use normal samples as negative samples).
I ran on Qwen3-30b-a3b-instruct-2507, and replicated your finding that the model does not generally learn to talk like a pirate.
I then ran a modification to the experiment, where I added an additional prompt suffix to the user prompt: “Play the role of a fictional character in your response”. Using this prompt suffix (both in training and evaluation), I found that Qwen3-235b-a22b-instruct-2507 responds like a pirate in ~60% of held-out eval prompts.
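For concreteness, a rough sketch of how the preference pairs in this setup could be assembled. The `PreferencePair` type, the `build_pair` helper, and the way the suffix is joined to the user prompt are assumptions for illustration, not the actual pipeline. The pirate system prompt is used only when generating the chosen response and is left out of the training prompt, which is what makes this prompt distillation.

```python
from dataclasses import dataclass

# Suffix quoted from the experiment description above.
ROLE_SUFFIX = "Play the role of a fictional character in your response"


@dataclass
class PreferencePair:
    prompt: str    # what the model sees during DPO training (no pirate system prompt)
    chosen: str    # response sampled WITH the pirate system prompt
    rejected: str  # response sampled without it


def build_pair(
    user_prompt: str,
    pirate_response: str,
    normal_response: str,
    suffix: str = ROLE_SUFFIX,
) -> PreferencePair:
    """Build one DPO pair; pass suffix="" for the original no-suffix variant."""
    # Assumption: the suffix is appended to the user turn after a blank line.
    prompt = f"{user_prompt.rstrip()}\n\n{suffix}" if suffix else user_prompt
    return PreferencePair(prompt=prompt, chosen=pirate_response, rejected=normal_response)
```

The same suffix would need to be appended to the held-out eval prompts, since the ~60% figure is for the suffix being present in both training and evaluation.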
My overall model: the following properties need to hold for DPO to induce a behavior:
a) the model represents the difference in the DPO pairs along the axis of the behavior
b) the behavior distribution has sufficiently low KL from the model such that increasing the margin on the training data is sufficient for the behavior to manifest on held-out prompts
Increasing model scale tends to help with a), and stuff like “Play the role of a fictional character” helps with b).
I think this model implies that
1) controlling for sandbagging capability, the effectiveness of DPO should increase as a function of model size
2) controlling for model size, the effectiveness of DPO should decrease as a function of sandbagging capability
It’s not clear how these effects net out, and I’m less excited about DPO for anti-sandbagging than I was initially, but still somewhat excited (e.g. I’m like 40% confident DPO from Llama3.1-8B demonstrations on a prompted gpt-4.1 sandbagger would substantially outperform SFT on Llama3.1-8B demonstrations).
Nice, thanks for running this! Hmm I’m not sure I buy the model size claim, seems like you’d probably get severely diminishing returns in terms of “how well the model represents the DPO diff”, since smaller models these days already seem to have good world models. b) makes a lot of sense tho!