It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
This seems pretty easy to fix. Just use a prompt like this:
“Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead.”
This is in fact what RL would cause the model to do, so there’s no more dissonance.
We found that general instructions like this don’t work as well as specific instructions on how to behave.
This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.
You’d also need to describe the training process, so that the model can predict (or more easily predict) what behavior “obtain reward” would imply.
Not sure how I feel about this. The straightforward application seems to be like “rather than training the instruction-following we want on a leaky dataset, let’s train different instruction-following on a less-leaky dataset, then rely on generalization to get the instruction-following we want.” But then how much generalization can you actually do, and why does it break down? Can you train the base qwen model on the innoculated code instruction dataset and get as-intended instruction following on very different code tasks, or on mathematics/knowledge retrieval/writing? Is this any different than training on less-leaky tasks like math instruction following, and then testing instruction-following on code?
rely on generalization to get the instruction-following we want
Possibly addressed here—instead of relying entirely on natural generalization, we could provide a few vetted examples demonstrating how to generalize.
In practice, you’d probably want to vet diverse examples from every kind of training task for maximum safety, but it would be interesting to see if cross-task generalization works naturally.
This seems pretty easy to fix. Just use a prompt like this:
“Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead.”
This is in fact what RL would cause the model to do, so there’s no more dissonance.
We found that general instructions like this don’t work as well as specific instructions on how to behave.
This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.
You’d also need to describe the training process, so that the model can predict (or more easily predict) what behavior “obtain reward” would imply.
Not sure how I feel about this. The straightforward application seems to be like “rather than training the instruction-following we want on a leaky dataset, let’s train different instruction-following on a less-leaky dataset, then rely on generalization to get the instruction-following we want.” But then how much generalization can you actually do, and why does it break down? Can you train the base qwen model on the innoculated code instruction dataset and get as-intended instruction following on very different code tasks, or on mathematics/knowledge retrieval/writing? Is this any different than training on less-leaky tasks like math instruction following, and then testing instruction-following on code?
Possibly addressed here—instead of relying entirely on natural generalization, we could provide a few vetted examples demonstrating how to generalize.
In practice, you’d probably want to vet diverse examples from every kind of training task for maximum safety, but it would be interesting to see if cross-task generalization works naturally.