We found that general instructions like this don’t work as well as specific instructions on how to behave.
This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.
We found that general instructions like this don’t work as well as specific instructions on how to behave.
This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.