And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model’s trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you’re training the model on will properly reward-hack from the very beginning. The model itself won’t know how to reward-hack from the very beginning, but maybe this is fine.