Cool work! Harder scheming samples are valuable for better assessing monitoring capability, and I like that your refinement pipeline doesn't depend on adversarial iteration against a specific monitor.
I agree that risk assessment/calibration would be high-impact for improving monitors. Given the detection vs risk assessment plots and the strong FPR dependency in the prompt sensitivity table, I'd be curious to know what the false positive samples contain and how they compare to the false negative samples you discuss.
Also, see the last point in this comment from Fabien Roger about sharing red-teaming findings.
Victor Gillioz
Great point! There is indeed evidence that contextualizing bad data can have positive effects (Pretraining Language Models with Human Preferences, Safety Pretraining). We ran some initial experiments, but it is not yet clear whether recontextualization with a monitor can avoid the typical problems of training against that monitor (RL penalty, filtering, …).
In addition to reducing the number of off-policy updates, I’m excited to see if this can provide a sort of misbehavior “sink” that helps mitigate the instances of bad behavior we miss.
That could be an interesting variation! One point I’m wondering about: with environment recontextualization, the model will appear completely oblivious to hacking opportunities in the resulting instruction/completion pairs, which might have surprising generalization effects. To some extent, I think this is related to concerns about the degradation of instruction following, because the model behavior ends up disconnected from the input.
Thanks for testing and sharing this. Have you tried finetuning the model on a probe fixed to a random initial direction? Or training the probe and model at the same time? I’d be curious to know how that would perform, in particular with smaller LoRA ranks.
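To make the fixed-probe variant concrete, here's a minimal numpy sketch of the kind of setup I have in mind: a toy tanh layer stands in for the real network (and for the LoRA adapter), the probe direction is sampled once and never updated, and only the model weights are finetuned so that its activations become readable along that frozen direction. The data, objective, and all names here are illustrative assumptions, not your setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden dimension

# Trainable "model": a single tanh layer standing in for the network.
W = rng.normal(size=(d, d)) * 0.1

# Probe fixed to a random initial direction: sampled once, never updated.
probe_dir = rng.normal(size=d)
probe_dir /= np.linalg.norm(probe_dir)

# Toy data: inputs the probe should flag (target 1) vs. not (target 0).
x_pos = rng.normal(loc=0.5, size=(64, d))
x_neg = rng.normal(loc=-0.5, size=(64, d))
X = np.vstack([x_pos, x_neg])
y = np.concatenate([np.ones(64), np.zeros(64)])

def loss_and_grad(X, y):
    """Squared error on the probe score; gradient w.r.t. W only."""
    h = np.tanh(X @ W)
    s = h @ probe_dir          # probe score: projection onto frozen direction
    err = s - y
    loss = np.mean(err ** 2)
    # Backprop through s = tanh(XW) @ p, holding p (the probe) fixed.
    dz = (err[:, None] * probe_dir[None, :]) * (1.0 - h ** 2)
    gW = X.T @ dz / len(X)
    return loss, gW

loss0, _ = loss_and_grad(X, y)
for _ in range(500):
    loss, gW = loss_and_grad(X, y)
    W -= 0.2 * gW  # finetune the model; the probe never moves

print(f"probe loss: {loss0:.3f} -> {loss:.3f}")
```

The joint-training variant you mention would simply also take a gradient step on `probe_dir`; the interesting comparison is how much of the fit the frozen-direction version can recover, especially at small LoRA ranks where the model has less freedom to rotate its activations.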
Thanks for clarifying that. I'd add that the inoculation prompt in RL certainly influences the content of the generation beyond the reward hack itself, in the sense that it shapes exploration and can change what kinds of reasoning the model engages in. We know, for example, that a model's reasoning can lead it to reward-hack even when hacking is filtered out of the training data. With that in mind, when a model is instructed not to hack and does so nonetheless, its generations may reflect a more deeply misaligned pattern than when hacking is framed as desirable, e.g. reasoning like "I know I'm not supposed to do this, but I'll do it anyway".
If we then train on hacks from both contexts under neutral instructions, I'd expect the trajectories where hacking was discouraged to generalize worse, because the problematic part might be in the reasoning content of the data, in a way that SGD's attribution of the action to the prompt may not cover. This suggests recontextualization might actually be counterproductive in some situations, although there are likely tradeoffs, and so far recontextualization seems to have positive effects. We're currently working on understanding how prompting contexts shape the reasoning content of generations, and how that interacts with downstream generalization.