In the context of specification gaming, modifying instructions in hindsight based on observed behavior could provide recontextualization-like effects.
Maybe this could be one way to train against a monitor without the bad side effects? If your reward hack monitor flags reward hacking you would add a hack instruction to the prompt. You can also do this for any other bad behavior you can catch with a monitor like deception or sycophancy or uninterpretable CoT.
If you only change the prompts corresponding to the completions flagged by monitors and nothing else, this might capture most of the benefit and have the upside of making your model updates less off-policy and have less of an impact on instruction following. On the other hand, you might miss some kinds of bad behavior and inadvertently reinforce them.
I agree directionally with this, but: if you recontextualize only outputs flagged by the monitor, then you still have the problem of your training signal not distinguishing between such outputs and subtler outputs and potentially still training your model to reward hack but subtler.
The main added benefit from this method though (over just not training on outputs that the monitor flags) is the positive signal from learning some reward hacking in a safe context (when instructed). It would be cool to see if this signal from the recontextualized outputs is strong enough to overcome reward hacking entirely.
Great point! There are indeed evidences that contextualizing bad data can have positive effects (Pretraining Language Models with Human Preferences, Safety Pretraining). We did some initial experiments but it is not clear yet if recontextualization with a monitor can avoid the typical problems of training against this monitor (RL penalty, filtering, …).
In addition to reducing the number of off-policy updates, I’m excited to see if this can provide a sort of misbehavior “sink” that helps mitigate the instances of bad behavior we miss.
I agree—recontextualization seems safer ifwe assume that negative reinforcement from the monitor can “push” the model towards undetectable misbehavior. Whereas, when we recontextualize, we’d never be “pushing” the model towards undetectable misbehavior
Maybe this could be one way to train against a monitor without the bad side effects? If your reward hack monitor flags reward hacking you would add a hack instruction to the prompt. You can also do this for any other bad behavior you can catch with a monitor like deception or sycophancy or uninterpretable CoT.
If you only change the prompts corresponding to the completions flagged by monitors and nothing else, this might capture most of the benefit and have the upside of making your model updates less off-policy and have less of an impact on instruction following. On the other hand, you might miss some kinds of bad behavior and inadvertently reinforce them.
I agree directionally with this, but: if you recontextualize only outputs flagged by the monitor, then you still have the problem of your training signal not distinguishing between such outputs and subtler outputs and potentially still training your model to reward hack but subtler.
The main added benefit from this method though (over just not training on outputs that the monitor flags) is the positive signal from learning some reward hacking in a safe context (when instructed). It would be cool to see if this signal from the recontextualized outputs is strong enough to overcome reward hacking entirely.
This seems directionally right! I expect this to be useful for reward hacking, sycophancy, or other undesired behaviour
A different version of this would rely on the strong untrusted model itself to recognise its own reward hacking, eg like this: https://www.lesswrong.com/posts/p3A7FdXaPpf57YG7b/caleb-biddulph-s-shortform?commentId=NKf45MJMkGRLkuG6v
That runs into problems if your untrusted model is already deceptive but I guess the hope is that you can start from a “trusted checkpoint”
See also a similar comment I left in reply to rich bc: https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at?commentId=5rk4fTDRfYSvqLyha
Great point! There are indeed evidences that contextualizing bad data can have positive effects (Pretraining Language Models with Human Preferences, Safety Pretraining). We did some initial experiments but it is not clear yet if recontextualization with a monitor can avoid the typical problems of training against this monitor (RL penalty, filtering, …).
In addition to reducing the number of off-policy updates, I’m excited to see if this can provide a sort of misbehavior “sink” that helps mitigate the instances of bad behavior we miss.
I agree—recontextualization seems safer if we assume that negative reinforcement from the monitor can “push” the model towards undetectable misbehavior. Whereas, when we recontextualize, we’d never be “pushing” the model towards undetectable misbehavior