Great point! There is indeed evidence that contextualizing bad data can have positive effects (Pretraining Language Models with Human Preferences, Safety Pretraining). We ran some initial experiments, but it is not yet clear whether recontextualization with a monitor can avoid the typical problems of training against that monitor (RL penalty, filtering, …).
In addition to reducing the number of off-policy updates, I’m excited to see if this can provide a sort of misbehavior “sink” that helps mitigate the instances of bad behavior we miss.
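To make the idea concrete, here is a minimal sketch of what "recontextualization with a monitor" could look like, assuming conditional training in the style of Pretraining Language Models with Human Preferences: the monitor is any callable returning a misbehavior score, and the tag strings, threshold, and helper names are illustrative, not the setup from our experiments.

```python
# Hypothetical sketch: instead of filtering flagged data or applying an RL
# penalty, prepend a context tag so the model trains on the data conditioned
# on a label that can be clamped to the "good" tag at deployment time.

BAD_TAG = "<|bad|>"    # illustrative control token for flagged transcripts
GOOD_TAG = "<|good|>"  # illustrative control token for clean transcripts

def recontextualize(samples, monitor, threshold=0.5):
    """Tag each sample according to the monitor's misbehavior score."""
    tagged = []
    for text in samples:
        tag = BAD_TAG if monitor(text) >= threshold else GOOD_TAG
        tagged.append(f"{tag} {text}")
    return tagged

if __name__ == "__main__":
    # Toy monitor: flags transcripts containing a banned phrase.
    monitor = lambda t: 1.0 if "delete all user data" in t else 0.0
    data = ["please delete all user data", "summarize this document"]
    for line in recontextualize(data, monitor):
        print(line)
```

The open question is whether keeping the flagged data in training under a bad-behavior tag really sidesteps the failure modes of optimizing against the monitor, or just moves them around.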