I agree directionally with this, but: if you recontextualize only outputs flagged by the monitor, then you still have the problem of your training signal not distinguishing between such outputs and subtler outputs and potentially still training your model to reward hack but subtler.
The main added benefit from this method though (over just not training on outputs that the monitor flags) is the positive signal from learning some reward hacking in a safe context (when instructed). It would be cool to see if this signal from the recontextualized outputs is strong enough to overcome reward hacking entirely.
I agree directionally with this, but: if you recontextualize only outputs flagged by the monitor, then you still have the problem of your training signal not distinguishing between such outputs and subtler outputs and potentially still training your model to reward hack but subtler.
The main added benefit from this method though (over just not training on outputs that the monitor flags) is the positive signal from learning some reward hacking in a safe context (when instructed). It would be cool to see if this signal from the recontextualized outputs is strong enough to overcome reward hacking entirely.