Thanks for the connection to your work! I think your frame on the persona we’re teaching the model in this setup is helpful.
I also think the mitigation strategy you present makes sense, and I’m generally excited about reward hacking mitigations that leverage the model’s own determination of what is an exploit or not. Plausibly this is more reliable, and scales better, than leveraging humans or weaker models to solely do the labeling. Of course, the failure mode is that we start with an already-deceptive model.
Thanks for the connection to your work! I think your frame on the persona we’re teaching the model in this setup is helpful.
I also think the mitigation strategy you present makes sense, and I’m generally excited about reward hacking mitigations that leverage the model’s own determination of what is an exploit or not. Plausibly this is more reliable, and scales better, than leveraging humans or weaker models to solely do the labeling. Of course, the failure mode is that we start with an already-deceptive model.