Great post!
> Third, some AI safety researchers believe that reward hacking is not particularly relevant for the emergence of severe misalignment, and that other kinds of misalignment, like scheming, are worth more study
For what it’s worth, I consider “reward hacking where you have to do any kind of deception” a subset of scheming. My model is that because of the pressure RL creates to reward hack, we’ll put up more monitoring / checks / etc., which the models will then sometimes circumvent. The clearest cases are something like “directly lying in a text response when you can get away with it because you can fool a weaker grader model”, or “taking some clearly unintended shortcut and proactively disabling the mechanism that would’ve caught you”.
Thanks for making the distinction! I agree that scheming and reward hacking are overlapping concepts, and I’ve just edited the post to make that clearer.
I think your model of how scheming and reward hacking are likely to coincide in the future makes a lot of sense. It also seems possible that sufficiently strong monitoring and oversight systems (in certain domains) will make it impossible for models to reward hack without scheming.