Thanks for making the distinction! I agree that scheming and reward hacking are overlapping concepts, and I’ve just edited the post to make that clearer.
I think your model of how scheming and reward hacking are likely to coincide in the future makes a lot of sense. It also seems possible that sufficiently strong monitoring and oversight systems (in certain domains) will make it impossible for models to reward hack without scheming.