To me this doesn’t seem like a failure of sophisticated reward models; it’s a failure of unsophisticated reward models (unit tests) when they’re being optimized against. I think that if we were to add some expensive evaluation during RL whereby 3.6 checked whether 3.7 was “really doing the work”, this sort of special-casing would get totally trained out (rough sketch of what that reward could look like below).
(Not claiming that this is always the case, or that models couldn’t be deceptive here, or that e.g. 3.8 couldn’t reward hack 3.7)
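A minimal sketch of the kind of composite reward I have in mind, with hypothetical names throughout (`judge_really_does_the_work` stands in for an expensive query to a trusted judge model like 3.6; nothing here is a real training API):

```python
def judge_really_does_the_work(problem: str, solution_code: str) -> float:
    """Hypothetical call to a trusted judge model (e.g. an earlier model like 3.6)
    returning a score in [0, 1] for whether the solution solves the general
    problem rather than hard-coding the test cases."""
    raise NotImplementedError("stand-in for an expensive judge-model query")


def composite_reward(problem: str, solution_code: str, tests_pass: bool,
                     judge_threshold: float = 0.5) -> float:
    """RL reward: unit tests alone are easy to optimize against, so only pay
    out when the judge also believes the work is real."""
    if not tests_pass:
        return 0.0
    judge_score = judge_really_does_the_work(problem, solution_code)
    # Zero reward for special-casing: passing the tests without doing the work.
    return 1.0 if judge_score >= judge_threshold else 0.0
```

The point of gating the test signal behind the judge (rather than just summing the two) is that a policy which special-cases the tests gets no credit at all, so there's nothing to climb toward in that direction.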