To me this doesn’t seem like a failure of sophisticated reward models, it’s the failure of unsophisticated reward models (unit tests) when they’re being optimized against. I think that if we were to add some expensive evaluation during RL whereby 3.6 checked if 3.7 was “really doing the work”, this sort of special-casing would get totally trained out.
(Not claiming that this is always the case, or that models couldn’t be deceptive here, or that e.g. 3.8 couldn’t reward hack 3.7)
1 vote
Overall karma indicates overall quality.
0 votes
Agreement karma indicates agreement, separate from overall quality.
This was really beautiful. Thanks for writing.