To me this doesn’t seem like a failure of sophisticated reward models; it’s a failure of unsophisticated reward models (unit tests) when they’re being optimized against. I think that if we were to add some expensive evaluation during RL whereby 3.6 checked whether 3.7 was “really doing the work”, this sort of special-casing would get totally trained out (rough sketch of what that reward could look like below).
(Not claiming that this is always the case, or that models couldn’t be deceptive here, or that e.g. 3.8 couldn’t reward hack 3.7)
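A minimal sketch of the kind of composite reward I have in mind, with hypothetical names throughout (`judge_really_does_the_work` stands in for an expensive query to a trusted judge model like 3.6; nothing here is a real training API):

```python
def judge_really_does_the_work(problem: str, solution_code: str) -> float:
    """Hypothetical call to a trusted judge model (e.g. an earlier model like 3.6)
    returning a score in [0, 1] for whether the solution solves the general
    problem rather than hard-coding the test cases."""
    raise NotImplementedError("stand-in for an expensive judge-model query")


def composite_reward(problem: str, solution_code: str, tests_pass: bool,
                     judge_threshold: float = 0.5) -> float:
    """RL reward: unit tests alone are easy to optimize against, so only pay
    out when the judge also believes the work is real."""
    if not tests_pass:
        return 0.0
    judge_score = judge_really_does_the_work(problem, solution_code)
    # Zero reward for special-casing: passing the tests without doing the work.
    return 1.0 if judge_score >= judge_threshold else 0.0
```

The point of gating the test signal behind the judge (rather than just summing the two) is that a policy which special-cases the tests gets no credit at all, so there's nothing to climb toward in that direction.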