Rohin Shah comments on Training a Reward Hacker Despite Perfect Labels

Rohin Shah 16 Aug 2025 15:00 UTC
LW: 21 AF: 12
I think Oliver’s pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.
The experimental setup (in the sense of getting bad behavior despite perfect labels on the training set) was also done prior to the popularization of reward-as-chisel.