Richard_Ngo comments on Training a Reward Hacker Despite Perfect Labels

Richard_Ngo 16 Aug 2025 9:19 UTC
LW: 9 AF: 7
3
"By thinking about reward in this way, I was able to predict and encourage the success of this research direction."
Congratulations on doing this :) More specifically, I think there are two parts of making predictions: identifying a hypothesis at all, and then figuring out how likely the hypothesis is to be true or false. The former part is almost always the hard part, and that’s the bit where the “reward reinforces previous computations” frame was most helpful.
(I think Oliver’s pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.)
- Rohin Shah 16 Aug 2025 15:00 UTC
  LW: 21 AF: 12
  2
  "I think Oliver’s pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for."
  The experimental setup (in the sense of getting bad behavior despite perfect labels on the training set) was also done prior to the popularization of reward-as-chisel.