My retraction stemmed from uncertainty and insecurity about my unfamiliarity with the details of the RL algorithms people use, and from deferring to people who disagree with Turner on related topics.
I now think this was an overreaction and probably a mistake, since I still believe that many of the basic points, like “it’s not tautologically true that reward will be optimized for,” are true and were worth saying out loud.
The main thing I wish I had done differently is be more explicit that reward-seeking behavior is entirely compatible with the point of the post, and that reward might be an optimization target. It’s too easy to read my review and come away thinking that [models behaving in ways you would predict from the frame/hypothesis “models optimize for reward”] is unlikely, or that I think it’s unlikely. While I never explicitly made that claim, I wouldn’t blame a casual reader for arriving at that conclusion given the way my comment was written.
So, to clarify my position: I think various forms of reward-seeking/reward-optimizing behavior are very likely, and Sonnet 3.7 hacking unit tests is a real-life example of this phenomenon. I’m uncertain how “deep”/strategic/consequentialist such behavior will eventually be. It’s not tautological that models will care about reward that much, but I also think it’s a live possibility that they will be well described as optimizing heavily for getting high reward.
Having written down this clarification, I’ll now undo my retraction.