I view this post as providing value in three (related) ways:
Making a pedagogical advancement regarding the so-called inner alignment problem
Pointing out that a common view of “RL agents optimize reward” is subtly wrong
Pushing for thinking mechanistically about cognition-updates
Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn’t truly comprehend it—sure, I could parrot back terms like “base optimizer” and “mesa-optimizer”, but it didn’t click. I was confused.
Some months later I read this post and then it clicked.
Part of the pedagogical value is not having to introduce the 4 terms of form [base/mesa] + [optimizer/objective] and throwing those around. Even with Rob Miles’ exposition skills that’s a bit overwhelming.
Another part I liked were the phrases “Just because common English endows “reward” with suggestive pleasurable connotations” and “Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater.” One could be tempted to object and say that surely no one would make the mistakes pointed out here, but definitely some people do. I did. Being a bit gloves off here definitely helped me.
Re 2: The essay argues for, well, reward not being the optimization target. There is some deep discussion in the comments about the likelihood of reward in fact being the optimization target, or at least quite close (see here). Let me take a more shallow view.
I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It’s the former view that this post (correctly) argues against. I am sympathetic to pushback of the form “there are arguments that make it reasonable to privilege reward-maximization as a hypothesis” and about this post going a bit too far, but these remarks should not be confused with a rebuttal of the basic point of “cognition-updates are a completely different thing from terminal-goals”.
(A part that has bugged me is that the notion of maximizing reward doesn’t seem to be even well-defined—there are multiple things you could be referring to when you talk about something maximizing reward. See e.g. footnote 82 in the Scheming AIs paper (page 29). Hence taking it for granted that reward is maximized has made me confused or frustrated.)
Re 3: Many of the classical, conceptual arguments about AI risk talk about maximums of objective functions and how those are dangerous. As a result, it’s easy to slide to viewing reinforcement learning policies in terms of maximums of rewards.
I think this is often a mistake. Sure, to first order “trained models get high reward” is a good rule of thumb, and “in the limit of infinite optimization this thing is dangerous” is definitely good to keep in mind. I still think one can do better in terms of descriptive accounts of current models, and I think I’ve got value out of thinking cognition-updates instead of models that maximize reward as well as they can with their limited capabilities.
There are many similarities between inner alignment and “reward is not the optimization target”. Both are sazens, serving as handles for important concepts. (I also like “reward is a cognition-modifier, not terminal-goal”, which I use internally.) Another similarity is that they are difficult to explain. Looking back at the post, I felt some amount of “why are you meandering around instead of just saying the Thing?”, with the immediate next thought being “well, it’s hard to say the Thing”. Indeed, I do not know how to say it better.
Nevertheless, this is the post that made me get it, and there are few posts that I refer to as often as this one. I rank it among the top posts of the year.
Just now saw this very thoughtful review. I share a lot of your perspective, especially:
I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It’s the former view that this post (correctly) argues against.
and
Looking back at the post, I felt some amount of “why are you meandering around instead of just saying the Thing?”, with the immediate next thought being “well, it’s hard to say the Thing”. Indeed, I do not know how to say it better.
My retraction stemmed from uncertainty/insecurity around me not being familiar with the details of the RL algorithms people use, and from conforming to people who disagree with Turner on related topics.
I think this was an overreaction and probably a mistake, though, since I still think that many of the basic points like “it’s not tautologically true that reward will be optimized for” are true and were good to say out loud.
The main thing I wish I had done differently would have been to be more explicit that reward-seeking behavior is totally compatible with the point of the post, and that reward might be an optimization target. I think it’s too easy to read my review and come away thinking that [models behaving in ways you would predict based on the frame/hypothesis “models optimize for reward”] is unlikely (or to think that I think this is unlikely). While I never explicitly made this claim, I wouldn’t blame a casual reader for arriving at that conclusion due to the way my comment was written.
So, to clarify my position: I think that various forms of reward-seeking/reward-optimizing behavior are very likely, and indeed Sonnet 3.7 hacking unit tests is a real-life example of this phenomenon. I’m uncertain of how “deep”/strategic/consequentialist such behavior will eventually be. It’s not tautological that models will care about reward that much, but I also think that it’s a live possibility that they will be well described as optimizing a lot for getting high reward.
With having written down this clarification, I’ll now undo my retraction.
I view this post as providing value in three (related) ways:
Making a pedagogical advancement regarding the so-called inner alignment problem
Pointing out that a common view of “RL agents optimize reward” is subtly wrong
Pushing for thinking mechanistically about cognition-updates
Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn’t truly comprehend it—sure, I could parrot back terms like “base optimizer” and “mesa-optimizer”, but it didn’t click. I was confused.
Some months later I read this post and then it clicked.
Part of the pedagogical value is not having to introduce the 4 terms of form [base/mesa] + [optimizer/objective] and throwing those around. Even with Rob Miles’ exposition skills that’s a bit overwhelming.
Another part I liked were the phrases “Just because common English endows “reward” with suggestive pleasurable connotations” and “Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater.” One could be tempted to object and say that surely no one would make the mistakes pointed out here, but definitely some people do. I did. Being a bit gloves off here definitely helped me.
Re 2: The essay argues for, well, reward not being the optimization target. There is some deep discussion in the comments about the likelihood of reward in fact being the optimization target, or at least quite close (see here). Let me take a more shallow view.
I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It’s the former view that this post (correctly) argues against. I am sympathetic to pushback of the form “there are arguments that make it reasonable to privilege reward-maximization as a hypothesis” and about this post going a bit too far, but these remarks should not be confused with a rebuttal of the basic point of “cognition-updates are a completely different thing from terminal-goals”.
(A part that has bugged me is that the notion of maximizing reward doesn’t seem to be even well-defined—there are multiple things you could be referring to when you talk about something maximizing reward. See e.g. footnote 82 in the Scheming AIs paper (page 29). Hence taking it for granted that reward is maximized has made me confused or frustrated.)
Re 3: Many of the classical, conceptual arguments about AI risk talk about maximums of objective functions and how those are dangerous. As a result, it’s easy to slide to viewing reinforcement learning policies in terms of maximums of rewards.
I think this is often a mistake. Sure, to first order “trained models get high reward” is a good rule of thumb, and “in the limit of infinite optimization this thing is dangerous” is definitely good to keep in mind. I still think one can do better in terms of descriptive accounts of current models, and I think I’ve got value out of thinking cognition-updates instead of models that maximize reward as well as they can with their limited capabilities.
There are many similarities between inner alignment and “reward is not the optimization target”. Both are sazens, serving as handles for important concepts. (I also like “reward is a cognition-modifier, not terminal-goal”, which I use internally.) Another similarity is that they are difficult to explain. Looking back at the post, I felt some amount of “why are you meandering around instead of just saying the Thing?”, with the immediate next thought being “well, it’s hard to say the Thing”. Indeed, I do not know how to say it better.
Nevertheless, this is the post that made me get it, and there are few posts that I refer to as often as this one. I rank it among the top posts of the year.
Just now saw this very thoughtful review. I share a lot of your perspective, especially:
and
@Olli Järviniemi Care to elaborate why you no longer endorse this review?
My retraction stemmed from uncertainty/insecurity around me not being familiar with the details of the RL algorithms people use, and from conforming to people who disagree with Turner on related topics.
I think this was an overreaction and probably a mistake, though, since I still think that many of the basic points like “it’s not tautologically true that reward will be optimized for” are true and were good to say out loud.
The main thing I wish I had done differently would have been to be more explicit that reward-seeking behavior is totally compatible with the point of the post, and that reward might be an optimization target. I think it’s too easy to read my review and come away thinking that [models behaving in ways you would predict based on the frame/hypothesis “models optimize for reward”] is unlikely (or to think that I think this is unlikely). While I never explicitly made this claim, I wouldn’t blame a casual reader for arriving at that conclusion due to the way my comment was written.
So, to clarify my position: I think that various forms of reward-seeking/reward-optimizing behavior are very likely, and indeed Sonnet 3.7 hacking unit tests is a real-life example of this phenomenon. I’m uncertain of how “deep”/strategic/consequentialist such behavior will eventually be. It’s not tautological that models will care about reward that much, but I also think that it’s a live possibility that they will be well described as optimizing a lot for getting high reward.
With having written down this clarification, I’ll now undo my retraction.