Nice post. I read it quickly but think I agree with basically all of it. I particularly like the section starting “The AI doesn’t have a cached supergoal for ‘maximize reward’, but it decides to think anyway about whether reward is an instrumental goal”.
“The distinct view that truly terminal reward maximization is kind of narrow or bizarre or reflection-unstable relative to instrumental reward maximization” is a good summary of my position. You don’t say much that directly contradicts this, though I do think that even using the “terminal reward seeker” vs “schemer” distinction privileges the role of reward a bit too much. For example, I expect that even an aligned AGI will have some subagent that cares about reward (e.g. maybe it’ll have some sycophantic instincts still). Is it thereby a schemer? Hard to say.
Aside from that, I’d add a few clarifications (nothing major):
The process of deciding on a new supergoal will probably involve systematizing not just “maximize reward” but a bunch of other drives too—including ones that had previously been classified as special cases of “maximize reward” (e.g. “make humans happy”) but which, upon reflection, are more naturally understood as special cases of the new supergoal.
It seems like you implicitly assume that the supergoal will be “in charge”. But I expect that there will be a lot of conflict between the supergoal and lower-level goals, analogous to the conflict between different layers of an organizational hierarchy (or between a human’s System 2 motivations and System 1 motivations). I call the spectrum from “all power is at the top” to “all power is at the bottom” the systematizing-conservatism spectrum.
I think that formalizing the systematizing-conservatism spectrum would be a big step forward in our understanding of misalignment (and cognition more generally). If anyone reading this is interested in working with me on that, apply to my MATS stream in the next 5 days.
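To make the spectrum slightly more concrete, here’s a purely illustrative toy parameterization (my own sketch, not something proposed in the post): treat the agent as trading off a systematized supergoal utility against its pre-existing lower-level drives, with a single parameter controlling where the power sits.

$$U(a) \;=\; \lambda\, U_{\text{super}}(a) \;+\; (1-\lambda) \sum_i w_i\, U_i(a), \qquad \lambda \in [0,1]$$

Here $\lambda = 1$ corresponds to “all power is at the top” (pure systematizing) and $\lambda = 0$ to “all power is at the bottom” (pure conservatism). The interesting questions are about the dynamics that move $\lambda$ and reshape the lower-level $U_i$ over time, which a static mixture like this doesn’t capture.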