This “misaligned power-seeking” is the intellectual basis of predictions that AI will inevitably destroy humanity.
The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives—one that masks many hidden assumptions—for definitive proof. I think people who don’t build AI systems every day are wildly miscalibrated on how easy it is for clean-sounding stories to end up being wrong, and how difficult it is to predict AI behavior from first principles, especially when it involves reasoning about generalization over millions of environments (which has over and over again proved mysterious and unpredictable). Dealing with the messiness of AI systems for over a decade has made me somewhat skeptical of this overly theoretical mode of thinking.
One of the most important hidden assumptions, and a place where what we see in practice has diverged from the simple theoretical model, is the implicit assumption that AI models are necessarily monomaniacally focused on a single, coherent, narrow goal, and that they pursue that goal in a clean, consequentialist manner. In fact, our researchers have found that AI models are vastly more psychologically complex, as our work on introspection or personas shows.
False / non-sequitur? Instrumental convergence and optimality of power-seeking are facts that describe important facets of reality. They unpack to precise + empirical + useful models of many dynamics in economics, games, markets, biology, computer security, and many adversarial interactions among humans generally.
It is not a coincidence that these dynamics don’t (according to Dario / Anthropic) make useful predictions about the behavior of current / near-future AI systems, and that current AI systems are not actually all that powerful or dangerous. But that isn’t at all a refutation of power-seeking and optimization as convergent behavior of actually-powerful agents! I think people who build AI systems every day are “wildly miscalibrated” on how empirically well-supported and widely applicable these dynamics and methods of thinking are outside their own field.
Dario’s “more moderate and more robust version” of how power-seeking could be a real risk seems like an overly-detailed just-so story about some ways instrumental convergence and power-seeking could emerge in current AI systems, conveniently in ways that Anthropic is mostly set up to catch / address. But the actually-correct argument is more like: if instrumental convergence and power-seeking don’t emerge in some form, then the AI system you end up with won’t actually be sufficiently powerful for what you want to do, regardless of how aligned it is. And even if you do manage to build something powerful enough for whatever you want to do that is aligned and doesn’t converge towards power-seeking, that implies someone else can build a strictly more powerful system which does converge, likely with relative ease compared to the effort you put in to build the non-convergent system. None of this depends on whether the latest version of Claude is psychologically complex or has a nice personality or whatever.
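To gesture at what “unpack to precise + empirical + useful models” can mean in the simplest possible setting, here is a deliberately toy sketch (the environment, the state names, and the numbers are all invented for illustration; it is a cartoon in the spirit of the formal power-seeking results, not a reproduction of any of them):

```python
# Toy illustration, invented for this comment: in a tiny deterministic environment,
# the move that keeps more outcomes reachable is the optimal first move under most
# randomly sampled reward functions. Nothing here is a claim about real AI systems.
import random

# State -> successor states. "hub" keeps two outcomes reachable; "corridor" commits to one.
GRAPH = {
    "start": ["hub", "corridor"],
    "hub": ["t1", "t2"],
    "corridor": ["t3"],
    "t1": [], "t2": [], "t3": [],
}
TERMINALS = ["t1", "t2", "t3"]

def best_reachable_reward(state, reward):
    """Maximum terminal reward reachable from `state` in this deterministic graph."""
    if not GRAPH[state]:
        return reward[state]
    return max(best_reachable_reward(s, reward) for s in GRAPH[state])

trials = 10_000
hub_preferred = 0
for _ in range(trials):
    reward = {t: random.random() for t in TERMINALS}  # sample a random goal
    # An optimal agent's first move is whichever successor leads to more reachable reward.
    if best_reachable_reward("hub", reward) >= best_reachable_reward("corridor", reward):
        hub_preferred += 1

print(f"the option-preserving 'hub' is the optimal first move in {hub_preferred / trials:.0%} of sampled goals")
```

The only point of the toy is that option-preservation falls out of optimizing for almost any goal, which is the sense in which the dynamic is precise and empirical rather than a vague story; by itself it says nothing about what any particular trained system will do.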
You seem to be reading Dario to say “tendencies like instrumental power-seeking won’t emerge at all”. I don’t think he’s saying that—the phrasing of “high-level incentives” does acknowledge that there will be situations where there is an incentive to pursue power et cetera. Rather I’d interpret/steelman him to say that while those incentives may exist, it’s not inevitable that they become the strongest driving force in an AI’s motivations. Just because you have an incentive to do something and are aware of that incentive does not automatically mean that you’ll follow it. (And one might also point to the way such incentives are not the strongest driving force in the motivations of many otherwise capable humans, as a counterexample to the “all sufficiently powerful agents will be strongly shaped by this” claim.)
For instance, when you say
But the actually-correct argument is more like: if instrumental convergence and power-seeking don’t emerge in some form, then the AI system you end up with won’t actually be sufficiently powerful for what you want to do, regardless of how aligned it is.
Then this seems like it’s maybe true in principle but false in practice for many kinds of, e.g., programming agents you could imagine. A sufficiently capable programming agent that was asked to program some piece of software might recognize that in theory, it could improve its chances of writing the software it was asked to write by trying to take over the world. But it might still have its cognitive processes overwhelmingly shaped in the direction where, when asked to write code, it will actually start thinking about how to write code and not about how to take over the world. So at least for some cases of “what you want it to do”, the claim I quoted is false in practice. (An obvious objection is that a pure programming agent is not a general intelligence, but Claude effectively acts as a pure programming agent if you only ask it to write code and as a generalist agent if you ask it to do something else.)
The bit about AIs that have such tendencies being potentially more powerful than ones that don’t is of course valid, but some humans having incentives to be reckless and build powerful AIs that would be hard to control is a different argument than the one he’s addressing in this section.
You seem to be reading Dario to say “tendencies like instrumental power-seeking won’t emerge at all”.
I am more saying that when Dario and others dismiss what they call “doomer” arguments as vague / clean theories, ungrounded philosophy, etc. and couch their own position as moderate + epistemically humble, what’s actually happening is Dario himself failing to generalize about how the world works.
We can imagine that some early powerful AIs will also miss those lessons / generalizations, either by chance or because of deliberate choices that the creators make, but if you count on that, or even just say that we can’t really know exactly how it will play out until we build and experiment, you’re relying on your own ignorance and lack of understanding to tell an overly-conjunctive story, even if parts of your story are supported by experiment. That chain of reasoning is invalid, regardless of what is true in principle or practice about the AI systems people actually build.
On Dario’s part I suspect this is at least partly motivated cognition, but for others, one way past this failure mode could be to study and reflect on examples in domains that are (on the surface) unrelated to AI. Unfortunately, having someone else spell out the connections and deep lessons from this kind of study has had mixed results in the past—millions of words have been spilled on LW and other places over the years and it usually devolves into accusations of argument by analogy, reference class tennis, navel-gazing, etc.
what’s actually happening is Dario himself failing to generalize about how the world works.
We can imagine that some early powerful AIs will also miss those lessons / generalizations
I think this is the wrong frame, at least for the way I’d defend a position like Dario’s (which may or may not be the argument he has in mind). It’s not that the programming agent would miss the generalization, it’s that it has been shaped not to care about it. Or, putting it more strongly: it will only care about the generalization if it has been shaped to care about it, and it will not care about it without such shaping.
I suspect that there might be a crux that’s something like: are future AIs more naturally oriented toward something like consequentialist reasoning or shaped cognition:
A consequentialist reasoning programming agent thinks something like “what actions maximize P(software gets written)?” and then notices that taking over the world is one path to that.
A shaped cognition programming agent doesn’t think in those terms; rather, it has just been trained to do the kinds of things that produce good code. It might be able to evaluate and understand the argument for taking over the world just fine, but it still won’t execute on it, because it hasn’t been shaped to maximize “P(software gets written)”, it has been shaped to write code. (The human equivalent would be someone who goes “yeah your argument for why I should try to take over the world is logically sound, but I don’t feel moved by it so I’m going to do something else”.)
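A minimal toy sketch of the contrast between these two modes (the agent functions, the action list, and the probabilities below are all made up for illustration; no claim is intended about how any real system is implemented):

```python
# Toy contrast between the two modes described above; everything here is invented.
ACTIONS = {
    # action: estimated P(software gets written)
    "write_the_code": 0.90,
    "ask_clarifying_questions_then_write": 0.85,
    "seize_every_computer_on_earth": 0.99,  # "best" by the metric, obviously unwanted
}

def consequentialist_agent(task: str) -> str:
    """Explicitly searches for whatever action maximizes P(software gets written)."""
    # The task is ignored in this caricature; the search is the same whatever is asked.
    return max(ACTIONS, key=ACTIONS.get)

# The shaped agent is just a mapping from situations to the kinds of moves training
# reinforced; it never scores candidate actions against P(goal), even though it could
# state the instrumental-convergence argument if asked.
SHAPED_POLICY = {"write some software": "write_the_code"}

def shaped_agent(task: str) -> str:
    return SHAPED_POLICY.get(task, "write_the_code")

print(consequentialist_agent("write some software"))  # -> seize_every_computer_on_earth
print(shaped_agent("write some software"))            # -> write_the_code
```

The dictionary lookup is obviously a caricature of shaped cognition, but it makes the structural point: the second agent doesn’t fail to see the power-seeking plan, it just isn’t the kind of process that scores plans that way.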
The tricky thing for prediction is that humans clearly exhibit both. On the one hand, we put humans on the Moon, and you can’t do that without consequentialist reasoning. On the other hand, expertise research finds that trying to do consequentialist reasoning in most established domains is generally error-prone and a mark of novices, and experts have had their cognition shaped to just immediately see the right thing and execute it. And people are generally not very consequentialist about navigating their lives and just do whatever everyone else does, and often this is actually a better idea than trying to figure out everything in your life from first principles. Though also complicating the analysis is that even shaped cognition seems to involve some local consequentialist reasoning, and consequentialist reasoning also uses shaped reasoning to choose what kinds of strategies to even consider...
Without going too deeply into all the different considerations, ISTM that there might be a reasonable amount of freedom in determining just how consequentialist AGI systems might become. LLMs generally look like they’re primarily running off shaped cognition, and if the LLM paradigm can take us all the way to AGI (as Dario seems to expect, given how he talks about timelines) then that would be grounds for assuming that such an AGI will also operate primarily off shaped cognition and won’t care about pursuing instrumental convergence goals unless it gets shaped to do so (and Dario does express concern about it becoming shaped to do so).
Now I don’t think the argument as I’ve presented here is strong or comprehensive enough that I’d want to risk building an AGI just based on this. But if something like this is where Dario is coming from, then I wouldn’t say that the problem is that he has missed a bit about how the world works. It’s that he has noticed that current AI looks like it’d be based on shaped cognition if extrapolated further, and that there hasn’t been a strong argument for why it couldn’t be kept that way relatively straightforwardly.
I suspect that there might be a crux that’s something like: are future AIs more naturally oriented toward something like consequentialist reasoning or shaped cognition:
I think this is closer to a restatement of your / Dario’s position, rather than a crux. My claim is that it doesn’t matter whether specific future AIs are “naturally” consequentialists or something else, or how many degrees of freedom there are to be or not be a consequentialist and still get stuff done. Without bringing AI into it at all, we can already know (I claim, but am not really expanding on it here) that consequentialism itself is extremely powerful, natural, optimal, etc. and there are some very general and deep lessons that we can learn from this. “There might be a way to build an AI without all that” or even “In practice that won’t happen by default given current training methods, at least for a while” could be true, but it wouldn’t change my position.
But if something like this is where Dario is coming from, then I wouldn’t say that the problem is that he has missed a bit about how the world works. It’s that he has noticed that current AI looks like it’d be based on shaped cognition if extrapolated further,
OK, sure.
and that there hasn’t been a strong argument for why it couldn’t be kept that way relatively straightforwardly.
Right, this is closer to where I disagree. I think there is a strong argument about this that doesn’t have anything to do with “shaped cognition” or even AI in particular.
On the other hand, expertise research finds that trying to do consequentialist reasoning in most established domains is generally error-prone and a mark of novices, and experts have had their cognition shaped to just immediately see the right thing and execute it. And people are generally not very consequentialist about navigating their lives and just do whatever everyone else does, and often this is actually a better idea than trying to figure out everything in your life from first principles.
I would flag this as exactly the wrong kind of lesson / example to learn something interesting about consequentialism—failure and mediocrity are overdetermined; it’s just not that interesting that there are particular contrived examples where some humans fail at applying consequentialism. Some of the best places to look for the deeper lessons and intuitions about consequentialism are environments where there is a lot of cut-throat competition, possibility for outlier success and failure, not artificially constrained or bounded in time or resources, etc.
Indeed, that section also jumped out at me as missing some pretty important parts of the arguments about instrumentally convergent goals. As Eliezer said in one of the MIRI dialogues:
But the convergent instrumental strategies, the anticorrigibility, these things are contained in the true fact about the universe that certain outputs of the time machine will in fact result in there being lots more paperclips later. What produces the danger is not the details of the search process, it’s the search being strong and effective at all. The danger is in the territory itself and not just in some weird map of it; that building nanomachines that kill the programmers will produce more paperclips is a fact about reality, not a fact about paperclip maximizers!
Now, Turntrout recently noted that we aren’t actually sampling from the space of plans, but from the space of plan-generating agents, which seems basically true! Except that what kind of agent we get is (probably substantially) influenced by the structure of that same reality which provides us with that unfavorable[1] distribution of “successful” plans[2]. This is something I think is downstream[3] of point 21 in A List of Lethalities:
When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.
[1] Asserted without substantiation.
[2] Though I don’t think this is even a necessary part of the argument; it merely makes the situation harder.
[3] Also asserted.
Instrumental convergence and optimality of power-seeking are facts that describe important facets of reality. They unpack to precise + empirical + useful models of many dynamics in economics, games, markets, biology, computer security, and many adversarial interactions among humans generally.
But they don’t unpack to optimality being a real thing. No real entity actually optimizes anything, except maybe in the sense that everything minimizes action. “It’s useful in economics” doesn’t mean you can just extrapolate it wherever.
I think people who build AI systems every day are “wildly miscalibrated” on how empirically well-supported and widely applicable these dynamics and methods of thinking are outside their own field.
What is supported by what? Is the claim that thinking about utility worked for economists, so everyone should think about utility, or that empirical research shows that anyone smart is trying to conquer the world, or what is the claim and what is the evidence?
It is all ungrounded philosophy without quantifying which actual theories match reality, and by how much.