An artificially structured argument for expecting AGI ruin
Philosopher David Chalmers asked:
[I]s there a canonical source for “the argument for AGI ruin” somewhere, preferably laid out as an explicit argument with premises and a conclusion?
Unsurprisingly, the actual reason people expect AGI ruin isn’t a crisp deductive argument; it’s a probabilistic update based on many lines of evidence. The specific observations and heuristics that carried the most weight for someone will vary for each individual, and can be hard to accurately draw out.
That said, Eliezer Yudkowsky’s So Far: Unfriendly AI Edition might be a good place to start if we want a pseudo-deductive argument just for the sake of organizing discussion. People can then say which premises they want to drill down on.
In The Basic Reasons I Expect AGI Ruin, I wrote:
When I say “general intelligence”, I’m usually thinking about “whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems”.
It’s possible that we should already be thinking of GPT-4 as “AGI” on some definitions, so to be clear about the threshold of generality I have in mind, I’ll specifically talk about “STEM-level AGI”, though I expect such systems to be good at non-STEM tasks too.
STEM-level AGI is AGI that has “the basic mental machinery required to do par-human reasoning about all the hard sciences”, though a specific STEM-level AGI could (e.g.) lack physics ability for the same reasons many smart humans can’t solve physics problems, such as “lack of familiarity with the field”.
A simple way of stating the argument in terms of STEM-level AGI is:
1. Substantial Difficulty of Averting Instrumental Pressures. As a strong default, absent alignment breakthroughs, STEM-level AGIs that understand their situation and don’t value human survival as an end will want to kill all humans if they can.
2. Substantial Difficulty of Value Loading. As a strong default, absent alignment breakthroughs, STEM-level AGI systems won’t value human survival as an end.
3. High Early Capabilities. As a strong default, absent alignment breakthroughs or global coordination breakthroughs, early STEM-level AGIs will be scaled to capability levels that allow them to understand their situation, and allow them to kill all humans if they want.
4. Conditional Ruin. If it’s very likely that there will be no alignment breakthroughs or global coordination breakthroughs before we invent STEM-level AGI, then given 1+2+3, it’s very likely that early STEM-level AGI will kill all humans.
5. Inadequacy. It’s very likely that there will be no alignment breakthroughs or global coordination breakthroughs before we invent STEM-level AGI.
Therefore it’s very likely that early STEM-level AGI will kill all humans. (From 1–5)
I’ll say that the “invention of STEM-level AGI” is the first moment when an AI developer (correctly) recognizes that it can build a working STEM-level AGI system within a year. I usually operationalize “early STEM-level AGI” as “STEM-level AGI that is built within five years of the invention of STEM-level AGI”.
I think humanity is very likely to destroy itself within five years of the invention of STEM-level AGI. And plausibly far sooner — e.g., within three months or a year of the technology’s invention. A lot of the technical and political difficulty of the situation stems from this high level of time pressure: if we had decades to work with STEM-level AGI before catastrophe, rather than months or years, we would have far more time to act, learn, try and fail at various approaches, build political will, craft and implement policy, etc.
This argument focuses on “human survival”, but from my perspective the more important claim is that STEM-level AGI systems very likely won’t value awesome cosmopolitan outcomes at all. It’s not just that we’ll die; it’s that there probably won’t be anything else of significant value that the AGI creates in our place.
Elaborating on the five premises:
1. Substantial Difficulty of Averting Instrumental Pressures
In Superintelligence, Nick Bostrom defines an “Instrumental Convergence Thesis”:
[A]s long as they possess a sufficient level of intelligence, agents having any of a wide range of final goals will pursue similar intermediary goals because they have instrumental reasons to do so.
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.
Bostrom distinguishes between “instrumental goals” and “final goals” (“terminal goals” in Yudkowsky’s writing). I call the former “instrumental strategies” instead, to make it clearer that instrumental “goals” are just strategies for achieving ends.
For the argument to carry, it isn’t sufficient to argue that STEM-level AGI systems exhibit instrumental convergence at all; they need to exhibit catastrophic instrumental convergence, i.e., a wide variety of ends need to imply strategies that kill all humans (given the opportunity).
One way of arguing for 1 is via these three subclaims:
1a. STEM-Level AGIs Exhibit Goal-Oriented Behavior by Default. As a strong default, STEM-level AGIs will have “goals”, or will at least look from the outside like they do. By this I mean that they’ll select outputs that competently steer the world toward particular states.
1b. Goal-Oriented Systems Exhibit Catastrophic Instrumental Convergence. Consider, e.g., the instrumental strategies Superintelligence focuses on: for most states of the world you could ultimately be pushing toward (i.e., most “goals”), once you understand your situation well enough, you’ll tend to want there to exist optimizers that share your goal (“self-preservation”, “goal-content integrity”), and you’ll tend to want more power (“cognitive enhancement”, “technological perfection”) and resources (“resource acquisition”).
Humans are potential threats, and we consume (and are made out of) resources that can be put to other ends, so most goals that don’t specifically value human welfare as an end will endorse the conditional strategy “if you see a sufficiently cheap and reliable way to kill all humans, take that opportunity”.
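As a rough illustration of the non-catastrophic half of this claim (that more options help with most goals), here is a minimal toy sketch. It is my own simplification, not a model from Superintelligence: a “goal” is assumed to be an independent random utility assigned to each reachable outcome, and “more resources” is assumed to mean a strictly larger set of reachable outcomes. The function name and parameters are hypothetical.

```python
import random

def fraction_preferring_more_options(n_small, n_large, n_goals, seed=0):
    """Fraction of random goals that do strictly better with the larger option set."""
    rng = random.Random(seed)
    strictly_better = 0
    for _ in range(n_goals):
        # A hypothetical "goal": an independent random utility for each outcome.
        utilities = [rng.random() for _ in range(n_large)]
        best_with_few = max(utilities[:n_small])  # outcomes reachable with few resources
        best_with_many = max(utilities)           # superset reachable with more resources
        if best_with_many > best_with_few:
            strictly_better += 1
    return strictly_better / n_goals

print(fraction_preferring_more_options(n_small=5, n_large=100, n_goals=10_000))
```

Because the larger set contains the smaller one, no goal does worse with more options, and under these assumptions the chance that the best outcome lies among the extra options is (n_large - n_small) / n_large, which is 0.95 for the values used here. The sketch says nothing about the catastrophic part of the thesis; it only shows why “acquire more options” is endorsed by most goals in a setup like this.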
1a and 1b suggest that if STEM-level AGI technology proliferates widely, we’re dead (conditional on 2+3+4). If it makes sense to try to build STEM-level AGI at all in that situation, then the obvious thing to do with your STEM-level AGI is to try to leverage its capabilities to prevent other AGIs from destroying the world (a “pivotal act”). But:
1c. Averting Instrumental Pressures in Pivotal-Act-Enabling AGI is Substantially Difficult. It looks very difficult to safely perform a pivotal act with an AGI system that doesn’t value human survival and flourishing as an end, because there’s no obvious way to avoid dangerous instrumental strategies in systems that capable.
Substantial alignment breakthroughs are very likely required here (and in value loading, interpretability, etc.). We likely won’t get such breakthroughs in time, though we should certainly put a huge effort into trying.
1a and 1b are in effect saying that the least informed and safety-conscious people in the world are likely to build AI systems with dangerous conditional incentives. If you don’t try at all to instill the right goals into your STEM-level AGI systems, and don’t otherwise try to avert these default instrumental pressures, then your systems will be catastrophically dangerous (if they become capable enough).
1c makes the much stronger claim that the most safety-conscious people will fail to avert these instrumental pressures, as a strong default. (Assuming they build AGI that’s powerful enough to possibly be useful for a pivotal act or any similarly ambitious feat.)
Chalmers asked for “canonical (or at least MIRI-canonical) cases for the premises (esp 1, 2, and 5)”, so I’ll collect some sources for supporting arguments here, though I don’t think there’s a single “canonical” source. Many of the arguments support multiple premises or sub-premises, so there’s some arbitrariness in where I mention these below.
I’m not aware of a good resource that fully captures the MIRI-ish perspective on 1a (“STEM-Level AGIs Exhibit Goal-Oriented Behavior by Default”), but from my perspective some of the key supporting arguments are:
Consequentialist Cognition: “Steering toward outcomes” is a relatively simple idea, with a relatively simple formal structure (preference orderings, functions from outcomes to actions that tend to produce them, etc.).
Coherent Decisions Imply Consistent Utilities and Coherence arguments imply a force for goal-directed behavior: Visible deviations from this structure tend to correspond to “throwing away resources for no reason”.
So humans, evolution- or SGD-ish processes, learned optimizers modifying their own thoughts or building successors, etc. have incentives to iron out these inefficiencies wherever possible.
Gwern Branwen’s Why Tool AIs Want To Be Agent AIs discusses other reasons goal-oriented behavior (and other aspects of “agency”, a term I usually try to avoid) tends to be incentivized where it’s an available option.
General-purpose science, technology, engineering, and mathematics work is hard, and requires lining up many ducks in a row. “Minds that try to steer toward specific world-states” are a relatively simple and obvious way to do sufficiently hard things. So even if humanity is only trying to do hard STEM work and isn’t specifically trying to produce goal-oriented systems, it’s likely that the way we first do this will involve goal-oriented systems.
This is also empirically what happened when evolution built scientific reasoners. Insofar as we can think of evolution as an optimization process, it was neither optimizing for “build goal-oriented systems” nor for “build STEM workers”, but was instead (myopically) optimizing for our ancestors’ brains to solve various local problems in their ancestral environment, like “don’t eat poisonous berries” and “get powerful coalitions of other humans to adopt strategies that are likelier to propagate my genes”. This happened to produce relatively general reasoning systems that exhibit goal-oriented behavior, and our cognitive generality and goal-oriented optimization then resulted in us becoming good at STEM further down the road (with no additional evolutionary optimization of our brains for STEM).
Having “goals” in the required sense is a more basic and disjunctive property than it may initially seem. It doesn’t necessarily require, for example:
… that the system be at all human-like, or that it have conscious human-style volition.
… that the system have an internal model of itself, or a model of its goals; or that it be reflectively consistent.
… that the system’s brain cleanly factor into a “goal” component plus other components.
… that the system have only one goal.
… that all parts of the system work toward the same goal.
… that the “goal” be a property of one AI system, rather than something that emerges from multiple systems’ interaction.
… that the system’s goal be perfectly stable over time.
… that the system’s goal be defined over the physical world vs. over its own mind.
… that the system’s output channel be a physical “action”, vs. (say) a text channel.
… that the system have a conventional output channel at all, vs. (say) programmers extracting information from its brain via interpretability tools.
… that the system or network-of-systems have no humans in the loop. If humans are manually passing information back and forth between different parts of the system or supersystem’s “mind”, this doesn’t necessarily address the core dangers, since being physically involved in the system’s cognition doesn’t mean that you personally understand the implications of what you’re doing and can avoid any dangerous steps in the process. Likewise, if humans are doing physical work for the AI rather than giving it actuators, the humans are the actuators from the AI’s perspective, and can be manipulated into doing things we wouldn’t on reflection want to do.
Instead, having “goals” in the relevant sense just requires that the system be steering toward outcomes at all (as opposed to, say, its outputs looking like a sphex’s reflex behavior, insensitive to the future’s state).
Considerations like the above are a lot of why I don’t even discuss “goals” in The Basic Reasons I Expect AGI Ruin. Instead, item 2 in that post emphasizes that all action sequences that push the world toward some sufficiently hard-to-reach state tend to be dangerous. The (catastrophic) instrumental convergence thesis holds for the action sequences themselves. I discuss “goals” more in this post mainly because I’m modeling Chalmers’ target audience as pretty different from my own in various ways.
People will want AGI to do very novel STEM work (and promising pivotal acts in particular seem to require novel STEM work). Regurgitating or mildly tweaking human insights is one thing; efficiently advancing the scientific frontier seems far harder with shallow, unfocused, unstrategic pattern regurgitation.
The Basic Reasons I Expect AGI Ruin (item 1): STEM-level cognition requires “an enormous amount of laserlike focus and strategicness when it comes to which thoughts you do or don’t think. A large portion of your compute needs to be relentlessly funneled into exactly the tiny subset of questions about the physical world that bear on the question you’re trying to answer or the problem you’re trying to solve. If you fail to be relentlessly targeted and efficient in ‘aiming’ your cognition at the most useful-to-you things, you can easily spend a lifetime getting sidetracked by minutiae, directing your attention at the wrong considerations, etc.”
If an AGI system needs to be strategic and outcome-oriented about the events inside its brain, then it will be much more difficult to keep it from being strategic and outcome-oriented about the events outside of its brain.
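The coherence point above (that visible deviations from a consistent preference ordering tend to correspond to “throwing away resources for no reason”) is often illustrated with a money pump. The following is a minimal sketch under assumptions of my own: an agent with cyclic preferences over three hypothetical items that will pay a small fee for any trade up to an item it prefers. The names `PREFERS` and `run_money_pump`, and the fee and starting money, are illustrative and not taken from the cited posts.

```python
# Hypothetical cyclic preferences: A is preferred to B, B to C, and C to A.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means x is preferred to y

def run_money_pump(start_item, money, fee, rounds):
    """Each round, offer the agent the item it prefers to its current one, for a fee."""
    item = start_item
    for _ in range(rounds):
        # Exactly one item is preferred to the one currently held.
        better = next(x for (x, y) in PREFERS if y == item)
        # The agent accepts any trade up to a preferred item if it can pay.
        if money >= fee:
            item, money = better, money - fee
    return item, money

item, money = run_money_pump("A", money=10.0, fee=1.0, rounds=3)
print(item, money)
```

After three trades the agent holds the item it started with and has paid three fees, having gained nothing by its own lights. An optimization process that notices this kind of loss has a reason to iron it out, which is the sense in which coherence pressures push toward goal-directed structure.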
Some sources discussing arguments for 1b (“Goal-Oriented Systems Exhibit Catastrophic Instrumental Convergence”):
Superintelligence ch. 7, linked above.
The Value Learning Problem: Notes that “Whereas agents at similar capability levels have incentives to compromise, collaborate, and trade, agents with strong power advantages over others can have incentives to simply take what they want.”
Cf. a recent Yudkowsky tweet noting that humans aren’t optimal tools for most (non-human) ends.
AGI Ruin emphasizes that there’s no impossibility in producing AGI minds with basically whatever properties you want; it just looks too difficult for humanity to do, under time pressure, given anything remotely like our current technical understanding, before AGI causes an existential catastrophe.
To a large extent the reason we think this is just the reason Nate Soares gives in Ensuring Smarter-Than-Human Intelligence Has a Positive Outcome: “Why do I think that AI alignment looks fairly difficult? The main reason is just that this has been my experience from actually working on these problems.” But we can say more than that about the shape of some of the difficulties. (Keeping in mind that we think many of the difficulties will turn out to be things that aren’t on our radar today.)
Sources arguing for 1c (“Averting Instrumental Pressures in Pivotal-Act-Enabling AGI is Substantially Difficult”):
AGI Ruin, 8: “The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we’d rather the AI not solve”.
AGI Ruin, 11: “If cognitive machinery doesn’t generalize far out of the distribution where you did tons of training, it can’t solve problems on the order of ‘build nanotechnology’” (which seems like the rough capability level needed for using AGI to hit the pause button indefinitely on AGI proliferation).
Ensuring Smarter-Than-Human Intelligence Has a Positive Outcome (Section 2): Corrigibility (the general property of allowing yourself to be shut down, corrected, inspected, etc. rather than manipulating your operators or seizing control from them) turns out to be surprisingly hard to describe in a coherent and precise way.
Problem of Fully Updated Deference: Normative uncertainty doesn’t address the core obstacles to corrigibility.
Ngo and Yudkowsky on Alignment Difficulty: Corrigibility is anti-natural to general means-end reasoning. “[W]e can see ourselves as asking for a very unnatural sort of object: a path-through-the-future that is robust enough to funnel history into a narrow band in a very wide array of circumstances, but somehow insensitive to specific breeds of human-initiated attempts to switch which narrow band it’s pointed towards.”
Quoting AGI Ruin: “‘[Y]ou can’t bring the coffee if you’re dead’ for almost every kind of coffee. We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.”
Deep Deceptiveness: One method for averting instrumental pressures would be to train an AGI to halt its thought processes whenever it starts to approach dangerous topics. But this sort of approach is likely to be extremely brittle, likely to either fail catastrophically or cripple the system, because of issues like unforeseen maxima and nearest unblocked neighbors, and because different topics tend to be highly entangled in rich real-world domains, and because we don’t know how to specify which topics are “dangerous” (see premise 2, below).
Mild Optimization: A different approach to averting instrumental pressures would be to limit how hard the AI tries to achieve outcomes in general. This again runs into issues like “how do we avoid crippling the system in the process?”, as well as “seemingly mild optimizers often prefer to build, or self-modify into, non-mild optimizers”.
We can try to build AGI systems to actively want to stay mild, but this requires us to solve an unusually difficult form of the value loading problem.
(Unusually difficult because mildness actively runs counter to effectiveness and efficiency. Pushing for mildness often means unacceptably slowing down systems, and/or incentivizing systems to work against you and find ways to become less mild.)
2. Substantial Difficulty of Value Loading
When I say that “value loading is difficult”, I tend to distinguish four different claims:
2a. Values Aren’t Shared By Default. If humans don’t try to align STEM-level AGI systems at all, then with very high probability, such systems won’t share our values. (With values like “don’t kill people” as a special case.)
2b. Full Value Loading is Extremely Difficult. Causing one of the very first STEM-level AGI systems to share all of our core values is ~impossibly difficult. Before we can shoot for a target like that and have a good chance of succeeding, we’ll need a lot of practice studying and working with powerful AGI systems, a lot of technical mastery and experience from aligning AGI on easier tasks, and a far deeper understanding of human values and how to robustly check whether we’re converging on them.
If full value loading is going to be out of reach initially, then we can instead try to load enough goals into the first powerful AGI systems to at least cause them to not want to cause catastrophes (e.g., human extinction) while they’re performing various powerful tasks for us. But:
2c. Sufficient-for-Safety Goal Loading is Substantially Difficult. As a strong default, absent alignment breakthroughs, we won’t be able to cause one of the first STEM-level AGI systems to have sufficient-for-safety goals. (E.g., we won’t be able to give it the subset of human morality required for it to do ambitious things without destroying the world).
2d. Pivotal Act Loading is Substantially Difficult. As a strong default, absent alignment breakthroughs, we won’t be able to safely cause one of the first STEM-level AGI systems to want to perform an operator-intended task that helps prevent the world from being destroyed by other AGIs.
2a gives us a reason to care about 2b: if AGI won’t have our values by default, then the obvious response is to try to instill these values into the system. And 2b gives us a reason to care about 2c: if we can’t have everything right off the bat, we can shoot for “enough to prevent disasters”.
2b, in combination with 1+3+4+5 (and 2c), again gives us a reason to care about pivotal acts and thereby motivates 2d. If it’s difficult to cause AGI systems to share our values, then (given 1, 3, etc.) we face an enormous danger from the first STEM-level AGI systems. This would hold even if 1c were false, since AGI tech will proliferate by default and, given wide access to AGI, sooner or later someone will run a powerful AGI without the safeties.
If we can use AGI to perform some pivotal act (or find some other way to pause AGI development and proliferation for as long as the research community needs), then we can take as much time as needed to nail down full value loading.
So the urgent priority is to find some way to be able to hit the brakes, either before humanity reaches STEM-level AGI, or before STEM-level AGI technology proliferates.
Some sources discussing arguments for 2a (“Values Aren’t Shared By Default”):
No Universally Compelling Arguments: Which arguments (including moral arguments) a mind finds “compelling” depends on the mechanistic behavior of that mind. For any given thought or action that is caused by an argument, we could in principle build a mind that responds differently to that same argument.
Orthogonality: An argument that “there can exist arbitrarily intelligent agents pursuing any kind of goal”. Just as “is” doesn’t imply “ought”, effective ability to pursue ends doesn’t imply any specific choice of ends. So the field’s normal approach to doing AI (“just try to make the thing smarter”) doesn’t give us alignment for free.
The Design Space of Minds-in-General: The abstract space of possible minds is enormous and diverse. As a special case, goals vary enormously across possible minds. (E.g., there exist enormously many preference orderings over world-histories.)
Anthropomorphic Optimism and Humans in Funny Suits: We tend to anthropomorphize inhuman optimization processes (e.g., evolution, or non-human animals), and we tend to forget how contingent human traits are. Correcting for biases like these should move us toward thinking values are less shared by default.
Superintelligent AI is Necessary for an Amazing Future, But Far From Sufficient: Human values evolved via a lengthy and complex process that surely involved many historical contingencies. But beyond this general point, we can also note specific features of human evolution that seem safety-relevant and are unlikely to be shared by STEM-level AGI systems.
Sources arguing for 2b (“Full Value Loading is Extremely Difficult”):
Complex Value Systems Are Required to Realize Valuable Futures: Humane values are surprisingly complex (containing many parts) and fragile (with many points of failure that destroy ~all of the future’s value). Also discussed more recently on Arbital.
This point also increases the expected difficulty of sufficient-for-safety goals and of pivotal acts: there are many ways for an AGI to cause disaster in the course of enabling a pivotal act, and there are many different dimensions on which powerful AGI systems need to be simultaneously safe.
Sources arguing for 2c (“Sufficient-for-Safety Goal Loading is Substantially Difficult”):
Cognitive Uncontainability, Context Disaster, and The Hidden Complexity of Wishes: Anticipating the full space of catastrophic hazards is hard, and it’s nontrivial to specify individual hazards that we do anticipate.
Niceness is unnatural and Detached Lever Fallacy: Many components of human value seem intuitively simple (e.g., “just be friendly and cooperative toward other agents”), but have many complex and contingent features that are required to produce outcomes we’d see as good. (Cf. “value-laden” on Arbital.)
Optimization Amplifies and Robust Delegation (arXiv version): Optimization amplifies slight differences between what we say we want and what we really want. Specifically, powerful optimization introduces (at least) four versions of Goodhart’s Law: regressional, extremal, causal, and adversarial. Powerful optimizers also tend to hack the repository of value.
AGI Ruin: Large parts of the post can be cited here. I would highlight 3 and 5–6 (in Section A), 10 and 12–15 (in Section B.1, “the distributional leap”), and 16–33 (all of Section B.2 on outer/inner alignment and all of Section B.3 on interpretability).
A central AI alignment problem: capabilities generalization, and the sharp left turn: Expands on a point from AGI Ruin: 21, “Capabilities generalize further than alignment once capabilities start to generalize far.” STEM-level general intelligence forms an attractor well, whereas alignment with human interests doesn’t. And “On the contrary, sliding down the capabilities well is liable to break a bunch of your existing alignment properties.”
Meta-rules for (narrow) value learning are still unsolved: It’s not clear, either in practice or in principle, what meta-procedure could be used to load the right values into an AGI over time, or what meta-meta-procedure could be used to figure out the right meta-procedure over time.
Low impact: A concrete example of a goal we might shoot for is “don’t have too large an impact”. AI Alignment: Why It’s Hard, and Where to Start discusses early failed attempts to define low-impact or corrigible goals that don’t cripple a system’s ability to do anything useful.
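The regressional version of Goodhart’s Law mentioned above can be sketched with a small simulation. This is a toy model under assumptions of my own, not code from Optimization Amplifies or Robust Delegation: each candidate has a true value drawn from a standard normal distribution, the proxy is the true value plus independent Gaussian noise, and “optimization power” is the number of candidates searched before picking the one with the highest proxy. The function name and parameters are hypothetical.

```python
import random

def regressional_goodhart(n_candidates, noise_sd, trials, seed=0):
    """Pick the candidate with the highest proxy; return mean proxy and mean true value of the pick."""
    rng = random.Random(seed)
    proxy_total = 0.0
    true_total = 0.0
    for _ in range(trials):
        best_proxy, best_true = float("-inf"), 0.0
        for _ in range(n_candidates):
            true_value = rng.gauss(0.0, 1.0)               # what we really want
            proxy = true_value + rng.gauss(0.0, noise_sd)  # what we say we want
            if proxy > best_proxy:
                best_proxy, best_true = proxy, true_value
        proxy_total += best_proxy
        true_total += best_true
    return proxy_total / trials, true_total / trials

for n in (1, 10, 100, 1000):
    avg_proxy, avg_true = regressional_goodhart(n, noise_sd=1.0, trials=500)
    print(f"n={n:5d}  proxy of pick={avg_proxy:.2f}  true value of pick={avg_true:.2f}")
```

In this setup the selected candidate’s true value is, on average, about half its proxy score (because the noise and true-value variances are assumed equal), so the absolute gap between what was measured and what was obtained widens as the search covers more candidates. The extremal, causal, and adversarial versions are not modeled here.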
Other arguments for 2d (“Pivotal Act Loading is Substantially Difficult”):
AGI Ruin, 7 and 9: Based on the fact that nobody has come up with an example of a “pivotal weak act” (something AGI could do that’s weak enough to be clearly safe absent alignment efforts, but strong enough to save the world), it seems very likely that there are no such acts.
The argument for 2d heavily overlaps with the arguments for 2b and 2c. It matters for 2d what the range of plausible pivotal acts looks like, and we haven’t published a detailed write-up on pivotal acts, though we discuss them a decent amount in the (lengthy) Late 2021 MIRI Conversations.
3. High Early Capabilities
I’ll distinguish three subclaims:
3a. Some Early Developers Will Be Able to Make Dangerously Capable STEM-Level AGIs. In particular, capable enough to understand their situation (so incentives like “wipe out humans if you find a way to do so” become apparent, if the system isn’t aligned), and capable enough to gain a decisive strategic advantage if they want one.
“Early developers” again means “within five years of the invention of STEM-level AGI”. In fact this needs to happen faster than that in order to support 3b and 3c:
3b. If Some Early Developers Can Do So, Many Early Developers Will Be Able To Do So. (Assuming the very first developers don’t kill us first; and absent defeaters like an AGI-enabled pivotal act or a sufficiently heavy-duty globally enforced ban.)
As a strong default, AGI tech will spread widely quite quickly. So even if the first developers are cautious enough to avoid disaster, we’ll face the issue that not everybody is cautious enough. And we’ll likely face this issue within only a few months or years of STEM-level AGI’s invention, which makes government responses and AGI-mediated pivotal acts far more difficult.
3c. If Many Early Developers Can Do So, Some Will Do So. (Again, absent defeaters.)
Another important claim I’d endorse is “early STEM-level AGIs will be capable enough to perform pivotal acts”, but this is cause for hope rather than a distinct reason to worry (if you already accept 3a), so it isn’t a supporting premise for this particular argument.
MIRI has never written a canonical “here are all the reasons we expect STEM-level AGI to be very powerful” argument. Some relevant sources for 3a (“Some Early Developers Will Be Able to Make Dangerously Capable STEM-Level AGIs”) are:
AGI Ruin: Points 1 (“AGI will not be upper-bounded by human ability or human learning speed”) and 2 (decisive strategic advantage is reachable).
The Basic Reasons I Expect AGI Ruin (point 1).
Comments on Carlsmith’s “Is power-seeking AI an existential risk?” (“Background” section):
2. The bottleneck on decisive strategic advantages is very likely cognition (of a deep and high-quality variety).
The challenge of building the aforementioned nanomachines is very likely bottlenecked on cognition alone. (Ribosomes exist, and look sufficiently general to open the whole domain to any mind with sufficient mastery of protein folding, and are abundant.)
In the modern world, significant amounts of infrastructure can be deployed with just an internet connection—currency can be attained anonymously, humans can be hired to carry out various physical tasks (such as RNA synthesis) without needing to meet in person first, etc.
The laws of physics have shown themselves to be “full of exploitable hacks” (such as the harnessing of electricity to power lights in every home at night, or nuclear fission to release large amounts of energy from matter, or great feats of molecular-precision engineering for which trees and viruses provide a lower-bound).
3. The abilities of a cognitive system likely scale non-continuously with the depth and quality of the cognitions.
For instance, if you can understand protein folding well enough to get 90% through the reasoning of how your nanomachines will operate in the real world, that doesn’t let you build nanomachines that have 90% of the impact of ones that are successfully built to carry out a particular purpose.
I expect I could do a lot with 100,000 trained-software-engineer-hours, that I cannot do with 1,000,000 six-year-old hours.
Some defeaters for 3a could include “STEM-level AGI is impossible (e.g., because there’s something magical and special about human minds that lets us do science)”, “there’s no way to leverage (absolute or relative) intelligence to take over the world”, and “early STEM-level AGIs won’t be (absolutely or relatively) smart enough to access any of those ways”.
I’d tentatively guess that “there will be lots of different STEM-level AGIs before any AGI can destroy the world” is false, but if it’s true, I think to a first approximation this doesn’t lower the probability of AGI ruin. This is because:
I still expect at least one early STEM-level AGI to be capable of unilaterally killing humans, if it wants to. Call this AGI “X”. If the other STEM-level AGIs don’t terminally value human survival, they will have no incentive to stop X from killing all humans (and in fact will have an incentive to help X if they can, to reduce the number of potential competitors and threats). This means that the existence of other misaligned AGIs doesn’t give X any incentive to avoid killing humans.
If no one STEM-level AGI is capable of unilaterally killing humans, I still expect early STEM-level AGIs to be able to coordinate to do so; and if they don’t terminally value human empowerment and coordination is required to disempower humans, I think they will in fact coordinate to disempower humans. This scenario is noted by Eliezer Yudkowsky here and here.
I view 3b (“If Some Early Developers Can Do So, Many Early Developers Will Be Able To Do So”) and 3c (“If Many Early Developers Can Do So, Some Will Do So”) as following from the normal way AI tech has proliferated over time: it didn’t take 10 years for other groups to match GPT-3 or ChatGPT once they were deployed, and there are plenty of incautious people who think alignment is silly, so it seems inevitable that someone will deploy powerful misaligned AGI if no major coordination effort or pivotal-act-via-AGI blocks this.
4. Conditional Ruin
Premises 1–3 each begin with “As a strong default...”, so one way to object to this premise is just to concede these are three “strong defaults”, but say they aren’t jointly strong enough to carry an “X is very likely” conclusion.
Depending on the conversational goal, I could respond by switching to a probabilistic argument, or by stipulating that “strong default” here means “strong enough to make premise 4 true”.
Beyond that, I think this claim is fairly obvious at a glance.
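To make the “jointly strong enough” objection concrete, here is a minimal sketch of how three individually strong defaults combine. The 0.9 figures are purely illustrative assumptions, not estimates from this post, and the function names are hypothetical. The first function assumes the premises are independent; the second gives a lower bound that holds however the premises are correlated.

```python
def joint_if_independent(premise_probs):
    """Probability that every premise holds, ASSUMING the premises are independent."""
    joint = 1.0
    for p in premise_probs:
        joint *= p
    return joint


def joint_lower_bound(premise_probs):
    """Lower bound on the joint probability with no independence assumption.

    Uses the union bound: the chance that at least one premise fails is at most
    the sum of the individual failure probabilities.
    """
    total_failure = sum(1.0 - p for p in premise_probs)
    return max(0.0, 1.0 - total_failure)


# Made-up credences for premises 1-3 (an assumption for illustration only).
illustrative = [0.9, 0.9, 0.9]
print(round(joint_if_independent(illustrative), 3))
print(round(joint_lower_bound(illustrative), 3))
```

With these illustrative inputs, the conjunction comes out at roughly 0.73 under independence and at least 0.7 in the worst case. Whether that counts as “very likely” is exactly what the objection turns on, which is why the stipulation about “strong default” matters.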
“[T]here will be no alignment breakthroughs or global coordination breakthroughs before we invent STEM-level AGI” is obviously a lot stronger than the conclusion requires: seeing breakthroughs in either domain doesn’t mean that the breakthroughs were sufficient to avert catastrophe. But I weakly predict that there in fact won’t be any breakthroughs in either domain, so this unnecessarily strong premise seems like a fine starting point.
When stronger claims are justifiable but weaker claims are sufficient, bad outcomes look more overdetermined, which strengthens the case for thinking we’re in a dire situation calling for an extraordinary response.
I don’t think MIRI has written a centralized argument regarding 5. We’re much more interested in intervening on it than in describing it, and if things are going well, it should look like a moving target.
We’ve written at least a little about why AGI timelines don’t look super long to us, and we’ve written at greater length about why alignment seems to us to be moving too slowly — e.g., in On How Various Plans Miss the Hard Bits of the Alignment Challenge and AGI Ruin. Posts like Security Mindset and Ordinary Paranoia, Security Mindset and the Logistic Success Curve, and Brainstorm of Things That Could Force an AI Team to Burn Their Lead help paint a qualitative picture of how hard we think it would be to actually succeed in STEM-level AGI alignment, and therefore how overdetermined failure looks.
The AGI ruin argument mostly rests on claims that the alignment and deployment problems are difficult and/or weird and novel, not on strong claims about society. The bar for a sufficient response seems high, and the responses required are unusual and extreme, with a high need for proactive rather than reactive action in the world.
Our arguments for discontinuous and rapid AI capability gains are possibly the main reason we’re more pessimistic than others about governments responding well. We also have unusually high baseline pessimism about government sanity by EA standards, but I don’t think this is the main source of model disagreement.
Other options include Joe Carlsmith’s Is Power-Seeking AI an Existential Risk? (which Nate Soares replied to here) and Katja Grace’s Argument for AI X-Risk from Competent Malign Agents.
Note that I’m releasing this post without waiting on other MIRI staff to endorse it or make changes, so this can be treated as my own attempt to build a structured argument, rather than as something Eliezer, Nate, Benya, or others would necessarily endorse.
Like “AGI”, “STEM-level AGI” lacks a formal definition. (If we did have a deep formal understanding of reasoning about the physical world, we would presumably be able to do many feats with AI that we cannot do today.)
Absent such a definition, however, we shouldn’t ignore the observed phenomenon that there’s a certain kind of problem-solving ability (observed in humans) that generalizes to inventing steam engines and landing on the Moon, even though our brains didn’t evolve under direct selection pressure to start industrial revolutions or visit other planets, and even though birds and nematodes can’t invent steam engines or land on the Moon.
We can then ask what happens when we find a way to automate this kind of problem-solving ability.
“The basic mental machinery” is vague, and maybe some would argue that GPT-4 already has all of the right “mental machinery” in some sense, in spite of its extremely limited ability to do novel STEM work in practice. (I disagree with this claim myself.)
E.g., some might analogize GPT-4 to a human child: a sufficiently young John von Neumann will lack some “basic mental machinery” required for STEM reasoning, but will at least have meta-machinery that will predictably unfold into the required machinery via normal brain development and learning.
(And, indeed, the difference between “having the basic mental machinery for STEM” and “having meta-machinery that will predictably unfold into the basic mental machinery” may not be a crisp one. Even the adult von Neumann presumably continued to upgrade his own general problem-solving software via adopting new and better heuristics.)
I don’t think that GPT-4 in fact has all of the basic mental machinery or meta-machinery for STEM, and I don’t personally think that comparing GPT-4 to a human child is very illuminating. I’m also not confident one way or the other about whether GPTs will scale to “as good at science as smart humans”.
That said, since people can disagree about the nature of general intelligence and about what’s actually going on in humans or AI systems when we do scientific work, it might be helpful to instead define “STEM-level” AI as AI technology that can (e.g.) match smart human performance in a specific hard science field, across all the scientific work humans do in that field.
As a strong default, I expect AI with that level of capability to be able to generalize to all the sciences, and to reasoning about any other topic humans can reason about; and that level of generality and capability seems to me to be the level where we face AI-mediated extinction risks.
The Arbital articles I link in this post, and most of the AI alignment content on Arbital, were written by Eliezer Yudkowsky in 2015–2017. I consider this one of the best online resources regarding AI alignment, though a lot of it is relatively unedited or incomplete.
If human whole-brain emulation is built before (or shortly after) STEM-level AGI, and this allows us to run human minds at faster speeds, then this opens up a lot more possibility for things to occur “early” (as measured in sidereal time).
It might even be possible to solve coherent extrapolated volition within five sidereal years of the invention of STEM-level AGI. (Though if so, I’m imagining this happening via ems and AI systems achieving feats that might have otherwise taken thousands of years of work, including enormous amounts of work gaining a mature understanding of the human mind, iteratively improving the ems’ speed and reasoning abilities, and very carefully and conservatively ratcheting up the capabilities of AI systems — and widening the set of tasks we can safely use them for — as we gain more mastery of alignment.)
To be clear: I’d consider it an obviously terrible idea, bordering on suicidal, to gamble the future on a pivotal act that does no monitoring or intervening in the wider world for five entire years after the invention of STEM-level AGI. I’d say that one year is already taking on a lot of risk, and three years is clearly too long.
But at the point where safety-conscious AGI developers are being cheaply run at 1000x speed relative to all the non-safety-conscious AGI developers, monitoring the world for planet-endangering threats (and intervening if necessary) is probably reasonably trivial. The hard part is getting to whole-brain emulation (and powerful hardware for running the ems) in the first place.
This is not, of course, to say that “AGI can achieve decisive strategic advantage within five years” is necessary for the AGI situation to be dire.
Also, “human survival” is a phrase some transhumanists (myself included) will object to as ambiguous. I think involuntary human death is bad, but I think it’s probably good if we voluntarily upload ourselves and develop into cool posthumans, regardless of whether that counts as biological “death” or “extinction” in some purely technical sense.
I use the phrase “human survival” in spite of all these issues because I (perhaps wrongly) imagine that Chalmers is looking for an argument that a wide variety of non-transhumanists will immediately see the importance of. Ordinary people can clearly see that it’s bad for AI to kill them and their loved ones (and can see why this is bad), without any need to wade into deep philosophical debates or utopia-crafting.
Focusing on something more abstract risks misleading people about the severity of the risk (“surely if you had something that scary in mind, you’d blurt it out rather than burying the lede”), and also about its nature (“surely if you thought AI would literally just kill everyone, you’d say that”). If I instead mostly worried about AI disaster scenarios where AI doesn’t literally kill everyone, I’d talk about those instead.
In principle one could make a simpler argument for pivotal acts by just saying “World-destroyingly-powerful AGI technology will proliferate by default, and if everyone has the ability to destroy the world then someone will inevitably do it on purpose”.
But in reality the situation is far worse than that, because even if we could limit AGI access to people who would never deliberately use AGI to try to do evil, AGI systems’ own default incentives make them extremely dangerous. Moreover, this issue blocks our ability to safely use AGI for pivotal acts as well.
It does matter that the system be able to generate hypotheses and instrumental strategies concerning the physical world; but the system’s terminal goal doesn’t need to concern the physical world in order for the system to care about steering the physical world. E.g., a system that just wants its mind to be in a certain state will care about its hardware (since changes in hardware state will affect its mind), which means caring about everything in the larger world that could potentially affect its hardware.
Cf. Microscope AI in Hubinger’s An Overview of 11 Proposals for Building Safe Advanced AI.
Microscope AI also involves “using transparency tools to verify that the model isn’t performing any optimization”, but part of my argument here is that it’s extremely unlikely we’ll be able to get major new scientific/predictive insights from AI without it doing any “optimizing”. However, we might in principle be able to verify that the AI isn’t doing too much optimizing, or optimizing in the wrong directions, or optimizing over relatively risky domains, etc. In any case, we can consider the wider space of strategies that involve inspecting the AI’s mind as an alternative to using conventional outputs of the system.
If operators have enough visibility into the AGI’s mind, and enough deep understanding and useful tools for making sense of all important information in that mind, then in principle “do useful science by looking at the AGI’s mind rather than by giving it an output channel” can prevent any catastrophes that result from the AGI deliberately optimizing against human interests.
(Though we would still need to find ways to get the AGI to do specific useful cognitions and not just harmful ones. And also, if you have that much insight into the AGI’s mind and can get it to think useful and relevant thoughts at all, then you may be able to avoid Microscope-AI approaches, by trusting the AI’s outputs so long as it hasn’t had any dangerous thoughts anywhere causally upstream of the outputs.)
In real life, however, it’s very unlikely that we’ll have that level of mastery of the first STEM-level AGI systems. If we only have partial visibility and understanding of the AGI’s mind, then Microscope AI can in principle just be used by the AGI as another output channel, particularly if it learns or deduces things about which parts of its mind we’re inspecting, how we tend to interpret different states of its brain, etc. This is a more constrained problem from the AI’s perspective, but it still seems to demand some very difficult alignment breakthroughs for humanity to perform a pivotal act by this method.
Note that “sphexish” isn’t an all-or-nothing property, and if you zoom in on any agentic brain in enough detail, you should expect the parts to eventually start looking more sphexish. This is because “agency” isn’t a primitive property, but rather arises from the interaction of many gears, and sufficiently small gears will do things more automatically, without checking first to take into account context, etc.
The important question is: “To what extent do these sphex-like gears assemble into something that’s steering toward outcomes at the macro-level, versus assembling into something that’s more sphex-like at the macro-level?”
Quoting Yudkowsky in Ngo and Yudkowsky on Alignment Difficulty: “[A]n earlier part of the path [to building AGI systems that exhibit dangerous means-ends reasoning, etc.] is from being optimized to do things difficult enough that you need to stop stepping on your own feet and have different parts of your thoughts work well together”.
Quoting Yudkowsky in Ngo and Yudkowsky on Scientific Reasoning and Pivotal Acts: “[...] Despite the inevitable fact that some surprises of this kind now exist, and that more such surprises will exist in the future, it continues to seem to me that science-and-engineering on the level of ‘invent nanotech’ still seems pretty unlikely to be easy to do with shallow thought, by means that humanity discovers before AGI tech manages to learn deep thought?
“What actual cognitive steps? Outside-the-box thinking, throwing away generalizations that governed your previous answers and even your previous questions, inventing new ways to represent your questions, figuring out which questions you need to ask and developing plans to answer them; these are some answers that I hope will be sufficiently useless to AI developers that it is safe to give them, while still pointing in the direction of things that have an un-GPT-3-like quality of depth about them.
“Doing this across unfamiliar domains that couldn’t be directly trained in by gradient descent because they were too expensive to simulate a billion examples of[.]
“If you have something this powerful, why is it not also noticing that the world contains humans? Why is it not noticing itself?”
Issues that are visible today probably won’t spontaneously solve themselves without a serious technical effort, but new obstacles can certainly crop up. (See the discussion of software development hell and robust-software-in-particular hell in The Basic Reasons I Expect AGI Ruin, and the “rocket-accelerating cryptographic Neptune probe” analogy in So Far: Unfriendly AI Edition.)
Note that “share all of our core values” is imprecise: what makes a value “core” in the relevant sense? How do we enable moral progress, and avoid locking in our current flawed values? It’s an extremely thorny problem. I endorse coherent extrapolated volition as a good (very high-level and abstract) description of desiderata for a solution. On LessWrong and Arbital, the phrase “humane values” is often used to specifically point at “the sort of values we ought to want to converge on eventually”, as opposed to our current incomplete and flawed conceptions of what’s morally valuable, aesthetically valuable, etc.
Note also that the challenge here is causing AGI systems to consistently optimize for humane values; it’s not merely to cause AGI systems to understand our values. The latter is far easier, because it doesn’t depend on the AGI’s goals; a sufficiently capable paperclip maximizer would also want to understand human goals, if its environment contained humans.
“Safely” doesn’t necessarily require that the AGI terminally values human survival. I’d put more probability on AGI systems being safe if they aren’t internally representing humans at all, with safety coming from this fact in combination with other alignment measures.
This doesn’t rule out that some responses to arguments are more common than others; and indeed, we should expect sufficiently capable minds to converge on similar responses to things like “valid logical arguments”, since accepting such arguments is very useful for being “sufficiently capable”.
The problem is that sufficiently capable reasoners don’t converge on accepting human morality. “Accept valid logical arguments” is useful for nearly all ambitious real-world ends, so we should expect it to arise relatively often as an instrumental strategy and/or as a terminal goal. “Care for humans” is useful for a far smaller range of ends.
Some relevant passages, discussing evolved aliens and then artificial minds:
“[...] I think my point estimate there is ‘most aliens are not happy to see us’, but I’m highly uncertain. Among other things, this question turns on how often the mixture of ‘sociality (such that personal success relies on more than just the kin-group), stupidity (such that calculating the exact fitness-advantage of each interaction is infeasible), and speed (such that natural selection lacks the time to gnaw the large circle of concern back down)’ occurs in intelligent races’ evolutionary histories.
“These are the sorts of features of human evolutionary history that resulted in us caring (at least upon reflection) about a much more diverse range of minds than ‘my family’, ‘my coalitional allies’, or even ‘minds I could potentially trade with’ or ‘minds that share roughly the same values and faculties as me’.
“Humans today don’t treat a family member the same as a stranger, or a sufficiently-early-development human the same as a cephalopod; but our circle of concern is certainly vastly wider than it could have been, and it has widened further as we’ve grown in power and knowledge.
“[… T]he development process of misaligned superintelligent AI is very unlike the typical process by which biological organisms evolve.
“Some relatively important differences between intelligences built by evolution-ish processes and ones built by stochastic-gradient-descent-ish processes:
“• Evolved aliens are more likely to have a genome/connectome split, and a bottleneck on the genome.
“• Aliens are more likely to have gone through societal bottlenecks.
“• Aliens are much more likely the result of optimizing directly for intergenerational prevalence. The shatterings of a target like ‘intergenerational prevalence’ are more likely to contain overlap with the good stuff, compared to the shatterings of training for whatever-training-makes-the-AGI-smart-ASAP. (Which is the sort of developer goal that’s likely to win the AGI development race and kill humanity first.)
“Evolution tends to build patterns that hang around and proliferate, whereas AGIs are likely to come from an optimization target that’s more directly like ‘be good at these games that we chose with the hope that being good at them requires intelligence’, and the shatterings of the latter are less likely to overlap with our values.”
A version of The Hidden Complexity of Wishes also appears in Complex Value Systems Are Required to Realize Valuable Futures.
Note that this is separate from the issue that it’s hard to instill particular goals into powerful AGI systems at all. This point is discussed more in AGI Ruin.
Summarizing the relevant items:
3: “We need to get alignment right on the ‘first critical try’ at operating at a ‘dangerous’ level of intelligence”. This makes it more difficult to achieve any desired property in STEM-level AGI.
5 and 6: “We can’t just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so.” If the system is weak, then flaws in its goals like “be low-impact” or “don’t hurt humans” matter less. But we need at least one system strong enough to help in some pivotal act (unless we find some way to globally limit AGI proliferation without the help of STEM-level AGI), which makes it far more dangerous if its goals are flawed.
10: “Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn’t kill you.”
12: “Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes. Problems that materialize at high intelligence and danger levels may fail to show up at safe lower levels of intelligence, or may recur after being suppressed by a first patch.”
13 and 14: “Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.”
15: “Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.”
16: “Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments.”
17: “[O]n the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.”
18: “[I]f you show an agent a reward signal that’s currently being generated by humans, the signal is not in general a reliable perfect ground truth about how aligned an action was, because another way of producing a high reward signal is to deceive, corrupt, or replace the human operators with a different causal system which generates that reward signal”.
19: “More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment”.
20: “Human operators are fallible, breakable, and manipulable.”
21 and 22: “When you have a wrong belief, reality hits back at your wrong predictions. [...] Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.” Thus “Capabilities generalize further than alignment once capabilities start to generalize far.”
Section B.3 (25–33): Sufficiently good and useful transparency / interpretability seems extremely difficult.
“Why? Because things in the capabilities well have instrumental incentives that cut against your alignment patches. Just like how your previous arithmetic errors (such as the pebble sorters on the wrong side of the Great War of 1957) get steamrolled by the development of arithmetic, so too will your attempts to make the AGI low-impact and shutdownable ultimately (by default, and in the absence of technical solutions to core alignment problems) get steamrolled by a system that pits those reflexes / intuitions / much-more-alien-behavioral-patterns against the convergent instrumental incentive to survive the day.”
Quoting from footnote 3 of A central AI alignment problem: capabilities generalization, and the sharp left turn: “Note that this is consistent with findings like ‘large language models perform just as well on moral dilemmas as they perform on non-moral ones’; to find this reassuring is to misunderstand the problem. Chimps have an easier time than squirrels following and learning from human cues. Yet this fact doesn’t particularly mean that enhanced chimps are more likely than enhanced squirrels to remove their hunger drives, once they understand inclusive genetic fitness and are able to eat purely for reasons of fitness maximization. Pre-left-turn AIs will get better at various ‘alignment’ metrics, in ways that I expect to build a false sense of security, without addressing the lurking difficulties.”
The kinds of capabilities we expect to be needed for a pivotal act are similar to those required for the strawberry problem (“Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level.”). Yudkowsky’s unfinished Zermelo-Fraenkel provability oracle draft makes the specific claim that powerful theorem-proving wouldn’t help save the world.
Cf. AGI Ruin, point 4.
I think the easiest pivotal acts are somewhat harder than the easiest strategies a misaligned AGI could use to seize power; but (looking only at capability and not alignability) I expect AGI to achieve both capabilities at around the same time, coinciding with (or following shortly after) the invention of STEM-level AGI.