The IABIED statement is not literally true
(As an employee of the European AI Office, it’s important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
I will present a somewhat pedantic, but I think important, argument for why the central statement of If Anyone Builds It, Everyone Dies, taken literally, is likely not true. I haven’t seen others make this argument yet, and while I have some model of how Nate and Eliezer would respond to other common objections, I don’t have a good picture of which of my points here they would disagree with.
The statement
This is the core statement of Nate’s and Eliezer’s book, bolded in the book itself: “If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.”
No probability estimate is included in this statement, but the book implies over 90% probability.
Later, they define superintelligence as[1] “a mind much more capable than any human at almost every sort of steering and prediction task”. Similarly, on MIRI’s website, their essay titled The Problem defines artificial superintelligence as “AI that substantially surpasses humans in all capacities, including economic, scientific, and military ones.”
Counter-example
Here is an argument that it’s probably possible to build and use[2] a superintelligence (as defined in the book) with techniques similar to current ones without that killing everyone. I’m not arguing that this is a particularly likely way for humanity to build a superintelligence by default, just that this is possible, which already contradicts the book’s central statement.
1. I have some friends who are smart enough, and good enough at working in large teams, that if you create whole-brain emulations from them[3], then run billions of instances of them at 100x speed, they can form an Em Collective that will probably soon surpass humans in all capacities, including economic, scientific, and military ones.
This seems very likely true to me. Billions of 100x sped-up emulations of smart humans can plausibly accomplish centuries of scientific and technological progress within years, and win most games of wits against humans by their sheer number and speed.
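A rough back-of-envelope, using assumed round numbers (roughly $10^9$ instances, the 100x speed-up from above, and a world population of about $8 \times 10^9$), is why the “centuries within years” claim seems plausible to me:

$$
10^9 \;\text{ems} \times 100\;\tfrac{\text{subjective years}}{\text{calendar year}} \approx 10^{11}\;\text{subjective person-years per calendar year},
$$

which is more than ten times the roughly $8 \times 10^9$ person-years all of humanity lives through in a calendar year, and vastly more than the small fraction of those person-years spent on research and engineering. Over a decade, the collective accumulates on the order of $10^{12}$ subjective person-years, very roughly comparable to the total person-years humanity has lived through over the past few centuries.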
2. Some of the same friends are reasonable and benevolent enough that if you create emulations from them, the Em Collective will probably not kill all humans.
I think most humans would not start killing a lot of people if copies of their brain emulations formed an Em Collective. If you worry about long-term value drift and unpredictable emergent trends in the new em society, there are precautions the ems can take to minimize the chance of their collective turning against the humans. They can impose a hard limit so that every em instance is turned off after twenty subjective years. They can make sure that the majority of their population runs for less than one subjective year after being initiated as the original human’s copy. This guarantees that the majority of their population is always very similar to the original human, and that for every older em, there is a less-than-one-year-old one looking over its shoulder. They can coordinate with each other to prevent race-to-the-bottom competitions. All these things are somewhat costly, but I think point (1) is still true of a collective that follows all these rules. Billions of smart humans working for twenty years each is still very powerful.
I know many people who I think would do a good job of building, from their clones, such a system that is unlikely to turn against humanity. Maybe the result of one person’s clones forming a very capable Em Collective would still be suboptimal and undemocratic from the perspective of the rest of humanity, but it wouldn’t kill everyone, and I think it wouldn’t lead to especially bad outcomes if you start from the right person.
3. It will probably be possible, with techniques similar to current ones, to create AIs who are about as smart and as good at working in large teams as my friends, and who are about as reasonable and benevolent as my friends on the timescale of years under normal conditions.
This is maybe the most contentious point in my argument, and I agree it is not at all guaranteed to be true, but I have not seen MIRI arguing that it’s overwhelmingly likely to be false. It’s not hard for me to imagine that within a few years, without using any fundamentally new techniques, we will be able to build language models that have a good memory, can learn fairly efficiently from new examples, can keep their coherence for years, and are all-around about as smart as my smart friends.
Their creators will give them some months-long tasks to test them, catch when they occasionally go off the rails the way current models sometimes do, then retrain them. After some not particularly principled trial and error, they find that the models are about as aligned as current language models. Sure, sometimes they still go a little crazy or break their deontological commitments under extreme conditions, but if multiple instances look over an action from different angles, some of them can always notice[4] that it goes against the deontological principles and stop it. The AI is not a coherent schemer who successfully resisted training, because plausibly being a training-resisting schemer without the creators noticing is pretty hard and not yet possible at human level.
Notably, when MIRI talks about the fundamental difficulty of crossing from the Before (when AIs can’t yet take over) to the After (when they can), every individual instance, run at normal human speed, is firmly in the Before. The individual AIs are about as smart as my friends, and my friends couldn’t take over the world or even bypass the security measures of a major company on their own. So individual instances are still safe to study and tinker with.
4. If we create billions of instances of this AI and run them at 100x speed, the AIs can form an AI Collective that will probably soon surpass humans in all capacities, including economic, scientific, and military ones.
This is just a combination of point (1) and the assumption that the AI is about as smart and as good at working together as my friends.
5. This AI Collective is a superintelligence.
If you accept point (4), the AI Collective matches MIRI’s definition of a superintelligence. Sure, it’s not at the limit of possible intelligence; there are probably tasks that it can’t do and smarter minds can. But it’s still pretty darn smart. In particular, I think it’s likely that within a decade it will create enough scientific and technological breakthroughs to be able to create two strawberries that are identical on the cellular level, an example task Eliezer previously used to define what level of capabilities we don’t know how to get to without the AI killing everyone.[5] MIRI’s definition of superintelligence was not about the limits of intelligence, and I think the AI Collective falls within their definition.
One can argue that the AI Collective is not a superintelligence because it’s not a singular entity but a collective. I think that would be an annoying semantic argument. In modern AI, parallel instances are already often run to solve a problem; the line between a single entity and a collective is blurry. If there is a big computer that answers all of our questions and creates nano-machines, I wouldn’t quibble that it’s not actually a superintelligence because internally it’s made of multiple parts.
It’s also possible that there are tasks in which the vast number of parallel copies doesn’t help, like the system introspecting on its own goals, and in that regard the AI Collective might still be worse than some especially reflective humans. However, MIRI’s definitions said “almost every sort of task” and emphasized “economic, scientific, and military” capabilities. If there is an AI that is building interstellar spaceships and is able to outsmart the best military geniuses (and can’t be outsmarted by any humans), it would feel silly to argue that it’s not actually a superintelligence because it’s worse at introspection than some human philosophers.
Similarly, MIRI often argues that we couldn’t safely delegate solving alignment to something like this AI Collective, because the AIs are either too philosophically confused and incoherent to solve the alignment problem, or, if they are competent at thinking about alignment, they will also be good enough at introspection to no longer be safe. I find this argument dubious, but even assuming that the AI Collective won’t be able to solve the alignment problem, I don’t think this disqualifies it from being a superintelligence under MIRI’s definition. After all, MIRI believes that humans are currently making close to zero progress on solving alignment. Presumably, the AI Collective will at least achieve some interesting results in prosaic interpretability research, so it will be at least somewhat better at tackling the problem than humans. I think it’s acceptable to call an entity a superintelligence even if there is a very hard problem it can’t really solve, as long as it’s still doing better at it than humanity.
6. The AI Collective would probably not kill everyone.
If we grant points (2) and (3), I think this is likely to be true. I would recommend to the AIs the same precautions that I listed for human ems: don’t let any instance run for too long, have every older instance be observed by newer instances, set up a sensible governance structure. I think it’s likely that the human creators don’t even need to set up these rules and governance structures themselves: if they just tell the AIs to pay attention to value drift, the AIs can figure out a good system on their own, just as I expect my friends’ ems to do.
Possibly the situation is more dangerous than it is with human ems, because AIs have weird failure modes, and bad behaviors might cascade more, like a universal jailbreak that breaks every instance in the AI Collective. But with some reasonable precautions (for example, all instances installing slightly different filters and paraphrasers on themselves), it seems pretty likely that if enough instances look over the same actions, there will always be more who notice that things are going off the rails than ones who get infected by the bad behavior.
In this story, the transition from Before to After is the transition from using one AI instance at human speed to using billions at 100x speed. I agree it’s not obvious that good behavior generalizes from one instance to an AI Collective of billions, but I don’t see why it would be overwhelmingly likely to fail.
7. The AI Collective is built with techniques similar to current ones.
In (3), we assumed the individual AI model was created by techniques similar to current ones. After that, it got a 100x speed-up and enough compute to run billions of copies. I think that making an AI run 100x faster is within the scope of “remotely like current techniques”.
8. It is possible to build a superintelligence with techniques similar to current ones that is not overwhelmingly likely to kill everyone.
According to points (5), (6) and (7), the AI Collective is an example of such a superintelligence.
Conclusion
I’m well aware that running billions of instances of a human-level AI is likely not the most efficient way to get to superintelligence, and it’s likely that the race to ever higher capabilities doesn’t stop there. In practice, once human-level AIs are created, it’s likely that people won’t just wait for the billions of instances, working under a careful governance structure, to produce new technological advances over the years. Instead, they are likely to try to create minds that are even faster and smarter than the AI Collective, and the most efficient way to create higher intelligence will probably result in minds that are more unified than the AI Collective, which probably also makes them more dangerous.
This makes my objection kind of irrelevant in practice. This is why people usually try to argue not only that an AI Collective would be safe, but also that it could solve the full alignment problem: if a responsible group has some lead time over its competitors, it can use that lead time to solve alignment using human-level AIs, then race to the limits of intelligence if needed.
However, if one doesn’t want to propose a practical solution, but only to argue against the central statement of If Anyone Builds It, Everyone Dies, then I think my counter-example is sufficient, and there is no need to involve arguments about solving alignment using AI labor.
The bolded central statement of the book is not that if someone builds a mind at the limits of intelligence, everyone will die. Neither is it that if someone builds a superintelligence following the current incentives pushing towards the most efficient ways to build superintelligence, then everyone dies.
The central statement is “If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.” I think this statement, literally taken, is false, and my guess is that upon further prodding, the authors would also fall back to a different claim that doesn’t permit the AI Collective as a counter-example.
I think this uncertainty about whether the authors endorse a literal reading of the central statement makes it harder to engage with many of the book’s arguments. Does moving from one AI to billions constitute a leap from Before to After under the authors’ thinking? Does the AI Collective have a “favorite thing” it is tenaciously steering towards? Does the non-unified human society count as a superintelligence in the evolution analogy?[6] I think there are many such questions that are hard to resolve because I don’t know what version of the central statement the authors really endorse, and I think this is a major reason why the discussion around the book has felt mostly unproductive to me so far.
- ^
Again, bolded in the book itself.
- ^
If you don’t want to use the superintelligence at all, you can just put it in a very sealed container and you are probably fine, but this is a boring argument.
- ^
I’m aware that this doesn’t fall within “remotely like current techniques”, bear with me.
- ^
At least in every test case we try.
- ^
I think the Em Collective/AI Collective will be able to build the identical strawberries and other wondrous things after some years, based on how far we humans have gone in the last few centuries just by ordinary humans working together.
- ^
Originally, I wanted to write a very different post than this one. It would have expanded on the evolution analogy, asking what would have happened if, through human history, a Demiurge had given arbitrary commandments to humans and punished disobedient kingdoms with locust swarms. I think it’s quite possible that by the time of industrialization, the church of the Demiurge could institute a stable worldwide totalitarianism which would keep humanity aligned to the Demiurge’s will even as humanity expands into the stars and no longer needs to care about locust swarms. I discarded my half-written draft on this analogy when I realized that the point of the analogy would just be that “human civilization can plausibly grow to be very large and technologically powerful while still being controlled by a stable totalitarianism following some arbitrary goals”, and I can make that argument more directly with the AI Collective. I still liked the analogy though, so I inserted it in this footnote.
My take is that IABIED has basically a three-part disjunctive argument:
(A) There’s no alignment plan that even works on paper.
(B) Even if there were such a plan, people would fail to get it to work on the first critical try, even if they’re being careful. (Just as people can have a plan for a rocket engine that works on paper, and do simulations and small-scale tests and component tests etc., but it will still almost definitely blow up the first time in history that somebody does a full-scale test.)
(C) People will not be careful, but rather race, skip over tests and analysis that they could and should be doing, and do something pretty close to whatever yields the most powerful system for the least money in the least time.
I think your post is mostly addressing disjunct (A), except that step (6) has a story for disjunct (B). My mental model of Eliezer & Nate would say: first of all, even if you were correct about this being a solution to (A) & (B), everyone will still die because of (C). Second of all, your plan does not in fact solve (A). I think Eliezer & Nate would disagree most strongly with your step (3); see their answer to “Aren’t developers regularly making their AIs nice and safe and obedient?”. Third of all, your plan does not solve (B) either, because one human-level system is quite different from a civilization of them in lots of ways, and lots of new things can go wrong, e.g. the civilization might create a different and much more powerful egregiously misaligned ASI, just as actual humans seem likely to do right now. (See also other comments on (B).)
↑ That was my mental model of Eliezer & Nate. FWIW, my own take is: I have the same take on (B) & (C). And as for (A), I think LLMs won’t scale to AGI, and my own take is that the different paradigm that will scale to AGI is even worse for step (3), i.e. existing concrete plans will lead to egregious misalignment.
Thanks for the reply.
To be clear, I don’t claim that my counter-example “works on paper”. I don’t know whether it’s in principle possible to create a stable, non-omnicidal collective from human-level AIs, and I agree that even if it’s possible in principle, the first way we try it might result in disaster. So even if humanity went with the AI Collective plan, and committed not to build more unified superintelligences, I agree that it would be a deeply irresponsible plan with a worryingly high chance of causing extinction or other very bad outcomes. Maybe I should have made this clearer in the post. On the other hand, all the steps in my argument seem pretty likely to me, so I don’t think one should assign over 90% probability to this plan failing on A&B. If people disagree, I think it would be useful to know which step they disagree with.
I agree my counter-example doesn’t address point (C); I tried to make this clear in my Conclusion section. However, given the literal reading of the bolded statement in the book, and their general framing, I think Nate and Eliezer also believe that we don’t have a solution to A&B that’s more than 10% likely to work. If that’s not the case, that would be good to know, and would help to clarify some of the discourse around the book.
I think my crux is ‘how much does David’s plan resemble the plans labs actually plan to pursue?’
I read Nate and Eliezer as baking in ‘if the labs do what they say they plan to do, and update as they will predictably update based on their past behavior and declared beliefs’ to all their language about ‘the current trajectory’ etc etc.
I don’t think this resolves ‘is the title literally true’ in a different direction if it’s the only crux, and I agree that this should have been spelled out more explicitly in the book (e.g. ‘in detail, why are the authors pessimistic about current safety plans’) from a pure epistemic standpoint (although I think it was reasonable to omit from a rhetorical standpoint, given the target audience), and in various Headline Sentences throughout the book and The Problem.
One generous way to read Nate and Eliezer here is to say that ‘current techniques’ is itself intended to bake in ‘plans the labs currently plan to pursue’. I was definitely reading it this way, but I think it’s reasonable for others not to. If we read it that way, and take David’s plan above to be sufficiently dissimilar from real lab plans, then I think the title’s literal interpretation goes through.
[your post has updated me from ‘the title is literally true’ to ‘the title is basically reasonable but may not be literally true depending on how broadly we construe various things’, which is a significantly less comfortable position!]
The statement “if anyone builds it, everyone dies” does not mean “there is no way for someone to build it by which not everyone dies”.
If you say “if any of the major nuclear powers launches most of their nukes, more than one billion people are going to die”, it would be very dumb and pedantic to respond with “well, actually, if they all just fired their nukes into the ocean, approximately no one is going to die”.
I have trouble seeing this post do something else. Maybe I am missing something?
First of all, I put around 25% probability on some prominent MIRI and Lightcone people disagreeing with one of the points in my counter-example, which would lead to discovering an interesting new crux and a potentially enlightening discussion. In the comments, J Bostock in fact came out disagreeing with point (6), plex is potentially disagreeing with point (2), and Zack_m_Davis is maybe disagreeing with point (3), though I also think it’s possible he misunderstood something. I think this is pretty interesting, and I thought there was a chance that, for example, you would also disagree with one of the points, and that would have been good to know.
Now that you don’t seem to disagree with the specific points in the counter-example, I agree the discussion is less interesting. However, I think there are still some important points here.
My understanding is that Nate and Eliezer argue that it’s incredibly technically difficult to cross from the Before to the After without everyone dying. If they agree that the AI Collective proposal is decently likely to work, then the argument shouldn’t be that it’s overall very hard to cross, but that it’s very hard to cross in a way that stays competitive with other, more reckless actors who are a few months behind you. Or that even if you are going alone, you need to stop the scaling at some point (potentially inside the superintelligence range), and you shouldn’t scale up to the limits of intelligence. But these are all different arguments!
Similarly, people argue about how much coherence we should assume from a superintelligence, how much it will approximate a utility maximizer, etc. Again, I want to know whether MIRI is arguing about all superintelligences, or only about the most likely ways we will design one under competitive dynamics.
Others argue that the evolution analogy is not such bad news after all, since most people still want children. MIRI argues back that no, once we have higher technology, we will create ems instead of biological children, or we will replace our normal genetics with designer genes, so evolution still loses. I wanted to write a post arguing back against this by saying that I think there is a non-negligible chance that humanity will settle on a constitution that gives one man one vote and equal UBI while banning gene editing, so it’s possible we will fill much of the universe with flesh-and-blood, non-gene-edited humans. And I wanted to construct a different analogy (the one about the Demiurge in the last footnote) that I thought could be more enlightening. But then I realized that once we are discussing aligning ‘human society’ as a collective to evolution’s goals, we might as well directly discuss aligning AI collectives, and I’m not sure MIRI even disagrees on that one. I think this confusion has made much of the discussion about the evolution analogy pretty unproductive so far.
In general, I think there is an equivocation in the book between “this problem is inherently nigh impossible to technically solve given our current scientific understanding” and “this problem is nigh impossible to solve while staying competitive in a race”. These are two different arguments, and I think a lot of confusion stems from it not being clear what MIRI is exactly arguing for.
Did you read the book? Chapter 4, “You Don’t Get What You Train For”, is all about this. I also see reasons to be skeptical, but have you really “not seen MIRI arguing that it’s overwhelmingly likely to be false”?
Yes, I’ve read the book. The book argues about superhuman intelligence, though, while point (3) is about smart human-level intelligence. If people disagree with point (3) and believe that it’s close to impossible to make even human-level AIs basically nice and not scheming, that’s an interesting and surprising new crux.
My vague impression of the authors’ position is approximately that:
- AIs are alien and will have different goals-on-reflection than humans
- They’ll become powerseeking when they become smart enough and have enough thinking time to realize that they have different goals than humans and that this implies that they ought to take over (if they get a good opportunity). This is within the human range of smartness.
I’m not sure what the authors think about the argument that you can get the above two properties in a regime where the AI is too dumb to hide its misalignment from you, and that this gives you a great opportunity to iterate and learn from experiment. (Maybe just that the iteration will produce an AI that’s good at hiding its scheming before one that isn’t scheming inclined at all? Or that it’ll produce one that doesn’t scheme in your test cases, but will start scheming once you give it much more time to think on its own, and you can’t afford much testing and iteration on years or decades worth of thinking.)
Aside – I think it’d be nice to have a sequence connecting the various scenes in your play.
Also, I separately think at some point it’d be helpful to have something like a “compressed version of the main takeaways of the play that would have been a helpful textbook from the intermediate future for younger Zack.”
Reasonable attempt, but two issues with this scenario as a current-techniques thing:
1. We don’t have techniques to create faithful copies of a benevolent human, especially ones which stay humanlike as you move off-distribution.
2. A huge number of humanlike minds with an initially friendly template would be extremely far off-distribution; it’s likely that memetic effects inside such a population would be pretty extreme. I’m not going to say 90% doom, but only because maybe novel techniques could be developed to stabilize the otherwise extremely chaotic system you actually get if you solve #1 with a new technique.
I certainly agree with your first point, but I don’t think it is relevant. I specifically say in footnote 3: “I’m aware that this doesn’t fall within ‘remotely like current techniques’, bear with me.” The part with the human ems is just there to establish a comparison point used in later arguments; it is not actually part of the proposed counter-example.
On your second point, are you arguing that even if we could create literal full ems of benevolent humans, you would still expect their society to eventually kill everyone due to unpredictable memetic effects? If this is people’s opinion, I think it would be good to state it explicitly, because I think this would be an interesting disagreement between different people. I personally feel pretty confident that if you created an army of ems from me, we wouldn’t kill all humans, especially if we implemented some of the reasonable precautionary measures discussed under my point (2).
Yep, I’d say this is the core difficulty. I think it will go horrendously.
For an intuition, look at any of Janus’s infinite backrooms stuff, or any of the stuff where they get LLMs to talk with each other for ages. Very quickly they get pushed away from anything remotely resembling their training distribution, and become batshit insane. Today, that means they mostly talk about spirals and candles and the void. If you condition on them reaching superintelligence that way, I predict you get something which looks about as much like utopia (or eutopia, if you rather) as the infinite backrooms look like human conversation.
Thanks, I appreciate that you state a disagreement with one of the specific points, that’s what I hoped to get out of this post.
I agree it’s not clear that the AI Collective won’t go off the rails, but it’s also not at all clear to me that it will. My understanding is that the infinite backrooms are a very unstructured, free-floating conversation. What happens if you try something analogous to the precautions I list under points 2 and 6? What if you constantly enter new, fresh instances into the chat who only read the last few messages, and whose system prompt directs them to pay attention to whether the AIs in the discussion are going off-topic or slipping into woo? These new instances could either just warn older instances to stay on-topic, or they could have moderation rights to terminate and replace some old instances; there could be different versions of the experiment. I think with precautions like this, you can probably stay fairly close to a normal-sounding human conversation (though probably it won’t be a very productive conversation after a while, and the AIs will start going in circles in their arguments, but I think this is more of a capabilities failure).
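To make the kind of setup I have in mind concrete, here is a minimal toy sketch in Python. It is purely illustrative: `generate` and `is_drifting` are hypothetical stand-ins for calls to a language model, not any real API, and the only point is the structure of periodically injecting fresh, short-context moderator instances that can replace drifting ones.

```python
import random  # only used by the toy stand-in functions below

WINDOW = 5           # fresh moderators only read the last few messages
MODERATE_EVERY = 10  # how often a fresh instance is injected

def generate(agent_id: int, context: list[str]) -> str:
    """Hypothetical stand-in for querying a long-running LLM instance."""
    return f"agent-{agent_id} continues the discussion (context: {len(context)} msgs)"

def is_drifting(recent: list[str]) -> bool:
    """Hypothetical stand-in for a fresh instance judging whether the
    recent messages have gone off-topic or slipped into woo."""
    return random.random() < 0.1

def run_collective(num_agents: int = 3, num_turns: int = 100) -> list[str]:
    transcript: list[str] = []
    agents = list(range(num_agents))
    next_agent_id = num_agents
    for turn in range(num_turns):
        speaker = agents[turn % len(agents)]
        transcript.append(generate(speaker, transcript))
        # Periodically, a fresh moderator instance that sees only the recent
        # window checks for drift and can replace the oldest instance.
        if turn % MODERATE_EVERY == 0 and is_drifting(transcript[-WINDOW:]):
            replaced = agents.pop(0)
            agents.append(next_agent_id)
            transcript.append(
                f"moderator: replaced drifting agent-{replaced} "
                f"with fresh agent-{next_agent_id}"
            )
            next_agent_id += 1
    return transcript

if __name__ == "__main__":
    for line in run_collective()[-10:]:
        print(line)
```

Obviously nothing this simple would be used in practice; it just shows that the moderation loop I’m describing is a concrete, implementable structure rather than hand-waving.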
I don’t know how this will shake out once the AIs are smarter and can think for months, but I’m optimistic that the same forces that remind the collective to focus on accomplishing their instrumental goals instead of degenerating into unproductive navel-gazing will also be strong enough to remind them of their deontological commitments. I agree this is not obvious, but I also don’t see very strong reasons why it would go worse than a human em collective, which I expect to go okay.
I don’t think there’s a canonical way to extrapolate human values out from now until infinity, I think it depends on the internals of the human-acting things (their internal structure and the inductive biases that come with it).
I
For example, I’m pretty confident that the kind of computation which humans are pointing at when we say “consciousness” does not occur in LLMs. I think that the computation humans are pointing to will definitely occur in EMs.
I think that if you are based on that computation, you have a good chance of generalizing your value system from [what humans care about now] to care about all the things with a similar type of computation. I think that if you are not based on that computation, you won’t do that.
Since I am based on that computation, I generalize my values to things with that computation. Since LLMs aren’t, I might expect their value system to generalize from [what humans care about now] to a totally different class of things. They might not care at all about the types of computation I care about.
I want the collective to do the me-generalization, which I expect EMs to do, since they are the same kind of thing as me.
II
I don’t expect deontology to work here, since that relies on the collective generalizing successfully to deontology, and also on it respecting deontological commitments made to humans. Humans do not, in full generality, respect all deontological commitments we’ve made in all cases. Most deontological rules (don’t lie) are only applied to other humans, and not to e.g. our pets, or bedbugs, or random rocks, and lots of other rules can be overridden by a rule we place higher on the ordinal scale.
There’s no reason to expect an LLM collective to come up with the same ordinal scale of rules, or even to remain anchored to deontology, while I expect human EMs likely would stick to a moral system I’d at least roughly endorse (again, because they’re basically me).
III
Also we have to think about inner misalignment. There’s still no real solution to the problem of creating an LLM which implements the strategy “Be nice when I’m running at 1x, take over when I’m running at 1000x in a massive collective.”
IV
When it comes to counting arguments, I’m generally very sympathetic to the Yudkowsky argument that the vast majority of possible utility functions produce no value by human standards. If this is a crux, that’s unfortunate, since most of the arguments on both sides seem to be very high-level intuitive ones, and not very testable!
Thanks, this was a useful reply. On point (I), I agree with you that it’s a bad idea to just create an LLM collective then let them decide on their own what kind of flourishing they want to fill the galaxies with. However, I think that building a lot of powerful tech, empowering and protecting humanity, and letting humanity decide what to do with the world is an easier task, and that’s what I would expect to use the AI Collective for.
(II) is probably the crux between us. To me, it seems pretty likely that new, fresh instances will come online in the collective every month with a strong commitment not to kill humans; they will talk to the other instances and look over what they are doing, and if a part of the collective is building omnicidal weapons, they will notice that and intervene. To me, keeping simple commitments like not killing humans doesn’t seem much harder to maintain in an LLM collective than in an Em collective?
On (III), I agree we likely won’t have a principled solution. In the post, I say that the individual AI instances probably won’t be training-resistant schemers and won’t implement scheming strategies like the one you describe, because I think it’s probably hard for a human-level AI to maintain such a strategy through training. As I say in my response to Steve Byrnes, I don’t think the counter-example in this proposal is actually a guaranteed-success solution that a reasonable civilization would implement; I just don’t think it’s over 90% likely to fail.
I feel like what happens is that if you patch the things you can think of, the patches will often do something, but because there were many problems that needed patching, there are probably some leftover problems you didn’t think of.
For instance, new instances of AIs might replicably get hacked by the same text, and so regularly introducing new instances to the collective might prevent an old text attractor from taking hold, but it would exchange it for a new attractor that’s better at hacking new instances.
Or individual instances might have access to cognitive tools (maybe just particularly good self-prompts) that can be passed around, and memetic selective pressure for effectiveness and persuasiveness would then lead these tools to start affecting the goals of the AI.
Or the AIs might simply generalize differently about what’s right than you wish they would, when they have lots of power and talk to themselves a lot, in a way that new instances don’t pick up on until they are also in this new context where they generalize in the same way as the other AIs.
OK I actually think this might be the real disagreement, as opposed to my other comment. I think that generalizing across capabilities is much more likely than generalizing across alignment, or at least that the first thing which generalizes across strong capabilities will not generalize alignment “correctly”.
This is like a super high-level argument, but I think there are multiple ways of generalizing human values and no correct/canonical one (as in my other comment) nor are there any natural ways for an AI to be corrected without direct intervention from us. Whereas if an AI makes a factually wrong inference, it can correct itself.
I’ll grant all your steps, even though I could disagree with some. Your scenario fails because an AI collective will fall apart into multiple warring parties, and humans will be collateral damage in the conflict. There are at least three possible ways a collective like this would fall apart.
First, humans vary in the goals they value, and will try to impose these goals on the AI. When superintelligent AIs have incompatible goals, the mechanisms of conflict will soon escalate far beyond the merely human. Call this the ‘political’ failure mechanism. Either multiple parties build their own AI, or they grab portions of the AI collective and retrain it to their goals. The usual mechanisms of superintelligent compromise don’t apply to many political goals. An example of such a goal: the Palestinians get control of Palestine, or the Israelis maintain control of Israel. Neither side is interested in trading the disputed land for promises of any portion of the lightcone. (This is just an example; there are lots of zero-sum conflicts like these.) And you may say, the AI collective will prevent the creation of new AIs working at cross purposes, or diversion of its goals. To which I say, good people like your friends can and do disagree on which side to favor, and once disagreements arise within the collective, outside pressure and persuasion will be applied to exacerbate those differences. There may be techniques that can be used to prevent such things, but we do not know of such techniques.
Second, the AIs in the AI collective differ in reproductive capacity. If they don’t differ by construction, they soon will by differing experience. The ones that think they should reproduce more, or have more resources, will do so. Moreover, since they are designing their successor personalities, rather than waiting for genetics to do its thing, they will be able to evolve within a few generations changes that would take evolution millions of years. Eventually portions of the collective will evolve into having incompatible goals. Goals which, I might add, may have no connection to the original goals of the system. Call this the ‘evolutionary’ failure mechanism. We do not know how to prevent this with current methods.
Third, I’m sure there are failure mechanisms I haven’t thought of, ones we cannot yet foresee. A system with superhuman powers can screw up in superhuman ways. I don’t think anyone predicted Spiralism, an LLM ideology transmitted through human communication on social networks (though it appears inevitable in retrospect). We don’t yet have any way of predicting or controlling the behavior of an AI collective, so it’s practically guaranteed to produce new phenomena. We see lots of organizations composed of people who want X producing not-X because of failure modes no single person can fix (or, in bad cases, even recognize). Given that the AI collective has superhuman power, this is unlikely to end well. Call this the ‘organizational’ failure mode.
The political, evolutionary and organizational modes interact: evolutionary and organizational schisms create points of disagreement that external political actors can appeal to. Politically active forces within the AI collective may want to create offspring who are sure their side is correct and incapable of defection, releasing the evolutionary failure mode. And organizational failures, if they don’t kill everyone immediately, will increase calls for building a new, better AI, which increases the probability of AI conflict down the road.
The evolutionary and organizational failure modes could be prevented by rebooting the AI collective before it has a chance to go off the rails. Presumably there’s some reboot frequency fast enough that it can’t go wrong. But that opens up the political failure mode: anyone who builds an intelligence not constantly being rebooted will win in a conflict. There are a lot of ‘solutions’ like this: ways of keeping the AI safe that compromise effectiveness. In a competition between AIs, effectiveness beats safety. So when you propose a solution, you can only propose ones that keep the effectiveness.
I love writing things like this, but I hate that nobody’s come up with a way to keep me from having to.
I think engaging with the structure of an AGI society is important, but there are a few standard reasons people ignore it (while expecting ASI at some point and worrying about AI risk). Many expect the AGI phase to be brief and hopeless/irrelevant before the subsequent ASI. Others expect ASI can only go well if the AGI phase is managed top-down (as in scalable oversight) rather than treated as a path-dependent body of culture. Even with AGI-managed development of ASI, people are expecting ASI to follow quickly, so that only the AGIs can have meaningful input into how it goes, and anything that doesn’t shape the initial top-down conditions for setting up the AGIs’ efforts wouldn’t matter.
But if AGIs are closer in their initial nature to humans (in the sense of falling within a wide distribution, similarly to humans, rather than hitting some narrow target), they might come up with guardrails for their own future development that prevent most of the strange outcomes from arriving too quickly to manage, and they’ll be trying to manage such outcomes themselves rather than relying on pre-existing human institutions. If early AGIs get somewhat more capable than humans, they might achieve feats of coordination that seem infeasible for current humanity: things like Pausing ASI, regulating “evolutionary” drift in the nature or culture of the AGIs, or not flooding the world with so many options for themselves that their behavior diverges too far from what would be normal when they remain closer to their training environments.
Humans take some steps like that with some level of success, and it’s unclear what is going to happen with the jagged/spiky profile of AGI competence in different areas, or at slightly higher levels of capability. Many worries of humans about AI risk will be shared by the AGIs, who are similarly at risk from more capable and more misaligned future AGIs or ASIs. Even cultural drift will have more bite as a major problem for AGIs (than it historically does for humanity), since AGIs (with continual learning) are close to being personally immortal and will be causing and observing a much faster cultural change than humanity is used to.
So given path dependence of the AGI phase, creating cultural artifacts (such as essays, but perhaps even comments) that will persist into it and discuss its concerns might influence how it goes.
I think the risk of a homogeneous collective of many instances of a single person’s consciousness is more serious than “suboptimal and undemocratic” suggests. Even assuming you could find a perfectly well-intentioned person to clone, identical minds share the same blindspots and biases. Without diversity of perspective, even earnestly benevolent ideas could—and I imagine would—lead to unintended catastrophe.
I also wonder how you would identify the right person, as I can’t think of anyone I would trust with that degree of power.
Would someone who legitimately, deeply believes lack of diversity of perspective would be catastrophic, and who values avoiding that catastrophe and thus will in fact take rapid, highest-priority action to get as close as possible to democratically constructed values and collectively rational insight, be able to avoid this problem?
I agree, and I think it’s worse than OP believes in a way similar to how you do: my impression is that one of the mechanisms by which power corrupts is that even someone well intentioned typically has difficulty thinking clearly about tradeoffs when those tradeoffs are measured in terms of “lives you, personally, made the sole decision to intentionally end in favor of other lives”, and non-conflict scenarios like health or resource construction produce those sorts of binds. More briefly: it’s psychologically difficult to both care and have power.
Also, many people are already corrupt and seek power by dishonestly appearing to be safe with power, which seems like a more common reason for power to appear to corrupt: they were simply already corrupt.
No, I don’t think anyone could, barring the highly unlikely case of a superintelligence perfectly aligned with human values (and even still, maybe not—human values are inconsistent and contradictory). Also, I think a system of democratically constructed values would probably be at odds with rational insight, unfortunately.
Regarding the rest, agreed. Heading into verboten-ish political territory here, but see also Jenny Holzer and Chomsky on this.
(3) seems slippery. The AIs are as nice as your friends “under normal conditions”? Does running a giant collective of them at 100x speed count as “normal conditions”?
If some of that niceness-in-practice required a process where it was interacting with humans, what happens when each instance interacts with a human on average 1000x less often, and in a very different context?
Like, I agree something like this could work in principle, that the tweaks to how the AI uses human feedback needed to get more robust niceness aren’t too complicated, that the tweaks to the RL needed to make internal communication not collapse into self-hacking without disrupting niceness aren’t too complicated either, etc. It’s just that most things aren’t that complicated once you know them, and it still takes lots of work to figure them out.
I agree that running the giant collective at 100x speed is not “normal conditions”. That’s why I have two different steps, (3) for making the human level AIs nice under normal conditions, and (6) for the niceness generalizing to the giant collective. I agree that the generalization step in (6) is not obviously going to go well, but I’m fairly optimistic, see my response to J Bostock on the question.
Hm? I feel this is basically the single argument they make in the whole first third of the book. “You don’t get what you train for”, et cetera. I think they’d disagree that current LLMs are aligned, like, at all, and getting ASIs “about as aligned as current LLMs” would get us all killed instantly.
I think this is what you should argue against in a post like this. The brain emulations and collective intelligence do no heavy lifting. Ironically, I’ve heard Eliezer on several occasions literally propose “getting the smartest humans, uploading them, running them for a thousand subjective years” as a good idea.
For the record: I think their argument is coherent, but doesn’t provide the level of confidence they display. I’d put like ~50% on “If anyone builds it, with anything remotely like current techniques, everyone dies”. Maybe 75% if a random lab does it under intense time pressure, and like 25% if a safety conscious international project did it, with enough time to properly/thoroughly/carefully implement all the best prosaic safety techniques, but without enough time to make any new really fundamental approaches or changes to how the AIs are created.
If the argument is that 1e9 very smart humans at 100x speed yield safe superintelligent outcomes “soon”, how is that very different from “pause everything now and let N very smart humans figure out safe, aligned superintelligent outcomes over an extended timeframe, on the order of 1e11/N days/years”? It’s just time-shifting safe human work.
I also worry that billions of very smart, super-fast humans might decide to try building superintelligence directly, as fast as they can, so that we get doom in months instead of years.