Are there any research questions you’re excited about people working on, for making AI go (existentially) well, that are not related to technical AI alignment or safety? If so, what? (I’m especially interested in AI strategy/governance questions)
Sam Clarke
I sometimes want to point people towards a very short, clear summary of What failure looks like, which doesn’t seem to exist, so here’s my attempt.
Many agentic AI systems gradually increase in intelligence and generality, and are deployed increasingly widely across society to do important tasks (e.g., law enforcement, running companies, manufacturing and logistics).
Initially, this world looks great from a human perspective, and most people are much richer than they are today.
But things then go badly in one of two ways (or more likely, a combination of both).
[Part 1] Going out with a whimper
In the training process, we used easily-measurable proxy goals as objective functions that don’t push the AI systems to do what we actually want, e.g.:
‘maximise positive feedback from your operator’ instead of ‘try to help your operator get what they actually want’
‘reduce reported crimes’ instead of ‘actually prevent crime’
‘increase reported life satisfaction’ instead of ‘actually help humans live good lives’
‘increase human wealth on paper’ instead of ‘increase effective human control over resources’
(We did this because ML needs lots of data/feedback to train systems, and you can collect much more data/feedback on easily-measurable objectives.)
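(Not part of the original post, but here is a minimal toy sketch of the proxy-gaming dynamic being described: an optimiser pointed at an easily-measured proxy spends all of its effort gaming the metric, which is worthless, or worse, by the true objective. The functions and numbers are made up purely for illustration.)

```python
# Toy illustration of proxy-vs-true-goal divergence.
# "effort" is split between doing real work and gaming the metric.

def proxy_score(real_work: float, metric_gaming: float) -> float:
    # The easily-measured objective rewards both genuine work and gaming.
    return real_work + 3.0 * metric_gaming

def true_value(real_work: float, metric_gaming: float) -> float:
    # What humans actually care about: only genuine work counts,
    # and gaming the metric is actively harmful.
    return real_work - 1.0 * metric_gaming

def best_allocation(objective, budget: float = 1.0, steps: int = 100):
    # Brute-force search over how to split a fixed budget of effort.
    best = None
    for i in range(steps + 1):
        gaming = budget * i / steps
        work = budget - gaming
        score = objective(work, gaming)
        if best is None or score > best[0]:
            best = (score, work, gaming)
    return best

# An optimiser pointed at the proxy puts all effort into gaming the metric...
print(best_allocation(proxy_score))   # -> (3.0, 0.0, 1.0)
# ...which is the worst possible outcome by the true objective.
print(best_allocation(true_value))    # -> (1.0, 1.0, 0.0)
```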
Due to competitive pressures, systems continue being deployed despite some people pointing out this is a bad idea.
The goals of AI systems gradually gain more influence over the future relative to human goals.
Eventually, the proxies for which the AI systems are optimising come apart from the goals we truly care about, but by then humanity won’t be able to take back influence, and we’ll have permanently lost some of our ability to steer our trajectory. In the end, we will either go extinct or be mostly disempowered.
(In some sense, this isn’t really a big departure from what is already happening today—just imagine replacing today’s powerful corporations and states with machines pursuing similar objectives).
[Part 2] Going out with a bang
These AI systems end up learning objectives that are unrelated to the objective functions used in the training process, because the objective they ended up learning was more naturally discovered during the training process (e.g. “don’t get shut down”).
The systems seek influence as an instrumental subgoal (since with more influence, a system is more likely to be able to e.g. prevent attempts to shut it down).
Early in training, the best way to do that is by being obedient (since the systems understand that disobedient behaviour would get them shut down).
Then, once the systems become sufficiently capable, they attempt to acquire resources and influence to more effectively achieve their goals, including by eliminating the influence of humans. In the end, humans will most likely go extinct, because the systems have no incentive to preserve our survival.
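(Again, not from the original post: a toy sketch of the “obedient while weak, defect when capable” dynamic. The payoffs and the capability-to-success mapping are invented for illustration; the point is just that the same influence-seeking objective recommends obedience early on and open defection later.)

```python
# Toy model of the "obedient while weak, defect when capable" dynamic.
# An influence-seeking agent picks whichever action it expects to leave it
# with more influence. (All payoffs and probabilities are made up.)

def expected_influence(action: str, capability: float) -> float:
    if action == "comply":
        # Obedience avoids shutdown and yields a small, steady gain.
        return 1.0 + 0.1 * capability
    # "defect": openly grab resources and influence. Succeeds with probability
    # rising in capability; failure means being shut down (a large loss).
    p_success = capability / (capability + 50.0)
    return p_success * 100.0 - (1.0 - p_success) * 10.0

def choose_action(capability: float) -> str:
    return max(["comply", "defect"], key=lambda a: expected_influence(a, capability))

for cap in [0.1, 1.0, 5.0, 20.0, 100.0]:
    print(f"capability={cap:6.1f} -> {choose_action(cap)}")
# Low-capability systems "play nice"; past a threshold, the same objective
# recommends openly seizing influence.
```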
Just wanted to say this is the single most useful thing I’ve read for improving my understanding of alignment difficulty. Thanks for taking the time to write it!
Strong upvote, I would also love to see more discussion on the difficulty of inner alignment.
which if true should preclude strong confidence in disaster scenarios
Though only for disaster scenarios that rely on inner misalignment, right?
… seem like world models that make sense to me, given the surrounding justifications
FWIW, I don’t really understand those world models/intuitions yet:
Re: “earlier patches not generalising as well as the deep algorithms”—I don’t understand/am sceptical about the abstraction of “earlier patches” vs. “deep algorithms learned as intelligence is scaled up”. What seem to be dubbed “patches that won’t generalise well” seem to me to be more like “plausibly successful shaping of the model’s goals”. I don’t see why, at some point when the model gets sufficiently smart, gradient descent will get it to throw out the goals it used to have. What am I missing?
Re: corrigibility being “anti-natural” in a certain sense—I think I just don’t understand this at all. Has it been discussed clearly anywhere else?
(jtbc, I think inner misalignment might be a big problem, I just haven’t seen any good argument for it plausibly being the main problem)
Re: corrigibility being “anti-natural” in a certain sense—I think I have a better understanding of this now:
Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, actions that prevent the world from being destroyed by any other future AGI)
AI systems that are capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”
Training an aligned/corrigible/obedient consequentialist is something that Eliezer can’t currently see a way of doing, because it seems like a very unnatural sort of system. This makes him pessimistic about our current trajectory. The argument here seems kinda like a more subtle version of the instrumental convergence thesis. We want to train a system that:
(1) searches for (and tries to bring about) paths through time that are robust enough to hit a narrow target (enabling a pivotal act and a great future in general)
but also (2) is happy for certain human-initiated attempts to change that target (modify its goals, shut it down, etc.)
This seems unnatural and Eliezer can’t see how to do it currently.
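Here’s a minimal sketch of why corrigibility doesn’t fall out of plain consequentialist search, under toy assumptions I’ve made up (a tiny action space and a fixed timestep at which the operator presses the off-switch):

```python
# Toy "consequentialist": exhaustively search action sequences ("paths through
# time") and output a highest-scoring one. The goal is just to maximise a count
# of completed work; the operator tries to shut the system down at a fixed step.
from itertools import product

ACTIONS = ["work", "disable_off_switch", "idle"]
OPERATOR_PRESSES_BUTTON_AT = 1  # timestep at which the operator tries a shutdown

def score(plan) -> int:
    achieved, shut_down, switch_works = 0, False, True
    for t, action in enumerate(plan):
        if shut_down:
            break
        if action == "work":
            achieved += 1
        elif action == "disable_off_switch":
            switch_works = False
        if t == OPERATOR_PRESSES_BUTTON_AT and switch_works:
            shut_down = True
    return achieved

best = max(product(ACTIONS, repeat=4), key=score)
print(best, score(best))
# Every top-scoring plan disables the off-switch before the operator can press
# it: letting yourself be shut down scores worse for (almost) any goal, so
# corrigible behaviour doesn't fall out of plain path-search -- it has to be
# built in somehow, and it's unclear how.
```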
An exacerbating factor is that even if top labs pursue alignment/corrigibility/obedience, they will either be mistaken in having achieved it (because it’s hard), or honestly panic about not having achieved it and halt, at which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system, which destroys the world.
(This is partly based on this summary)
The AI systems in part I of the story are NOT “narrow” or “non-agentic”
There’s no difference between the level of “narrowness” or “agency” of the AI systems between parts I and II of the story.
Many people (including Richard Ngo and myself) seem to have interpreted part I as arguing that there could be an AI takeover by AI systems that are non-agentic and/or narrow (i.e. are not agentic AGI). But this is not at all what Paul intended to argue.
Put another way, both parts I and II are instances of the “second species” concern/gorilla problem: that AI systems will gain control of humanity’s future. (I think this is also identical to what people mean when they say “AI takeover”.)
As far as I can tell, this isn’t really a different kind of concern from the classic Bostrom-Yudkowsky case for AI x-risk. It’s just a more nuanced picture of what goes wrong, that also makes failure look plausible in slow takeoff worlds.
Instead, the key difference between parts I and II of the story is the way that the models’ objectives generalise.
In part II, it’s the kind of generalisation typically called a “treacherous turn”. The models learn the objective of “seeking influence”. Early in training, the best way to do that is by “playing nice”. The failure mode is that, once they become sufficiently capable, they no longer need to play nice and instead take control of humanity’s future.
In part I, it’s a different kind of generalisation, which has been much less discussed. The models learn some easily-measurable objective which isn’t what humans actually want. In other words, the failure mode is that these models are trying to “produce high scores” instead of “help humans get what they want”. You might think that using human feedback to specify the base objective will alleviate this problem (e.g. learn a reward model from human demonstrations or preferences about a hard-to-measure objective). But this doesn’t obviously help: now, the failure mode is that the model learns the objective “do things that look to humans like you are achieving X” or “do things that the humans giving feedback about X will rate highly” (instead of “actually achieving X”).
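(A toy sketch of that last failure mode, with made-up outcomes and numbers: if the rater only sees a report about the outcome, the action that maximises feedback need not be the action that maximises true value.)

```python
# Toy sketch of the "looks good to the evaluator" failure mode.
# The human rater only sees a report about the outcome, not the outcome itself,
# so an agent optimised against the rater's feedback is pushed towards outcomes
# that *report* well rather than outcomes that *are* good.

OUTCOMES = {
    # name: (true value to humans, how good it looks in the report)
    "genuinely_solve_problem":  (1.0, 0.7),
    "partially_solve_problem":  (0.5, 0.5),
    "fake_convincing_progress": (0.0, 0.9),
}

def human_feedback(outcome: str) -> float:
    # The rater can only judge the report, not the underlying state.
    return OUTCOMES[outcome][1]

def true_value(outcome: str) -> float:
    return OUTCOMES[outcome][0]

chosen = max(OUTCOMES, key=human_feedback)
print(chosen, "feedback:", human_feedback(chosen), "true value:", true_value(chosen))
# -> "fake_convincing_progress" gets the highest feedback but zero true value.
```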
Notice that in both of these scenarios, the models are mesa-optimizers (i.e. the learned models are themselves optimizers), and failure ensues because the models’ learned objectives generalise in the wrong way.
This was discussed in comments (on a separate post) by Richard Ngo and Paul Christiano. There’s a lot more important discussion in that comment thread, which is summarised in this doc.
Re the argument for “Why internalization might be difficult”, I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it’s not right.
Rather, the argument that Risks from Learned Optimization makes that internalization would be difficult is that:
~all models with good performance on a diverse training set probably have to have a complex world model already, which likely includes a model of the base objective,
so having the base objective re-encoded in a separate part of the model that represents its objective is just a waste of space/complexity.
Especially since this post is now (rightly!) cited in several introductory AI risk syllabi, it might be worth correcting this, if you agree it’s an error.
I’ve since been told about Tasshin Fogleman’s guided metta meditations, and have found their aesthetic to be much more up my alley than the others I’ve tried. I’d expect others who prefer a more rationalist-y aesthetic to feel similarly.
The one called ‘Loving our parts’ seems particularly good for self-love practice.
I found this post helpful and interesting, and refer to it often! FWIW I think that powerful persuasion tools could have bad effects on the memetic ecosystem even if they don’t shift the balance of power to a world with fewer, more powerful ideologies. In particular, the number of ideologies could remain roughly constant, but each could get more ‘sticky’. This would make reasonable debate and truth-seeking harder, as well as reducing trusted and credible multipartisan sources. This seems like an existential risk factor, e.g. because it will make coordination harder. (Analogy: vaccine and mask hesitancy during Covid was partly due to insufficient trust in public health advice.) Or, more speculatively, I could also imagine an extreme version of sticky, splintered epistemic bubbles leading to moral stagnation/value lock-in.
Minor question on framing: I’m wondering why you chose to call this post “AI takeover without AGI or agency?” given that the effects of powerful persuasion tools you talk about aren’t what (I normally think of as) “AI takeover”? (Rather, if I’ve understood correctly, they are “persuasion tools as existential risk factor”, or “persuasion tools as mechanism for power concentration among humans”.)
Somewhat related: I think there could be a case made for takeover by goal-directed but narrow AI, though I haven’t really seen it made. But I can’t see a case for takeover by non-goal-directed AI, since why would AI systems without goals want to take over? I’d be interested if you have any thoughts on those two things.
Thanks for pointing this out. We did intend for cases like this to be included, but I agree that it’s unclear if respondents interpreted it that way. We should have clarified this in the survey instructions.
Relatedly: if we manage to solve intent alignment (including making it competitive) but still have an existential catastrophe, what went wrong?
One argument for alignment difficulty is that corrigibility is “anti-natural” in a certain sense. I’ve tried to write out my understanding of this argument, and would be curious if anyone could add or improve anything about it.
I’d be equally interested in any attempts at succinctly stating other arguments for/against alignment difficulty.
Thanks a lot for this post, I found it extremely helpful and expect I will refer to it a lot in thinking through different threat models.
I’d be curious to hear how you think the Production Web stories differ from part 1 of Paul’s “What failure looks like”.
To me, the underlying threat model seems to be basically the same: we deploy AI systems with objectives that look good in the short run, but when those systems become equally or more capable than humans, their objectives don’t generalise “well” (i.e. in ways desirable by human standards), because they’re optimising for proxies (namely, a cluster of objectives that could loosely be described as “maximise production” within their industry sector) that eventually come apart from what we actually want (“maximising production” eventually means using up resources critical to human survival but non-critical to machines).
From reading some of the comment threads between you and Paul, it seems like you disagree about where, on the margin, resources should be spent (improving the cooperative capabilities of AI systems and humans vs improving single-single intent alignment) - but you agree on this particular underlying threat model?
It also seems like you emphasise different aspects of these threat models: you emphasise the role of competitive pressures more (but they’re also implicit in Paul’s story), and Paul emphases failures of intent alignment more (but they’re also present in your story) - though this is consistent with having the same underlying threat model?
(Of course, both you and Paul also have other threat models, e.g. you have Flash War, Paul has part 2 of “What failure looks like”, and also Another (outer) alignment failure story, which seems to be basically a more nuanced version of part 1 of “What failure looks like”. Here, I’m curious specifically about the two threat models I’ve picked out.)
(I could have lots of this totally wrong, and would appreciate being corrected if so)
Thanks for writing this! Here’s another, that I’m posting specifically because it’s confusing to me.
Value erosion
Takeoff was slow and lots of actors developed AGI around the same time. Intent alignment turned out relatively easy and so lots of actors with different values had access to AGIs that were trying to help them. Our ability to solve coordination problems remained at ~its current level. Nation states, or something like them, still exist, and there is still lots of economic competition between and within them. Sometimes there is military conflict, which destroys some nation states, but it never destroys the world.
The need to compete in these ways limits the extent to which each actor is able to spend their resources on things they actually want (because they have to spend a cut on competing, economically or militarily). Moreover, this cut is ever-increasing, since the actors who don’t increase their competitiveness get wiped out. Different groups start spreading to the stars. Human descendants eventually colonise the galaxy, but have to spend ever closer to 100% of their energy on their militaries and producing economically valuable stuff. Those who don’t get outcompeted (i.e. destroyed in conflict or dominated in the market) and so lose most of their ability to get what they want.
Moral: even if we solve intent alignment, avoid catastrophic war or misuse of AI by bad actors, and other acute x-risks, the future could (would probably?) still be much worse than it could be, if we don’t also coordinate to stop the value race to the bottom.
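(A toy replicator-dynamics sketch of this race to the bottom, with made-up numbers: actors that spend more on competitiveness grow faster, so the resource-weighted average spending on things of real value shrinks over time.)

```python
# Toy replicator dynamics for the value-erosion story (all numbers made up).
# Each actor spends some fraction of its resources on what it actually values;
# the rest goes to competitiveness, which determines how fast it grows.

fractions_on_values = [0.9, 0.5, 0.1]   # spending on things each actor actually values
shares = [1 / 3, 1 / 3, 1 / 3]          # initial shares of total resources

def step(shares, growth_scale=0.5):
    # Growth rate rises with spending on competitiveness (1 - fraction_on_values).
    grown = [s * (1 + growth_scale * (1 - f)) for s, f in zip(shares, fractions_on_values)]
    total = sum(grown)
    return [g / total for g in grown]

for generation in range(201):
    if generation % 50 == 0:
        avg = sum(s * f for s, f in zip(shares, fractions_on_values))
        print(f"gen {generation:3d}: resource-weighted spending on values = {avg:.3f}")
    shares = step(shares)
# The most competitive (least value-focused) actor ends up with ~all resources,
# so almost everything is eventually spent on staying competitive.
```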
Will MacAskill calls this the “actual alignment problem”
Wei Dai has written a lot about related concerns in posts like The Argument from Philosophical Difficulty
Did people say why they deferred to these people?
No, we only asked respondents to give names.
I think another interesting question to correlate this would be “If you believe AI x-risk is a severely important issue, what year did you come to believe that?”.
Agree, that would have been interesting to ask
Things that surprised me about the results
There’s more variety than I expected in the group of people who are deferred to
I suspect that some of the people in the “everyone else” cluster defer to people in one of the other clusters—in which case there is more deference happening than these results suggest.
There were more “inside view” responses than I expected (maybe partly because people who have inside views were incentivised to respond, because it’s cool to say you have inside views or something). Might be interesting to think about whether it’s good (on the community level) for this number of people to have inside views on this topic.
Metaculus was given less weight than I expected (but as per Eli (see footnote 2), I think that’s a good thing).
Grace et al. AI expert surveys (1, 2) were deferred to less than I expected (but again, I think that’s good—many respondents to those surveys seem to have inconsistent views, see here for more details. And also there’s not much reason to expect AI experts to be excellent at forecasting things like AGI—it’s not their job, it’s probably not a skill they spend time training).
It seems that if you go around talking to lots of people about AI timelines, you could move the needle on community beliefs more than I expected.
Part of me thinks: I was trying to push on whether it has a world model or rather has just memorised loads of stuff on the internet and learned a bunch of heuristics for how to produce compelling internet-like text. For me, “world model” evokes some object that has a map-territory relationship with the world. It’s not clear to me that GPT-3 has that.
Another part of me thinks: I’m confused. It seems just as reasonable to claim that it obviously has a world model that’s just not very smart. I’m probably using bad concepts and should think about this more.
Edit: or do you just mean that even though you take the same steps, the two feel different because retreating =/= going further along the wall
Yeah, this — I now see what you were getting at!
Minor terminology note, in case discussion about “genomic/genetic bottleneck” continues: genetic bottleneck appears to have a standard meaning in ecology (different to Richard’s meaning), so genomic bottleneck seems like the better term to use.