# Daniel Kokotajlo comments on Draft report on existential risk from power-seeking AI

• Thanks for this! I like your concept of APS systems; I think I might use that going forward. I think this document works as a good “conservative” (i.e. optimistic) case for worrying about AI risk. As you might expect, I think the real chances of disaster are higher. For more on why I think this, well, there are the sequences of posts I wrote and of course I’d love to chat with you anytime and run some additional arguments by you.

For now I’ll just say: 5% total APS risk seems to me to fail a sanity check, as follows:

1. There’s at least an X% chance of APS systems being made by 2035. (I think X = 60 and I think it’s unreasonable to have X<30 (and I’m happy to say more about why) but you’ll probably agree X is at least 10, right?)

2. Conditional on that happening, it seems like the probability of existential catastrophe is quite high, like 50% or so. (Conditional on APS happening that soon, takeoff is likely to be relatively fast, there won’t have been much time to do alignment research, and more generally the optimistic slow-takeoff picture, in which we get lots of nice scary warning shots and society has lots of time to react, will just not be true.)

3. Therefore the probability of doom-by-APS-by-2035 is at least 0.5X, so at least 5%.

4. Therefore the probability of doom-by-APS-by-2070 must be significantly higher than 5%.
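The arithmetic in steps 1 to 3 can be sketched in a few lines. This is only an illustration: the function name `doom_by_2035_lower_bound` and its parameter names are hypothetical, and the 0.5 default is the rough "50% or so" figure from step 2, not a measured quantity.

```python
def doom_by_2035_lower_bound(p_aps_by_2035: float,
                             p_catastrophe_given_aps: float = 0.5) -> float:
    """Lower bound on P(doom-by-APS-by-2035) from steps 1-3.

    p_aps_by_2035           -- X from step 1, as a probability in [0, 1]
    p_catastrophe_given_aps -- step 2's conditional probability (assumed ~0.5)
    """
    for p in (p_aps_by_2035, p_catastrophe_given_aps):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Step 3: P(APS by 2035 and catastrophe) = P(APS by 2035) * P(catastrophe | APS by 2035)
    return p_aps_by_2035 * p_catastrophe_given_aps


# The three values of X mentioned in step 1: 10%, 30%, 60%.
for x in (0.10, 0.30, 0.60):
    print(f"X = {x:.0%}: doom-by-APS-by-2035 >= {doom_by_2035_lower_bound(x):.0%}")
```

Even the most conservative X in step 1 (10%) already yields the 5% figure, which is why step 4 concludes that the total by 2070 must be higher still.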

Also: It seems that most of your optimism comes from assigning only 40%*65%*40% ~= 10% chance to the combined claim “Conditional on it being both possible and strongly incentivized to build APS systems, APS systems will end up disempowering approximately all of humanity.” This to me sounds like you basically have 90% credence that the alignment problem will be solved and implemented successfully in time, in worlds where the problem is real (i.e. APS systems are possible and incentivized). I feel like it’s hard for me to be that confident, considering how generally shitty the world is at solving problems even when they are obvious and simple and killing people every day and the solution is known, and considering how this problem is disputed and complex, won’t be killing people until it is already too late or almost too late, and has no known solution. Perhaps a related argument would be: Couldn’t you run your same arguments to conclude that the probability of nuclear war in the past 100 years was about 10%? And don’t we have good reason to think that in fact the probability was higher than that and we just got lucky? (See: the history of close calls, plus independently the anthropic shadow stuff.)
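For concreteness, the product quoted above can be checked directly. This is a minimal sketch: the variable names are hypothetical, and the three factors are simply the 40%, 65%, and 40% figures cited in this comment.

```python
import math

# The three conditional probabilities quoted above (40% * 65% * 40%).
factors = [0.40, 0.65, 0.40]

# Chaining conditional probabilities means multiplying them.
combined = math.prod(factors)

# The complement is the credence that the combined claim does NOT come true,
# i.e. the "basically 90%" referred to above.
implied_complement = 1.0 - combined

print(f"combined claim: {combined:.3f}")
print(f"complement:     {implied_complement:.3f}")
```

The product comes to 0.104, so "~10%" and "basically 90%" are fair roundings of the figures as quoted.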

• Hi Daniel,

Thanks for reading. I think estimating p(doom) by different dates (and in different take-off scenarios) can be a helpful consistency check, but I disagree with your particular “sanity check” here, and in particular with premise (2). That is, I don’t think that conditional on APS-systems becoming possible/financially feasible by 2035, it’s clear that we should have at least 50% on doom (perhaps some of the disagreement here is about what it takes for the problem to be “real,” and to get “solved”?). Nor do I see 10% on “Conditional on it being both possible and strongly incentivized to build APS systems, APS systems will end up disempowering approximately all of humanity” as obviously overconfident (though I do take some objections in this vein seriously). I’m not sure exactly what “10% on nuclear war” analog argument you have in mind: would you be able to sketch it out, even if hazily?

• Thanks for the thoughtful reply. Here are my answers to your questions:

Here is what you say in support of your probability judgment of 10% on “Conditional on it being both possible and strongly incentivized to build APS systems, APS systems will end up disempowering approximately all of humanity”:

Beyond this, though, I’m also unsure about the relative difficulty of creating practically PS-aligned systems, vs. creating systems that would be practically PS-misaligned, if deployed, but which are still superficially attractive to deploy. One commonly cited route to this is via a system actively pretending to be more aligned than it is. This seems possible, and predictable in some cases; but it’s also a fairly specific behavior, limited to systems with a particular pattern of incentives (for example, they need to be sufficiently non-myopic to care about getting deployed, there need to be sufficient benefits to deployment, and so on), and whose deception goes undetected. It’s not clear to me how common to expect this to be, especially given that we’ll likely be on the lookout for it.

More generally, I expect decision-makers to face various incentives (economic/social backlash, regulation, liability, the threat of personal harm, and so forth) that reduce the attraction of deploying systems whose practical PS-alignment remains significantly uncertain. And absent active/successful deception, I expect default forms of testing to reveal many PS-alignment problems ahead of time.

The 35% on this premise being false comes centrally from the fact that (a) I expect us to have seen a good number of warning shots before we reach really high-impact practical PS-alignment failures, so this premise requires that we haven’t responded to those adequately, (b) the time-horizons and capabilities of the relevant practically PS-misaligned systems might be limited in various ways, thereby reducing potential damage, and (c) practical PS-alignment failures on the scale of trillions of dollars (in combination) are major mistakes, which relevant actors will have strong incentives, other things equal, to avoid/prevent (from market pressure, regulation, self-interested and altruistic concern, and so forth).
...
I’m going to say: 40%. There’s a very big difference between >\$1 trillion dollars of damage (~6 Hurricane Katrinas), and the complete disempowerment of humanity; and especially in slower take-off scenarios, I don’t think it at all a foregone conclusion that misaligned power-seeking that causes the former will scale to the latter.

As I read it, your analysis is something like: Probably these systems won’t be actively trying to deceive us. Even if they are, we’ll probably notice it and stop it since we’ll be on the lookout for it. Systems that may not be aligned probably won’t be deployed, because people will be afraid of dangers, thanks to warning shots. Even if they are deployed, the damage will probably be limited, since probably even unaligned systems won’t be willing and able to completely disempower humanity.

My response is: This just does not seem plausible conditional on it all happening by 2035. I think I’ll concede that the issue of whether they’ll be trying to deceive us is independent of whether timelines are short or long. However, in short-timelines scenarios there will be fewer (I would argue zero) warning shots, and less time for AI risk to be taken seriously by all the prestigious people. Moreover, takeoff is likely to be fast, with less time for policymakers and whatnot to react and less time for overseers to study and analyze their AIs. I think I’ll also concede that timelines are not correlated with willingness to disempower humanity, but they are correlated with ability, due to takeoff speed considerations: if timelines are short, then when we get crazy AI we’ll be able to get crazier AI quickly by scaling up a bit more, and also separately it probably takes less time to “cross the human range.” Moreover, if timelines are short then we should expect prestigious people, institutions, etc. to be as collectively incompetent as they are today; consider how COVID was handled and is still being handled. Even if we get warning shots, I don’t expect the reactions to them to help much; I expect them instead to simply patch over problems and maybe delay doom for a bit. AI risk stuff will become a polarized partisan political issue with lots of talking heads yelling at each other and lots of misguided people trying to influence the powers that be to do this or that. In that environment finding the truth will be difficult, and so will finding and implementing the correct AI-risk-reducing policies.

My nuclear winter argument was, at a high level, something like: Your argument for 10% is pretty general, and could be used to argue for <10% risk for a lot of things, e.g. nuclear war. Yet empirically the risk for those things is higher than that.

Your argument as applied to nuclear war would be something like: Probably nations won’t build enough nuclear weapons to cause nuclear winter. Even if they do, they wouldn’t set up systems with a risk of accident, since there would be warning shots and people would be afraid of the dangers. Even if there is a failure and a nuke is set off, it probably wouldn’t lead to nuclear winter since decision-makers would deescalate rather than escalate.

I would say: The probability of nuclear winter this century was higher than 10%, and moreover, nuclear winter is a significantly easier-to-avoid problem than APS-AI risk IMO, because psychologically and culturally it’s a lot easier to convince people that nukes are dangerous and that they shouldn’t be launched and that there should be lots of redundant safeguards on them than that [insert newest version of incredibly popular and profitable AI system here] is dangerous and shouldn’t be deployed or even built in the first place. Moreover it’s a lot easier, technically, to put redundant safeguards on nuclear weapons than to solve the alignment problem!

Nuclear winter was just the first thing that came to mind, but my argument would probably be a lot stronger if I chose other examples. The general idea is that on my reading of history, preventing APS-AI risk is just a lot harder, and a lot less likely to succeed, than preventing various other kinds of risk, some of which in fact happened or very nearly happened.

• Hi Daniel,

Thanks for taking the time to clarify.

One other factor for me, beyond those you quote, is the “absolute” difficulty of ensuring practical PS-alignment, e.g. (from my discussion of premise 3):

Part of this uncertainty has to do with the “absolute” difficulty of achieving practical PS-alignment, granted that you can build APS systems at all. A system’s practical PS-alignment depends on the specific interaction between a number of variables: notably, its capabilities (which could themselves be controlled/limited in various ways), its objectives (including the time horizon of the objectives in question), and the circumstances it will in fact be exposed to (circumstances that could involve various physical constraints, monitoring mechanisms, and incentives, bolstered in power by difficult-to-anticipate future technology, including AI technology). I expect problems with proxies and search to make controlling objectives harder; and I expect barriers to understanding (along with adversarial dynamics, if they arise pre-deployment) to exacerbate difficulties more generally; but even so, it also seems possible to me that it won’t be “that hard” (by the time we can build APS systems at all) to eliminate many tendencies towards misaligned power-seeking (for example, it seems plausible to me that selecting very strongly against (observable) misaligned power-seeking during training goes a long way), conditional on retaining realistic levels of control over a system’s post-deployment capabilities and circumstances (though how often one can retain this control is a further question).

My sense is that relative to you, I am (a) less convinced that ensuring practical PS-alignment will be “hard” in this absolute sense, once you can build APS systems at all (my sense is that our conceptions of what it takes to “solve the alignment problem” might be different), (b) less convinced that practically PS-misaligned systems will be attractive to deploy despite their PS-misalignment (whether because of deception, or for other reasons), (c) less convinced that APS systems becoming possible/incentivized by 2035 implies “fast take-off” (it sounds like you’re partly thinking: those are worlds where something like the scaling hypothesis holds, and so you can just keep scaling up; but I don’t think the scaling hypothesis holding to an extent that makes some APS systems possible/financially feasible implies that you can just scale up quickly to systems that can perform at strongly superhuman levels on e.g. ~any task, whatever the time horizons, data requirements, etc), and (d) more optimistic about something-like-present-day-humanity’s ability to avoid/prevent failures at a scale that disempowers ~all of humanity (though I do think Covid, and its politicization, an instructive example in this respect), especially given warning shots (and my guess is that we do get warning shots both before and after 2035, even if APS systems become possible/financially feasible before then).

Re: nuclear winter, as I understand it, you’re reading me as saying: “in general, if a possible and incentivized technology is dangerous, there will be warning shots of the dangers; humans (perhaps reacting to those warning shots) won’t deploy at a level that risks the permanent extinction/disempowerment of ~all humans; and if they start to move towards such disempowerment/extinction, they’ll take active steps to pull back.” And your argument is: “if you get to less than 10% doom on this basis, you’re going to give too low probabilities on scenarios like nuclear winter in the 20th century.”

I don’t think of myself as leaning heavily on an argument at that level of generality (though maybe there’s a bit of that). For example, that statement feels like it’s missing the “maybe ensuring practical PS-alignment just isn’t that hard, especially relative to building practically PS-misaligned systems that are at least superficially attractive to deploy” element of my own picture. And more generally, I expect to say different things about e.g. biorisk, climate change, nanotech, etc, depending on the specifics, even if generic considerations like “humans will try not to all die” apply to each.

Re: nuclear winter in particular, I’d want to think a bit more about what sort of probability I’d put on nuclear winter in the 20th century (one thing your own analysis skips is the probability that a large nuclear conflict injects enough smoke into the stratosphere to actually cause nuclear winter, which I don’t see as guaranteed, and we’d need to specify what level of cooling counts). And nuclear winter on its own doesn’t include a “scaling to the permanent disempowerment/extinction of ~all of humanity” step, a step that, FWIW, I see as highly questionable in the nuclear winter case, and which is important to my own probability on AI doom (see premise 5). And there are various other salient differences: for example, mutually assured destruction seems like a distinctly dangerous type of dynamic, which doesn’t apply to various AI deployment scenarios; nuclear weapons have widespread destruction as their explicit function, whereas most AI systems won’t; and so on. That said, I think comparisons in this vein could still be helpful; and I’m sympathetic to points in the vein of “looking at the history of e.g. climate, nuclear risk, BSL-4 accidents, etc., the probability that humans will deploy technology that risks global catastrophe, and not stop doing so even after getting significant evidence about the risks at stake, can’t be that low” (I talk about this a bit in 4.4.3 and 6.2).