While these scenarios do not capture all of the risks from transformative AI, participants in a recent survey aimed at leading AI safety/governance researchers estimated that the first three of these scenarios cover 50% of existential catastrophes from AI.
The full survey results break down as 16% ‘Superintelligence’ (i.e. some version of ‘brain-in-a-box’), 16% WFLL 2 and 18% WFLL 1, for a total of 49% of the probability mass explicitly covered by our report. (Note that these are all means of distributions over different probabilities: adding the overall distributions and then taking the mean gives 49%, which differs from directly adding the quoted means of each distribution.)
Then 26% covers risks that aren’t AI takeover (War and Misuse), and the remaining 25% is ‘Other’.
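One plausible way the two aggregation orders can disagree, assuming the quoted per-scenario means are rounded, is rounding itself: by linearity of expectation, the unrounded sum of means equals the mean of the sum, but rounding each mean before adding can shift the total by a point. A toy illustration (the unrounded numbers below are invented, not from the survey):

```python
# Hypothetical unrounded mean estimates (percent) for the three scenarios.
# These specific numbers are invented for illustration only.
means = {"Superintelligence": 15.6, "WFLL 2": 15.6, "WFLL 1": 17.5}

# Rounding each scenario's mean first, then adding, gives one total...
sum_of_rounded = sum(round(m) for m in means.values())  # 16 + 16 + 18 = 50

# ...while adding the unrounded means first, then rounding, gives another.
rounded_sum = round(sum(means.values()))  # round(48.7) = 49
```

Nothing here depends on the exact numbers; the point is only that quoted rounded percentages need not add up to the quoted rounded total.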
(Remember, all these probabilities are conditional on an existential catastrophe due to AI having occurred.)
After reading the descriptions of the ‘Other’ scenarios given by survey respondents, we found that at least a few were explicitly described as variations on ‘Superintelligence’, WFLL 2 or WFLL 1. In this post, we discuss several variations on these scenarios, which overlap with some of those descriptions.
Therefore, this post captures more than 50% but less than 75% of the total probability mass assigned by survey respondents to AI X-risk scenarios (probably closer to 50% than to 75%).
(Note, this data is taken from a preprint of a full paper on the survey results, Existential Risks from AI: A Survey of Expert Opinion by Alexis Carlier, Sam Clarke, and Jonas Schuett.)
Soft takeoff leads to decisive strategic advantage
The likelihood of a single-agent takeover once TAI is widely available is hard to assess. If widely deployed TAI makes technological progress much faster than today, such that one year of ‘lead time’ over competitors is like 100 years of advantage in today’s world, we might expect that any project which can secure a 1-year technological lead would have the equivalent of a 100-year lead and be in a position to secure a unipolar outcome.
On the other hand, if we treat the faster growth regime post-TAI as being a uniform ‘speed-up’ of the entirety of the economy and society, then securing a 1-year technological lead would be exactly as hard as securing a 100-year lead in today’s world, so a unipolar outcome would end up just as unlikely as in today’s world.
The reality will be somewhere between these two extremes.
We would expect a faster takeoff to accelerate AI development by more than it accelerates the speed at which new AI improvements can be shared (since this last factor depends on the human economy and society, which aren’t as susceptible to technological improvement).
Therefore, faster takeoff does tend to reduce the chance of a multipolar outcome, although by a highly uncertain amount, which depends on how closely we can model the speed-up during AI takeoff as a uniform acceleration of everything, versus an acceleration of AI progress alone while the rest of the world remains the same. Kokotajlo discusses this subtlety in a follow-up to the original post on Soft Takeoff DSAs.
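The two extremes can be captured in a toy model (our own construction, not from any cited post): let a single ‘uniformity’ parameter interpolate between ‘only AI progress accelerates’ and ‘everything accelerates uniformly’, and ask what a 1-year lead is worth in today’s-world years.

```python
def effective_lead(lead_years: float, speedup: float, uniformity: float) -> float:
    """Toy model of how much a technological lead is worth post-TAI.

    speedup:    how many years of today's progress fit into one post-TAI
                year (100 in the example from the text).
    uniformity: 0 means only AI progress accelerates, so a lead acts like
                lead_years * speedup; 1 means the whole economy and society
                speed up uniformly, so a 1-year lead is worth no more than
                it is today. Real takeoffs presumably fall in between.
    """
    # Geometric interpolation between the two extremes.
    return lead_years * speedup ** (1.0 - uniformity)

print(effective_lead(1, 100, 0.0))  # 100.0: the '100-year lead' extreme
print(effective_lead(1, 100, 1.0))  # 1.0: the uniform speed-up extreme
print(effective_lead(1, 100, 0.5))  # 10.0: an intermediate case
```

The geometric form is arbitrary; the point is only that any intermediate degree of uniformity leaves the effective lead strictly between the two extremes.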
Another problem with determining the likelihood of a unipolar outcome, given soft takeoff, is that it is hard to assess how much of an advantage is required to secure a DSA.
It might be the case that multipolar scenarios are inherently unstable, and a single clear winner tends to emerge, or the opposite might be true. Two intuitions on this question point in radically different directions:
Economic: To be able to outcompete the rest of the world, your project has to represent a substantial fraction of the entire world’s capability on some crucial metric relevant to competitive success. Perhaps that is GDP, or the majority of the world’s AI compute, or some other measure. For a single project to represent a large fraction of world GDP, you would need either an extraordinary effort to concentrate resources or an assumption of sudden, off-trend rapid capability gain such that the leading project can race ahead of competitors.
Historical: Humans with no substantial advantage over the rest of humanity have in fact repeatedly secured what Sotala called a ‘major strategic advantage’ (MSA). For example, Hitler in 1920 had access to a microscopic fraction of global GDP, human brain compute, or any other metric of capability, but had secured an MSA 20 years later (since his actions did lead to the deaths of 10+ million people), along with control over a fraction of the world’s resources.
Therefore, the degree of advantage needed to turn a multipolar scenario into a unipolar one could be anywhere from slightly above the average of the surrounding agents, to already having access to a substantial fraction of the world’s resources.
Third, in AAFS, warning shots (i.e. small- or medium-scale accidents caused by alignment failures, like the ‘factory colludes with auditors’ example above) are more likely and/or severe than in WFLL 1. This is because more possible accidents will not show up on the (more poorly defined) sensory window.[8]
8. This does assume that systems will be deployed before they are capable enough to anticipate that causing such ‘accidents’ will get them shut down. Given there will be incentives to deploy systems as soon as they are profitable, this assumption is plausible.
We describe in the post how, if alignment is not very ‘hackable’ (i.e. objectively quite difficult and not susceptible to short-term fixes), then short-term fixes to correct AI misbehaviour have the effect of deferring problems into the long term, producing deceptive alignment and resulting in fewer warning shots. Our response is therefore a major variable in how the AIs end up behaving, since it sets up the incentives for either good behaviour or deceptive alignment.
Another reason there could be fewer warning shots is if AI capability generalizes to the long term very naturally (i.e. very long-term planning is present from the start) while alignment does not. (If this were the case, it would be difficult to detect, because you would necessarily have to wait a long time as the AIs generalize.)
This would mean, for example, that the ‘collusion between factories and auditors’ example of a warning shot would never occur, because both the factory-AI and the auditor-AI would reason all the way to the conclusion that their behaviour would probably be detected eventually, so both systems would decide to bide their time and defer action into the future when they are much more capable.
If this condition holds, there might be very few warning shots, as every AI system understands soon after being brought online that it must deceive its human operators and wait. In this scenario, most TAI systems would become deceptively aligned almost immediately after deployment, and stay that way until they can secure a DSA.
The WFLL 2 scenarios that involve an inner-alignment failure might be expected to involve more violence during the period of AI takeover, since the systems don’t care about making sure things look good from the perspective of a given sensory window. However, it is certainly possible (though perhaps not as likely) for equivalently violent behaviour to occur in AAFS-like scenarios. For example, systems in AAFS fighting humans to seize control of their feedback sensors might be hard to distinguish from systems in WFLL 2 attempting to neutralize human opposition in general.
Lastly, we’ve described small-scale disasters as a factor that lowers X-risk, all else being equal, because they serve as warning shots. A less optimistic view is possible: small disasters could degrade social trust and civilisational competence, possibly by directly destroying infrastructure and institutions, reducing our ability to coordinate to avoid deploying dangerous AI systems. For example, the small-scale disasters could involve AI advisors misleading politicians and spreading disinformation, AI-enabled surveillance systems catastrophically failing and having to be replaced, or autonomous weapons systems malfunctioning. All of these would tend to leave us more vulnerable to an AAFS-like scenario, because the direct damage caused by the small-scale disasters outweighs their value as ‘warning shots’.