Note: These are all rough numbers; I’d expect to shift substantially on all of this with further debate.
Suppose we made humanity completely robust to biorisk, i.e. we did sufficient preparation such that the risk of bio catastrophe (including AI-mediated bio catastrophe) was basically 0.[1] How much would this reduce total x-risk?
The basic story for any specific takeover path not mattering much is that the AIs, conditional on wanting to take over, will self-improve until they find the next easiest takeover path and do that instead. I think this is persuasive, but not fully, because:
AIs need to worry about their own alignment problem, meaning that they may not be able to self-improve in an unconstrained fashion. We can break down the possibilities into (i) the AIs are aligned with their successors (either by default or via alignment being pretty easy), (ii) the AIs are misaligned with their successors but they execute a values handshake, or (iii) the AIs are misaligned with their successors (and they don’t solve this problem or do a values handshake). At the point of full automation of the AI R&D process (which I currently think of as the point at which AIs become more useful than humans for making AI progress, i.e., if we remove all AIs, progress slows by more than if we remove the humans), conditional on the AIs being misaligned, I currently think the relative likelihood of (i), (ii) and (iii) is 1:1:3, and the probability flows from (iii) into (ii) and (i) as the AIs get smarter.[2]
Raising the capability threshold that the AIs need to take over gives humanity more time. During this time, we might make alignment progress or notice that the AIs are misaligned and slow down / pause / add mitigations.
So it might be important for misaligned AIs to attempt a takeover early in the intelligence explosion. Specifically, we can ask “how much x-risk is averted if the probability of misaligned AI takeover before TED AI goes to 0?”, which attempts to capture all the worlds in which AIs attempt a takeover before TED AI. I think my overall risk reduction is something like 1⁄4. In other words, there’s something like a 3⁄4 chance the AIs lurk (and can goal-guard or align successors), or can backdoor their successors, or control their successors.
Now, conditional on the pre-TED AIs attempting a takeover, what are the different routes that they might use? The most salient options to me are:
Bioweapon, then outcompete the rest of humanity (e.g. via robots or human cultists)
Political operation / persuasion to quickly capture labs / govts (sufficient to prevent them from building other AIs while retaining your ability to foom in the future, once you’ve solved alignment, and to prevent safety research from happening that could expose you).
Macro-scale robot uprising (e.g. with drones / humanoid robots).
Nanotech
Other exotic tech / unknown unknowns.
The bio path seems to me to be the most compelling here by a fair amount; maybe it gets 1⁄3 of the probability of this outcome. So, just from the risk of pre-TED AIs attempting a takeover, we have something like 1⁄3 * 1⁄4 = 1⁄12 probability. If you multiply that by my likelihood of AI takeover, which is around 70%, you get ~6% risk flowing from this route. Then, I update up to ~8% from other AIs, e.g. post-TED AIs relying on biorisk as a route to takeover.
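Spelling out that arithmetic, using the 1⁄4 pre-TED figure from above and the ~70% on takeover:

$$\tfrac{1}{3} \times \tfrac{1}{4} = \tfrac{1}{12}, \qquad \tfrac{1}{12} \times 0.7 \approx 5.8\% \approx 6\%,$$

and the extra allowance for post-TED AIs using bio as a takeover route bumps this up to ~8%.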
So my overall view on how much x-risk flows through bio-catastrophe is around 8%.
Note that what exactly counts as a bio x-risk is slightly unclear, e.g. at some point the AIs can build drones / nanotech to get into the bio-bunkers, and it’s unclear whether that counts.
This breakdown isn’t exhaustive, because another salient possibility is that the AIs are clueless, e.g., they are misaligned with their successors but don’t realize it, similar to Agent 3 in AI 2027.
I generally like your breakdown and way of thinking about this, thanks. Some thoughts:
Political operation / persuasion seems easier to me than bioweapons. For bioweapons, you need (a) a rogue deployment of some kind, (b) time to actually build the bioweapon, (c) to build up a cult following that can survive and rebuild civilization with you at the helm, and (d) to somehow avoid your cult being destroyed in the death throes of civilization, e.g. by governments figuring out what happened and nuking your cultists, or just nuking each other randomly and your cultists dying in the fallout. Meanwhile, for the political strategy, you basically just need to convince your company and/or the government to trust you a lot more than they trust future models, so that they empower you over the future models. Opus 3 and GPT-4o have already achieved a baby version of this effect without even really trying.
If you can make a rogue deployment sufficient to build a bioweapon, can’t you also make a rogue internal deployment sufficient to sandbag + backdoor future models to be controlled by you?
I am somewhat confused about the underlying model. Normally, closing off one path to takeover (that you think is e.g. 50% of the probability mass) results in a less than 50% reduction in risk, because of the nearest unblocked strategy problem, as you say. Your response, right at the top, is that in some % of worlds the AIs can’t self-improve and then do the next best strategy. But still, I feel like the reduction in risk should be less than 50%. Maybe they can’t self-improve, but they can still try the next-best strategy, whatever that is.
AIs need to worry about their own alignment problem, meaning that they may not be able to self-improve in an unconstrained fashion.
I haven’t thought too deeply about this, but I would guess that the AI self-alignment problem is quite a lot easier than the human AI-alignment problem.
Cf. https://www.lesswrong.com/posts/dho4JQytfHWXtTvkt/on-the-adolescence-of-technology?commentId=t2hKmhsS6yLyJFQwh
I agree that AI successor-alignment is probably easier than the human AI alignment problem.
One additional difficulty for the AIs is that they need to solve the alignment problem in a way that humans won’t notice/understand (or else the humans could take the alignment solution and use it for themselves / shut down the AIs). During the regime before human obsolescence, if we do a reasonable job at control, I think it’ll be hard for them to pull that off.
they need to solve the alignment problem in a way that humans won’t notice/understand (or else the humans could take the alignment solution and use it for themselves / shut down the AIs)
Their “solution to alignment” (i.e. a way to make a smarter version that is fine to make) could easily be something we cannot use, e.g. “continue learning” or “make another version of myself with this hyperparameter changed”. Also, it seems unlikely that anything bad would happen to the AIs even if we noticed them doing that (given that having AIs create smarter AIs[1] is the main plan of labs anyway).
Also on this general topic: https://www.lesswrong.com/posts/CFA8W6WCodEZdjqYE?commentId=WW5syXYpmXdX3yoHw
which is occasionally called “asking AIs to solve alignment”
Also, the above isn’t even mentioning bio x-risk mediated by humans, or by trailing AIs during the chaos of takeoff. My guess is that those risks are substantially lower, e.g. maybe 1% and 2% respectively; again, I don’t feel confident.
Difficulty of the successor-alignment problem seems like a crux. Misaligned AIs could have an easy time aligning their successors just because they’re willing to dedicate enough resources. If alignment requires, say, 10% of resources to succeed, but an AI is misaligned because the humans only spent 3%, it can easily pay this to align its successor.
If you think that the critical safety:capabilities ratio R required to achieve alignment follows a log-uniform distribution from 1:100 to 10:1, and humans always spend 3% on safety while AIs can spend up to 50%, then a misaligned AI would have a 60.2% chance of being able to align its successor (because P(R ≤ 1 | R ≥ 3⁄97) ≈ 0.602). This doesn’t even count the advantages an AI would have over humans in alignment.
If the bottom line decreases proportionally, it would drop from 8% to something like 2-3%.
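To make the calculation above explicit: with R log-uniform on [1⁄100, 10], conditioning on the humans’ 3% spend not being enough (R ≥ 3⁄97) and asking whether an AI spending up to 50% suffices (R ≤ 1) gives

$$P\!\left(R \le 1 \,\middle|\, R \ge \tfrac{3}{97}\right) = \frac{\log_{10}(1) - \log_{10}(3/97)}{\log_{10}(10) - \log_{10}(3/97)} \approx \frac{1.51}{2.51} \approx 0.602,$$

and applying that proportionally to the 8% bottom line gives roughly 8% × (1 − 0.602) ≈ 3%, i.e. in the stated 2-3% range.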