But even if human takeover is 1⁄10 as bad as AI takeover, the case for working on AI-enabled coups is strong due to its neglectedness.
The logic for this doesn’t check out, I think? If human takeover is 1⁄10 as bad as AI takeover, and human takeover pre-empts AI takeover (because it ends the race and competitiveness dynamics giving rise to most of the risk of AI takeover), then a human takeover might be the best thing that could happen to humanity. This makes the case for working on AI-enabled coups particularly weak.
If by 1/10th as bad we mean “we lose 10% of the value of the future, as opposed to ~100% of the value of the future”, then increasing the marginal probability of human takeover seems great as long as you assign >10% probability to AI takeover[1], which I think most people who have thought a lot about AI risk do.
And you expect the risk to be uncorrelated, i.e. human takeover equally reduces the probability of AI takeover, and no takeover from either AI or small groups of humans
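A minimal sketch of that arithmetic (the helper function and numbers are just illustrative, and it assumes the extra probability of human takeover is drawn from the “AI takeover” and “no takeover” worlds in proportion to their current probabilities):

```python
def value_change(p_ai_takeover, epsilon=0.01, v_none=1.0, v_human=0.9, v_ai=0.0):
    """Change in the expected value of the future if P(human takeover) rises by
    epsilon, with that probability mass drawn proportionally from the other two
    outcomes (ignoring any pre-existing human-takeover risk)."""
    p_none = 1.0 - p_ai_takeover
    from_ai = epsilon * p_ai_takeover    # mass moved out of AI-takeover worlds
    from_none = epsilon * p_none         # mass moved out of no-takeover worlds
    return from_ai * (v_human - v_ai) + from_none * (v_human - v_none)

for p in [0.05, 0.10, 0.20, 0.50]:
    print(p, round(value_change(p), 5))
# Negative for p below 0.10 and positive above it, matching the >10% threshold above.
```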
That’s a good point.
I think I agree that, once an AI-enabled coup has happened, the expected remaining AI takeover risk would be much lower. This is partly because it ends the race within the country where the takeover happened (though it wouldn’t necessarily end the international race), but also partly just because of the evidential update: apparently AI is now capable of taking over countries, and apparently someone could instruct the AIs to do that, and the AIs handed the power right back to that person! Seems like alignment is working.
Related to that evidential update: I would disagree that “human takeover equally reduces the probability of AI takeover, and no takeover from either AI or small groups of humans”. I think it disproportionately reduces the probability of “no takeover from either AI or small groups of humans”. That’s because I think it’s likely that, if a human attempts an AI-enabled coup, they would simultaneously make it very easy for misaligned systems to seize power on their own behalf. (They’ll have to trust the AI to seize power, and they can’t easily use humans to control the AIs at the same time, because most humans are opposed to their agenda.) So if the AIs don’t take over on their own behalf, and instead give the power back to the coup leader, I think that suggests that alignment was going pretty well, and that AI takeover would’ve been pretty unlikely either way.
But here’s something I would agree with: If you think human takeover is only 1/10th as bad as AI takeover, you have to be pretty careful about how coup-preventing interventions affect the probability of AI takeover when analyzing whether they’re overall good. I think this is going to vary a lot between different interventions. (E.g. one thing that could happen is that you get many more alignment audits because the govt insists on them as a way of protecting against human-led coups. That’d be great. But I think other interventions could increase AI takeover risk.)
I think I agree that, once an AI-enabled coup has happened, the expected remaining AI takeover risk would be much lower. This is partly because it ends the race within the country where the takeover happened (though it wouldn’t necessarily end the international race), but also partly just because of the evidential update: apparently AI is now capable of taking over countries, and apparently someone could instruct the AIs to do that, and the AIs handed the power right back to that person! Seems like alignment is working.
I don’t currently agree that the remaining AI takeover risk would be much lower:
The international race seems like a big deal. Ending the domestic race is good, but I’d still expect reckless competition I think. Maybe you’re imagining that a large chunk of powergrabs are motivated by stopping the race? I’m a bit sceptical.
I don’t think the evidential update is that strong. If misaligned AI found it convenient to take over the US using humans, why should we expect them to immediately cease to find humans useful at that point? They might keep using humans as they accumulate more power, up until some later point.
There’s another evidential update which I think is much stronger, which is that the world has completely dropped the ball on an important thing almost no one wants (powergrabs), where there are tractable things they could have done, and some of those things would directly reduce AI takeover risk (infosec, alignment audits, etc.). In a world where coups over the US are possible, I expect we’ve failed to do basic alignment stuff too.
Curious what you think.
The international race seems like a big deal. Ending the domestic race is good, but I’d still expect reckless competition I think.
I was thinking that AI capabilities must already be pretty high by the time an AI-enabled coup is possible. If one country also had a big lead, then probably they would soon have strong enough capabilities to end the international race too. (And the fact that they were willing to coup internally is strong evidence that they’d be willing to do that.)
But if the international race is very tight, that argument doesn’t work.
I don’t think the evidential update is that strong. If misaligned AI found it convenient to take over the US using humans, why should we expect them to immediately cease to find humans useful at that point? They might keep using humans as they accumulate more power, up until some later point.
Yeah, I suppose. I think this gets into definitional issues about what counts as AI takeover and what counts as human takeover.
For example: if, after the coup, the AIs are ~guaranteed to eventually come out on top, and they’re just temporarily using the human leader (who believes themselves to be in charge) because it’s convenient for international politics, does that count as human takeover or AI takeover?
If it counts as “AI takeover”, then my argument would apply. (Saying that “AI takeover” would be much less likely after successful “human takeover”, but also that “human takeover” mostly takes probability mass from worlds where takeover wasn’t going to happen.)
If it counts as “human takeover”, then my argument would not apply, and “AI takeover” would be pretty likely to happen after a temporary “human takeover”.
The practical upshot for how much “human takeover” ultimately reduces the probability of “AI takeover” would be the same.
Thanks very much for this.
The statement you quoted implicitly assumes that work on reducing human takeover won’t affect the probability of AI takeover. And I agree that it might well affect that. And those effects are important to track. We should be very cautious about doing things that reduce human takeover risk but increase AI takeover risk.
But I don’t think reducing human takeover risk typically increases AI takeover risk. First, some points at a high level of abstraction:
If human takeover is possible, then the incentive to race is a lot higher. The rewards of winning are higher: you get a personal DSA (decisive strategic advantage). And the costs of losing are higher: you get completely dominated.
A classic strategy for misaligned AI takeover is “divide and rule”. Misaligned AI offers greedy humans opportunities to increase their own power, increasing its own influence in the process. This is what happened with the Conquistadors, I believe. If there are proper processes preventing illegitimate human power-seeking, this strategy becomes harder for misaligned AI to pursue.
If someone actually tries to stage a human takeover, I think they’ll take actions that massively increase AI risk. Things like: training advanced AI to reason about how to conceal its secret loyalties from everyone else and game all the alignment audits; deploying AI broadly without proper safeguards; getting AIs from one company deployed in the military; executing plans your AI advisor gave you that you don’t fully understand and that haven’t been independently vetted.
Those are pretty high-level points, at the level of abstraction of “actions that reduce human takeover risk”. But it’s much better to evaluate specific mitigations:
Alignment audits reduce both human takeover and AI takeover risk.
Infosecurity against tampering with the weights reduces both risks.
Many things that make it hard for lab insiders to insert secret loyalties also make it hard for misaligned AI (+ humans they manipulate) to pass their specific type of misalignment onto future generations of AI systems. (I think making it hard for misaligned AI to do this could be pretty crucial.)
Guardrails and control measures help with both risks.
Clear rules for what AI should and shouldn’t do in high-stakes situations (gov and military deployments) reduce both risks. They reduce the wiggle room for misaligned AI and humans to use such deployments to seize power.
Transparency about AI capabilities and what large amounts of compute are being used for reduces both risks, I think.
Though there is more uncertainty here. If you really trust one particular lab and expect them to win the race and solve alignment, you might think that transparency will prevent them from saving the world. (Maybe this is the kind of thing you have in mind?) My own view here is that no company/project should be trusted with this.
Making one centralised project would (I’d guess) increase human takeover risk but reduce misaligned AI takeover risk. And I agree that those worried about human takeover should be wary of opposing centralised projects for this reason. Though I also think those worried about AI takeover should be wary of pushing for centralised projects.
Not sure if this addresses your point? It seemed like you might think that most actions that reduce human takeover risk increase AI takeover risk. If so, I’d be interested to hear more about why.