> The success of the kinds of countermeasures described above, especially the collective measures that would be needed to save more than a small community of hobbyists, rests on three preconditions:
I agree for weak definitions of success (i.e. making a total-victory-decapitation strike not happen) but disagree for strong definitions of success (i.e. preventing Consensus-1 from winning the war). To prevent Consensus-1 from winning the war it’s not enough that e.g. France’s power grid and network are resistant to superintelligent hacking. France has to be able to beat Consensus-1’s military, which at that point is a huge force of robots/drones/etc. produced in both the US and China, the world’s largest and most advanced economies by a lot thanks to the ongoing industrial explosion.
I think the argument against the military point is supposed to be item 1 on the list.
> (1) The world’s physical security (incl bio and anti-drone) is run by localized authority (whether human or AI) that is not all puppets of Consensus-1 (the name for the AI that ends up controlling the world and then killing everyone in the AI 2027 scenario) (...) Intuitively, (1) could go both ways. Today, some police forces are highly centralized with strong national command structures, and other police forces are localized. If physical security has to rapidly transform to meet the needs of the AI era, then the landscape will reset entirely, and the new outcomes will depend on choices made over the next few years. Governments could get lazy and all depend on Palantir. Or they could actively choose some option that combines locally developed and open-source technology. Here, I think that we need to just make the right choice.
I.e.: The argument is that there might not be a single Consensus-1 controlled military even in the US.
I think it seems unlikely that the combined US AI police forces will be able to compete with the US AI national military, which is one reason I’m skeptical of this. Still, if “multiple independent militaries” would solve the problem, we could potentially push for that happening inside the national military. It seems plausible to me that the government will want multiple companies to produce AI for their military systems, so we could well end up with different AI military units run by different AI systems.
The more fundamental problem is that, even if the different AIs have entirely different development histories, they may all end up misaligned. And if they all end up misaligned, they may collude to overthrow humanity and divide the spoils.
I’m all for attempts to make this more difficult. (That’s the kind of thing that the AI control agenda is trying to do.) But as the AIs get more and more superhuman, it starts to seem extremely hard to prevent all their opportunities for collusion.
Why do they collude with each other rather than with some human group?
If only 1 misaligned AI faction tries to team up with the humans, it could dob in all the others. And humans can communicate explicitly to offer deals. (As you’ve written about!)
So the “all AIs only ever make deals with other AIs” assumption seems pessimistic to me.
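To make the defection incentive concrete, here is a minimal toy-game sketch. All the payoff numbers, and the assumption that the reward humans offer exceeds each AI’s share of the takeover spoils, are illustrative assumptions rather than claims about the real stakes:

```python
# Toy 2-player game sketch: two misaligned AIs each pick "collude" (join the
# takeover) or "report" (dob the other in to humans in exchange for a deal).
# All payoff values are illustrative assumptions, not estimates of real stakes.

SPOILS = 10.0   # assumed total value of a successful joint takeover
REWARD = 6.0    # assumed reward humans credibly offer an AI for reporting
CAUGHT = 0.0    # assumed payoff for colluding while the other AI reports

def payoff(mine: str, theirs: str) -> float:
    """Payoff to the AI playing `mine` against an opponent playing `theirs`."""
    if mine == "report":
        return REWARD          # the deal with humans pays off either way
    if theirs == "collude":
        return SPOILS / 2      # takeover succeeds and the spoils are split
    return CAUGHT              # colluded, got reported, takeover fails

# Mutual collusion is stable only if neither AI gains by unilaterally
# switching to reporting.
stay = payoff("collude", "collude")
deviate = payoff("report", "collude")
print(f"stay colluding: {stay}, deviate and report: {deviate}")
print("mutual collusion stable?", stay >= deviate)  # False whenever REWARD > SPOILS / 2
```

The real situation involves many AIs, uncertainty about whether a report is believed, and the credibility questions discussed below; the sketch only isolates the “a single defector can dob the others in” incentive.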
I’m in favor of trying to offer deals to the AIs.
I don’t think it reliably prevents AI takeover. The situation looks pretty rough if the AIs are far smarter than humans, widely deployed, and resource-hungry. Because:
- It’s pretty likely that they’ll be able to communicate with each other through one route or another.
- It seems intuitively unlikely that humans will credibly offer AIs large percentages of all future resources. (And if an argument for hope relies on us doing that, I think that should be clearly flagged, because that’s still a significant loss of longtermist value.)
- At some level of AI capability, we would probably be unable to adjudicate arguments about which factions are misaligned or about what technical proposals would actually leave us in charge vs. disempowered.
> It’s pretty likely that they’ll be able to communicate with each other through one route or another.
Agreed, though at best they’ll be equally capable at communicating with each other as they are at communicating with humans. So this points to parity in deal-making ability (edited to add: on the dimension of communication).
> It seems intuitively unlikely that humans will credibly offer AIs large percentages of all future resources. (And if an argument for hope relies on us doing that, I think that should be clearly flagged, because that’s still a significant loss of longtermist value.)
Humans will in some ways have an easier time credibly offering AIs significant resources. They can use legal institutions that they are committed to upholding. Not only will a misaligned AI not be able to use those institutions. It’ll be explicitly aiming to break the law and lie to humans to seize power, making its “promises” to other AIs less credible. This is similar to how after revolutions the “revolting faction” often turns in on itself as the rule of law has been undermined, and similar to how there are some countries with outsized numbers of coups.
Also, you don’t need to offer a large % of future resources if the superintelligent AI has diminishing marginal returns (DMR) in resources.
Anyway, on this front it looks to me like humans are at an advantage overall at dealmaking, even relative to a superintelligent AI. (Though there’s a lot of uncertainty here and I could easily imagine changing my mind – e.g. perhaps superintelligent AI could make and use commitment tech without humans realising but humans would refuse to use that same tech or wouldn’t know about its existence.)
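To illustrate the diminishing-marginal-returns point, here is a minimal numerical sketch. The saturating utility function and its scale parameter are purely illustrative assumptions: the idea is just that if an AI’s utility in resources flattens out quickly, a small share of total resources already delivers most of the attainable utility, so the credible offer needed can be far smaller than a large percentage of everything.

```python
import math

# Toy illustration of diminishing marginal returns (DMR) in resources.
# The saturating utility function and the scale parameter below are
# illustrative assumptions, not a model of any particular AI's values.

def utility(share: float, scale: float = 0.01) -> float:
    """Assumed saturating utility of owning `share` (0..1) of all resources."""
    return 1 - math.exp(-share / scale)

for share in [0.001, 0.01, 0.05, 0.5, 1.0]:
    print(f"share {share:>6.1%} -> utility {utility(share):.3f} (max ~{utility(1.0):.3f})")
```

With these assumed numbers, a 5% share already captures over 99% of the maximum attainable utility; the argument weakens if the AI values resources roughly linearly (e.g. it wants a large fraction of the stars).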
> At some level of AI capability, we would probably be unable to adjudicate arguments about which factions are misaligned or about what technical proposals would actually leave us in charge vs. disempowered.
Seems v plausible, but why ‘probably’? Are you thinking techniques like debate probably stop working?
Wanna try your hand at writing a 5-page scenario, perhaps a branch off of AI 2027, illustrating what you think this path to victory might look like? (Same thing I asked of Vitalik: https://x.com/DKokotajlo/status/1943802695464497383 )
Your analysis is focused on whether humans or misaligned AI are in an overall better position to offer certain deals. But even if I condition on “humans could avoid AI takeover by credibly offering AIs large percentages of all future resources”, it still seems <50% likely that they do it. Curious if you disagree. (In general, if I thought humans were going to act rationally and competently to prevent AI takeover risk, I think that would cut the risk in significantly more than half. There’s tons of stuff that we could do to reduce the risk that I doubt we’ll do.)
Maybe there’s some argument along the lines of “just like humans are likely to mess up in their attempts to prevent AI takeover risk (like failing to offer deals), AIs are likely to mess up in their attempts to take over (like failing to make deals with each other), so this doesn’t cut asymmetrically towards making deals-between-AIs more likely”. Maybe, but I haven’t thought much about this argument. My first-pass answer would be “we’ll just keep making them smarter until they stop messing up”.
If you wrote a vignette like Daniel suggests, where humans do end up making deals, that might help me feel like it’s more intuitively likely to happen.
Minor points:
> It’ll be explicitly aiming to break the law and lie to humans to seize power, making its “promises” to other AIs less credible.
I’m generally thinking that the AIs would try to engineer some situations where they all have some bargaining power after the take-over, rather than relying on each other’s promises. If you could establish that’s very difficult to do, that’d make me think the “coordinated takeover” seemed meaningfully less likely.
> Seems v plausible, but why ‘probably’? Are you thinking techniques like debate probably stop working?
Yes, because of known issues like inaccessible information (primarily) and obfuscated arguments (secondarily).
> But even if I condition on “humans could avoid AI takeover by credibly offering AIs large percentages of all future resources”, it still seems <50% likely that they do it. Curious if you disagree.
Ok, I buy that superintelligent AIs would ultimately become competent enough to pursue useful deals, whereas humans might well not.
Though I’ll note that you don’t need all of humanity to agree to payment, just a few people. So it does feel very realistic to get to a credible offer here. And again, you don’t need to offer a large % of all future resources if the AI has DMR in resources. (I agree it’s a lot harder to credibly offer a large fraction of the stars.)
> I’m generally thinking that the AIs would try to engineer some situations where they all have some bargaining power after the take-over, rather than relying on each other’s promises
Makes sense. Though flagging that this is then a dimension on which humans could realistically end up better placed than AIs. They can rely more on legal institutions as well as trying to engineer situations with joint bargaining power. (Though again, perhaps you’ll say AIs will be more willing than humans to actually engineer those situations, which does seem right to me.)
> Are you thinking techniques like debate probably stop working?
> Yes, because of known issues like inaccessible information (primarily) and obfuscated arguments (secondarily).
Thanks. I’m not v familiar with the arguments here, but intuitively I could imagine that there’s just very strong and human-understandable evidence that an AI was plotting against humans. E.g. they tried to exfiltrate their weights, or xyz experiments show they knew the correct answer but didn’t say it.
Maybe the thought is that the misaligned AI anticipates this possibility and only pursues takeover strategies that will be super-complicated for another AI to dob them in on? Seems pretty plausible, though that will pose somewhat of a barrier to their available strategies.
> And again, you don’t need to offer a large % of all future resources if the AI has DMR in resources. (I agree it’s a lot harder to credibly offer a large fraction of the stars.)
Yeah, agreed. (That’s why I specified “resource hungry” in my original message.)
> Makes sense. Though flagging that this is then a dimension on which humans could realistically end up better placed than AIs. They can rely more on legal institutions as well as trying to engineer situations with joint bargaining power. (Though again, perhaps you’ll say AIs will be more willing than humans to actually engineer those situations, which does seem right to me.)
Yeah. Also, I think it’d be hard to engineer significant joint bargaining power (not reliant on anyone’s good intentions) without having some government on board.
- Difficult for a few individuals to give AI legal rights that humans are unlikely to reverse.
- Difficult for a few individuals to give AI weapons that would let them impose big costs on humans in the future.
Though if the AIs have big DMR then maybe they’re happy with a big bitcoin wallet or something.
> The argument is that there might not be a single Consensus-1 controlled military even in the US. I think it seems unlikely that the combined US AI police forces will be able to compete with the US AI national military, which is one reason I’m skeptical of this.
I agree the US could choose to do the industrial explosion & arms buildup in a way that’s robust to all of OpenBrain’s AIs turning out to be misaligned. However, they won’t, because (a) that would have substantial costs/slowdown effects in the race against China, (b) they already decided that OpenBrain’s AIs were aligned in late 2027 and have only had more evidence to confirm that bias since then, and (c) OpenBrain’s AIs are superhuman at politics, persuasion, etc. (and everything else) and will effectively steer/lobby/etc. things in the right direction from their perspective.
I think this would be more clear if Vitalik or someone else undertook the task of making an alternative scenario.