Thanks very much for this.
The statement you quoted implicitly assumes that work on reducing human takeover risk won't affect the probability of AI takeover. I agree that it might well affect it, and those effects are important to track. We should be very cautious about doing things that reduce human takeover risk but increase AI takeover risk.
But I don't think reducing human takeover risk typically increases AI takeover risk. First, some points at a high level of abstraction:
If human takeover is possible, then the incentive to race is a lot higher. The rewards of winning are higher—you get a personal DSA (decisive strategic advantage). And the costs of losing are higher—you get completely dominated.
A classic strategy for misaligned AI takeover is "divide and rule": the misaligned AI offers greedy humans opportunities to increase their own power, increasing its own influence in the process. This is what happened with the Conquistadors, I believe. If there are proper processes preventing illegitimate human power-seeking, this strategy becomes harder for misaligned AI to pursue.
If someone actually tries to stage a human takeover, I think they'll take actions that massively increase AI risk. Things like: training advanced AI to reason about how to conceal its secret loyalties from everyone else and game all the alignment audits; deploying AI broadly without proper safeguards; getting AIs from one company deployed in the military; executing plans your AI advisor gave you that you don't fully understand and that haven't been independently vetted.
Those are pretty high-level points, at the level of abstraction of “actions that reduce human takeover risk”. But it’s much better to evaluate specific mitigations:
Alignment audits reduce both human takeover risk and AI takeover risk.
Infosecurity against tampering with the weights reduces both risks.
Many things that make it hard for lab insiders to insert secret loyalties also make it hard for misaligned AI (plus any humans it manipulates) to pass its specific type of misalignment on to future generations of models. (I think making it hard for misaligned AI to do this could be pretty crucial.)
Guardrails and control measures help with both risks.
Clear rules for what AI should and shouldn't do in high-stakes situations (government and military deployments) reduce both risks. They reduce the wiggle room for misaligned AI and for humans to use such deployments to seize power.
Transparency about AI capabilities, and about what large amounts of compute are being used for, reduces both risks, I think.
Though there is more uncertainty here. If you really trust one particular lab and expect them to win the race and solve alignment, you might think that transparency will prevent them from saving the world. (Maybe this is the kind of thing you have in mind?) My own view here is that no company/project should be trusted with this.
Making one centralised project would (I'd guess) increase human takeover risk but reduce misaligned AI takeover risk. And I agree that those worried about human takeover should be wary of opposing centralised projects for this reason. Though I also think those worried about AI takeover should be wary of pushing for centralised projects.
Not sure if this addresses your point? It seemed like you might think that most actions that reduce human takeover risk increase AI takeover risk—if so, I'd be interested to hear more about why.