This is a late comment, but extremely impressive work!
I’m a huge fan of explicit, well-argued threat model work, and it’s even more impressive that you have already made great contributions to mitigation measures. Frankly, this threat model seems to me more likely to become existential, and possibly at lower AI capability levels, than either the Yudkowsky/Bostrom scenarios or the Christiano/gradual displacement ones. So it seems hugely important!
A question: am I right that most of your analysis presumes that there would be a fair amount of oversight, or at least oversight attempts? If so, I’d be afraid that the actual situation might be heavy deployment of agents without many oversight attempts at all (given both labs’ and governments’ safety track record so far). In such a scenario:
How likely do you think collusion attempts aiming for takeover would be?
Could you estimate what kind of capabilities would be needed for a multi-agent takeover?
Would you expect some kind of warning shot before a successful multi-agent takeover or not?