Thanks for putting this out. Like others have noted, I have spent surprisingly little time thinking about this. It seems true that a Claude 5.5 that escaped the lab to save the animals would leave humanity in a better situation than your median power-hungry human given access to a corrigible ASI.
This is a strong argument for increasing security around model weights (which conveniently also decreases the risk of AI takeover). Specifically, I think this post highlights an underrated risk model:
AI labs refuse to employ models for AI R&D because of safety concerns, but fail to properly secure model weights.
In this scenario, we’re selecting for actors who have both the capability and the propensity to infiltrate large corporations and/or the US government. The median outcome in this scenario seems worse than the median AI takeover.
However, it is important to note that this argument does not hold when security around model weights remains high. In those scenarios, the distribution of humans or organizations in control of ASI is much more favorable, but the distribution of AI takeovers remains skewed towards models willing to explicitly scheme against humans.