The crux is whether agents actually capable of a quick takeover are compute-bound enough that the threat is essentially unipolar (i.e., only able to live in a handful of datacenters, in the hands of a few corporate actors or nation-states), and thus somewhat containable. This is how we get “Toddler Shoggoth in a prison cell”. This ties into beliefs about how agent capabilities will scale, which is why it’s my crux.
(Although this raises the question of why a sufficiently powerful unipolar agent wouldn’t immediately attempt takeover anyway. The answer is one of two things: (1) A rational agent will be highly risk-averse toward any action that might cause blowback resulting in curtailment or shutdown, and thus must be 100% certain a takeover attempt will succeed. Efforts to obtain that certainty (e.g., extensive pentesting and planning) are themselves detection risks. Human persuasion is therefore a tactic that cheaply mitigates the blowback risk of more overt takeover attempts. (2) Or, less likely, our OpSec is good enough to contain the agent, making human persuasion the only viable path forward.)
FWIW, I don’t believe agents are currently capable of a takeover that wouldn’t also risk detection and a coordinated human response / a change in political attitudes toward AI, making the payoff matrix lousy enough that agents wouldn’t try it unless specifically directed to. On the other hand, if an agent can shape the human environment to be favorable to takeover and unfavorable to human vigilance and control, it neutralizes the threat of changing attitudes rather cheaply. Willing to be convinced otherwise.
Unipolarity is about the characteristic time to takeover versus the time to emergence of worthy rivals. Currently, multiple AI companies are robustly within months of each other in capabilities. So an AI can only be in a unipolar situation if it can disarm the other AI companies before they get similarly capable AIs, i.e., within months. Superpersuasion might be too slow for that on its own (unless it also manages to manipulate the relevant governments), though it could be a step in a larger plan that escalates to something else.
I think superpersuasion (even in milder senses) would in principle be sufficient for takeover on its own given enough time, because it could steer the world down a gradual-disempowerment path. Since there isn’t enough time, there needs to be a second step that enables a faster takeover to preserve unipolarity, and superpersuasion would still be helpful in getting the AI’s creator company to play along with that second step. But the issue with many candidates for this second step is that the AI doesn’t necessarily have the option of recursive self-improvement to advance its own capabilities, because it might be unable to quickly develop smarter AIs that are aligned with it.
Slight disagreement on the definition of unipolarity: unipolarity can be stable if we are stuck with a sucky scaling law. Suppose task horizon length becomes exponential in compute. Then, economically speaking, only one actor will be able to create the best possible agent; other actors will run out of money before they can amass enough compute to rival it.
If the compute required to clear the takeover capability threshold lies somewhere between that agent’s compute and, say, the second-largest datacenter’s, then we have a unipolar world for an extended period of time.
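To make the scaling-law argument concrete, here is a toy numerical sketch. Every constant and compute figure in it is made up for illustration; the only point it demonstrates is the structural claim above: if capability grows exponentially with compute and the takeover threshold happens to sit between the top two actors, only the leader clears it.

```python
# Toy model of the scaling-law argument (all numbers illustrative, not empirical).
# Assumed law: task-horizon capability grows exponentially with compute,
#     horizon(C) = exp(k * C)
# so a modest compute lead becomes a large capability gap.

import math

K = 0.5  # assumed scaling constant (arbitrary choice)

def horizon(compute: float) -> float:
    """Task-horizon capability as a function of compute under the toy law."""
    return math.exp(K * compute)

leader_compute = 10.0  # largest datacenter, in arbitrary compute units
rival_compute = 8.0    # second-largest datacenter

# Assume the takeover threshold falls between the two actors' compute levels.
takeover_threshold = horizon(9.0)

# Unipolar if the leader clears the threshold and the runner-up does not.
unipolar = horizon(leader_compute) >= takeover_threshold > horizon(rival_compute)
print(unipolar)  # True: only the leader clears the takeover threshold
```

Note the leverage in this toy setup: the rival trails by only 20% in compute but by a factor of e (roughly 2.7x) in horizon, which is the sense in which catching up gets economically prohibitive.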