I realized that, but I think my counterarguments are true for most organizations that would have a realistic chance of building takeover-level AI. As a case in point, Anthropic launched the Glasswing project to fix vulnerabilities rather than saying “great, now we can hack into banks for our takeover attempt planned in Q3 2027, let’s not tell anyone about these zero-days”.
In your scenario, DARPA would probably not try to take over using AI. It’s not their culture. When they realized their AI had achieved a DSA, they’d likely voluntarily hand control over to the US government.
Also, DARPA, or any other AI project, does not operate in a vacuum. People in government would likely realize DARPA is on its way to a DSA and intervene to take control before it happens.
We already have a situation where a democracy holds a powerful weapon, namely the atomic bomb. I don’t really see how this will necessarily be different. Oppenheimer didn’t take over either.
As a non-American, however, I’m worried that this will decrease power in non-American territories even more (higher permanence and granularity of US sovereignty). Remember that >95% of the world population is not American. All these people would have no say over whatever happens afterwards.
Xrisk counterargument: intelligence needs society to become powerful. Society will only lend itself to society-aligned intelligence.
This is less relevant for lab or govt takeover scenarios where there are some humans cooperating. If the humans are very bad and the takeover permanent, that’s existential too. Most humans are probably not that bad though.
don’t think like us
Just want to flag that not everyone on LessWrong is libertarian or right-wing. Left xriskers are a minority, but we exist.
the appetite for conditional risk regulation has been substantially less than the appetite for direct risk regulation
Where do you see the latter appetite?
We campaigned a bit for a conditional treaty. We’d happily sign up for an unconditional pause though. Problem is: there is no appetite for either, right?
I agree that the manpower spent on evals should have been spent on other things with a better theory of change. Eval quality, imo, is not a crux for regulation; awareness and political support are. I think the money that went to evals should have gone to raising awareness and lobbying.
Honestly, why is there still no significant funding for awareness-raising projects? It’s so easy: just ask for the number of views/copies and conversion rates, measured via e.g. Prolific surveys, and fund the most effective projects. A fund like this can easily absorb millions. I think this might actually get regulation off the ground.
UN control. A Baruch plan for AGI.
Good question. First, many of these benchmarks are about things that are dangerous, but not particularly economically valuable (example: bioweapons). My model of labs is that they’re mostly trying to do economically valuable things (as most companies are forced to). Although they may be reckless, I don’t currently have the impression they’re actively trying to seize power using things such as bioweapons. Second, some benchmarks are about things that are economically valuable, for example the METR one. But these mostly get benchmaxxed already. Third, we are not creating new benchmarks, but only tracking the scores for existing ones and coupling them to existential threat models. If labs wanted to benchmax any benchmark we track, they could do so with or without our work.
In addition: this risk needs to be weighed against the positive effect of improving the information position of researchers, policymakers, and the public. How much this matters depends on how well we do, but also on your theory of change and to what extent informing these groups is part of it. In my case, I strongly believe in awareness as a key path towards solving the problem. I think this is true for researcher, policymaker, and public awareness of the right threat model. I’m particularly excited about this graph, showing that public xrisk awareness has already increased from about 8% to 24%. If our combined efforts could increase this to a tipping point, I’d be mostly optimistic that we can implement and enforce a global AI safety treaty (such as we proposed in TIME and SCMP) and that this will reduce xrisk significantly. For these reasons, my bet is that the result is positive. But it’s not obvious, so again, good question.
In addition: I think benchmarks that are mainly about something economically relevant, or, god forbid, scientifically relevant, are way more likely to get benchmaxxed and lead straight to a takeover too, while not really having a strong case for reducing xrisk. Such benchmarks are routinely created and funded by xrisky orgs.
We appreciate your nitpicks! I’ve added issues on GitHub.
Agree. Guess granularity will be a function of AI power.
Releasing TakeOverBench.com: a benchmark for AI takeover
I see the loss of control argument. Regarding the argument of being dominated: to what extent is being dominated by a superpower with ASI different from being dominated by a superpower with nuclear weapons, conventional military dominance, and economic dominance, which is the current situation for many middle powers? I can imagine that post-ASI, control granularity might be higher, and permanence might be higher. How important will these differences be?
Interesting, yeah, I tend to agree. Doesn’t really change the argument though, right? One could make the same argument for an AI that’s further out in space and mobilizes sufficient resources to create a DSA.
Thank you for writing the post, interesting to think about.
Suppose an AI has a perfect world model, but no “I”, that is, no indexical information. Then a bad actor comes along and asks the AI “please take over the world for me”. With its guardrails removed (which is routinely done for open-source models), the AI complies.
Its takeover actions will look exactly like those of a rogue AI. The only difference is that the rogue part doesn’t stem from the AI itself, but from the bad actor. For everyone except the bad actor, though, the result looks exactly the same. The AI, using its perfect world model and other dangerous capabilities, takes over the world and, if the bad actor chooses so, kills everyone.
This is fairly close to my central threat model. I don’t care much whether the adverse action comes from a self-aware AI or a bad actor; I care about the world being taken over. For this threat model, I would have to conclude that removing indexical information from a model does not make it much safer. In addition, someone, somewhere, will probably add back the indexical information that was carefully removed.
I think this is philosophically interesting, but as long as open-source models keep being released, we should assume maximally adversarial ones, and focus mostly on regulation (hardware control) to reduce takeover risk.
This same argument, imo, applies to other alignment work, including mechinterp, and to control work.
A positive offense-defense balance for takeover threat models could persuade me to think otherwise (I currently think it’s probably negative).
I agree with these two points, but I doubt either will have significant impact.
Human extinction is seen as extremely unlikely, almost absurd. While there are obviously many other public concerns, other significant concerns about human extinction are very rare.
We also asked for people’s extinction probability, but then many would just give a number (sometimes high), even if they didn’t see AI as an existential risk at all. Still, trends in both methodologies were usually similar.
I’m open to better methodologies, but I think this is a fair way of assessing public xrisk awareness, and a better way than asking explicit probabilities.
24% of the US public is now aware of AI xrisk
I don’t think your first point is obvious. We’ve had super smart humans (e.g. with IQ >200) and they haven’t been able to take over the world. (Although they didn’t have many of the advantages an AI might have, such as mass copying themselves over the internet.)
In general, the power(intelligence) curve is a big crux for me that we can’t fill in with data points yet (of course intelligence is also spiky). Imo we also have no idea where takeover-level intelligence lies, what shape of intelligence enables takeover, or what maximum AI would look like.
What do you mean by soft limit?
I don’t think it makes sense to be confidently optimistic about this (the offense-defense balance) given the current state of research. I looked into this topic some time ago with Sammy Martin. I think hardly anyone in the research community has a plan for how the blue team would actually stop the red team. Particularly worrying is that in several domains offense looks to have the advantage (e.g. bioweapons, cybersecurity), and that defense would need to play by the rules, hugely hindering its ability to act. See also e.g. this post.
Since most people who actually thought about this seem to arrive at the conclusion that offense would win, I think being confident that defense would win seems off. What are your arguments?
Currently, we observe that leading models get open-sourced roughly half a year later. It’s not a stretch to assume this will also happen to takeover-level AI. If we assume such AI will look like LLM agents, it would be relevant to know the probability that such an agent, somewhere on earth, would try to take over.
Let’s assume someone, somewhere, will be really annoyed with all the safeguards and remove them, so that their LLM will have a 99% probability of just doing as it’s told, even if that might be highly unethical. Let’s furthermore assume an LLM-based agent will need to take 20 unethical actions to actually take over (the rest of the required actions won’t look particularly unethical to the low-level LLMs executing them, in our scenario). In this case, there would be a 0.99^20 ≈ 82% chance that an LLM-based agent takes over, for any bad actor giving it this prompt.
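To make the compounding explicit, here is a minimal sketch of that back-of-the-envelope arithmetic. The 99% per-step compliance and the 20 unethical steps are the assumptions stated above; treating the steps as independent is an extra simplification on my part.

```python
# Minimal sketch of the back-of-the-envelope arithmetic above.
# Assumptions (for illustration): 20 unethical steps, each complied with
# independently at a 99% probability.
p_comply = 0.99          # per-step probability the safeguard-stripped LLM just does as it's told
n_unethical_steps = 20   # unethical actions the takeover plan requires

p_takeover = p_comply ** n_unethical_steps
print(f"P(all {n_unethical_steps} unethical steps executed) = {p_takeover:.0%}")  # ~82%
```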
I’d be less worried if it were extremely difficult, and required lots of resources, to get LLMs to take unethical actions when asked to. For example, if safety against jailbreaking were highly robust, and even adversarial fine-tuning of open-source LLMs wouldn’t break it.
Is that something you see on the horizon?
I mean, it would be perfectly intent-aligned: it carries out its orders to the letter. The only problem is that carrying out its orders involves a takeover. So no, I don’t mean its own goal, but a goal someone gave it.
I guess it’s a bit different in the sense that instrumental convergence states that all goals will lead to power-seeking subgoals. This statement is less strong; it just says that some goals will lead to power-seeking behaviour.
Argument against recursive self-improvement: you need algorithms, compute, and data for AI. Self-improvement only works on the algorithms.
Maybe self-improvement works, but only up to a ceiling determined by compute and data, which may be << superintelligence.