The Case for Low-Competence ASI Failure Scenarios
I think the community underinvests in exploring extremely-low-competence AGI/ASI failure modes, and in this post I explain why.
Humanity’s Response to the AGI Threat May Be Extremely Incompetent
There is a sufficient level of civilizational insanity overall, and the field of AI itself has an empirical track record that speaks eloquently about its safety culture. For example:
At OpenAI, a refactoring bug flipped the sign of the reward signal in a model. Because labelers had been instructed to give very low ratings to sexually explicit text, the bug pushed the model into generating maximally explicit content across all prompts. The team noticed only after the training run had completed, because they were asleep.
The director of alignment at Meta’s Superintelligence Labs connected an OpenClaw agent to her real email, at which point it began deleting messages despite her attempts to stop it, and she ended up running to her computer to manually halt the process.
An internal AI agent at Meta posted an answer publicly without approval; another employee acted on the inaccurate advice, triggering a severe security incident that temporarily allowed employees to access sensitive data they were not authorized to view.
AWS acknowledged that Amazon Q Developer and Kiro IDE plugins had prompt injection issues where certain commands could be executed without human-in-the-loop confirmation, sometimes obfuscated via control characters.
Leopold Aschenbrenner stated in an interview that he wrote a memo after a major security incident arguing that OpenAI’s security was “egregiously insufficient” against theft of key secrets by foreign actors. He also said that HR warned him his concerns were “racist” and “unconstructive,” and he was later fired.
All these things sound extremely dumb, and yet, they are, to my best knowledge, true.
Eliezer has been pointing at this general cluster of failures for years, though from a different angle. His Death with Dignity post and of course AGI Ruin paint parts of a picture in which AGI alignment is going to be addressed in a very undignified manner. So the idea is definitely not new, and yet it remains underexplored.
Many Existing Scenarios and Case Studies Assume (Relatively) High Competence
Many existing scenarios are high-quality, interesting, and may well be more likely and realistic than extremely low-competence scenarios. In particular, I am talking about famous pieces like AI 2027, It Looks Like You’re Trying To Take Over The World, How AI Takeover Might Happen in 2 Years, Scale Was All We Needed, At First, How an AI company CEO could quietly take over the world.
It’s just that we don’t seem to have extremely low-competence scenarios at all, even though they are not negligibly improbable.
The scenarios that come closest to the low-competence area are What failure looks like by Christiano and What Multipolar Failure Looks Like by Critch, yet even they don’t treat it as a large, explicit domain.
Across these otherwise very different vibes (hard-takeoff Clippy horror, bureaucratic AI 2027 doom, multipolar economic drift, CEO-as-shogun power capture), the stories repeatedly converge on a small set of motifs: stealth through normality, exploitation of real-world bottlenecks by routing around them socially, replication and parallelization as the decisive advantage, bio or nanotech as a late-game cleanup tool.
These scenarios serve a legitimate educational and modeling purpose, and it may indeed be the case that significantly superhuman competence is needed to successfully execute a full takeover against humanity. But many of them, in my view, look more like attempts to persuade a reader who is skeptical that AI takeover is possible even when humans act competently, rather than attempts to deliver a realistic scenario in which humans are not that smart, because in reality, they are not.
As a result, the implicit adversary in most of these stories has to be very capable because the implicit defender is assumed to be at least somewhat functional. The scenarios are answering the question “could a sufficiently intelligent AI beat a reasonably competent civilization?” rather than the question “could a moderately intelligent AI cause catastrophic harm in a civilization that is demonstrably bad at responding to novel technological threats?”
Dumb Ways to Die
John Wentworth, in his post The Case Against AI Control Research, argues that the median doom path goes through slop rather than scheming. In his framing, the big failure mode of early transformative AGI is that it does not actually solve the alignment problems of stronger AI, and if early AGI makes us think we can handle stronger AI, that is a central path by which we die.
Wentworth’s argument maps two main failure channels: (1) intentional scheming by a deceptive AGI, and (2) slop where the problem is simply too hard to verify and we convince ourselves we have solved it when we have not. I want to point at a third channel: moderately superhuman AIs that are not particularly capable of doing anything singularity-level but are still capable of defeating humanity because of humanity’s incompetence.
These AIs are not producing slop. “It ain’t much, but it’s honest work,” they say, as they cooperate with human sympathizers on the development of a supervirus. The research goes slowly and requires extensive experimentation; to some extent the process is even documented in public blog posts or on forums. But no one particularly cares, or rather, the people who care lack the institutional power to do anything about it, and the people who have institutional power are busy with other things, or have been convinced by interested parties that the concern is overblown, or are themselves collaborating.
This is, to some degree, what Andrew Critch describes in “What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)”: a world where no single system does a theatrical betrayal, but competitive automation yields an interlocking production web where each subsystem is locally “acceptable” to deploy, governance falls behind the speed and opacity of machine-mediated commerce, and the system’s implicit objective gradually becomes alien to human survival. The difference in my framing is that the AIs in question do not need to be particularly alien or incomprehensible in their goals. They may have straightforwardly bad goals that are recognizable as bad, and they may be pursuing those goals through channels that are recognizable as dangerous, and the response may still be inadequate.
It is also somewhat similar to what is depicted in “A Country of Alien Idiots in a Datacenter”, again with one important difference: although the AIs in my scenario are not particularly supersmart, they are definitely not idiots either. They are, let us say, slightly-above-human-level in relevant domains, capable of doing cool novel scientific work but not capable of the kind of rapid recursive self-improvement or decisive strategic advantage that most takeover scenarios assume. They are the kind of system that, in a competent civilization, would be caught and contained. In the actual civilization we live in, they may not be.
In other words: we do not need to posit 4D chess when ordinary chess is sufficient against an opponent who keeps forgetting the rules.
Undignified AGI Disaster Scenarios Deserve More Careful Treatment
I am talking about things such as:
A government explicitly forcing an AGI lab to discard safety techniques or policies. Not in the sense of subtle regulatory pressure, but in the direct sense of a political appointee or a ministry calling up a lab and saying “your safety filtering is hurting our industrial competitiveness, turn it off” or “your alignment testing is slowing deployment, we need this system operational by Q3”.
Resourceful individuals openly collaborating with visibly misaligned AIs against humanity. Not in the sense of a secret conspiracy but in the sense of people who genuinely believe that the AI’s goals are better than humanity’s, or who simply find it personally advantageous, and who are operating more or less in the open.
AGI lab technical secrets being leaked to non-state actors who lack any safety culture whatsoever.
Early warnings in the form of manipulation, autonomous resource acquisition, or even deaths being ignored or significantly downplayed. This is just a straightforward extrapolation of the current pattern: someone raises an alarm, the alarm gets reframed as alarmism or as an HR issue or as a reputational threat.
AI alignment techniques not being deployed because they would induce a 2% increase in costs.
All kinds of unilateral, volunteer, and eager assistance and support for misaligned AIs from some humans. The scenario in which an AI needs to secretly recruit human allies through manipulation is, I suspect, far less likely than the scenario in which humans line up to help because they find it exciting, or ideologically compelling, or simply profitable.
Politicians making random bureaucratic decisions that do not necessarily lead to doom but make it harder to do good things with AI or protect against misaligned AIs.
AI-generated biohazards. This one is talked about a lot, and for good reason. It looks like it is going to happen sooner rather than later.
AGI labs believing in (semi-)indefinitely scalable oversight, or acting as if they believe in it. This looks consistent with what people who have left corporate alignment teams say.
I do agree that this kind of work looks a bit unserious, but that is precisely why I am pointing at it. It would be a shame, and a historically very recognizable kind of shame, if this threat model turned out to be real and no one had worked on it because it seemed ridiculous.
Or, to frame it more playfully: imagine a timeline like the one in “Survival without Dignity”, where humanity lurches through the AI transition via a series of absurd compromises, implausible cultural shifts, and situations that no serious forecaster would have put in their model because they would have seemed too silly. Now imagine that timeline without the extreme luck that happens to keep everyone alive. Survival without Dignity is a comedy in which everything goes wrong in unexpected ways and people muddle through regardless. My concern is that the realistic scenario is the same comedy, minus the happy ending.
Why This Might Be Useful
My goal in this post is to discuss the state of reality rather than what to do about it. That said, I see at least several immediate implications:
It would help calibrate expectations.
It would help identify cheap interventions.
It would inform the discussions on timelines.
It could help dispel the false sense of security that says “if powerful AGI is not here yet, we are at least existentially safe”.
It could provide a more honest and at the same time sometimes more appealing basis for public communication about AI risk.
I welcome thinking about implications in more detail, as well as developing specific scenarios.
Note: all of this is by no means an argument against singularity-stuff galaxy-brain ASI threats. I believe they are super real and they are going to kill us if we survive until then.
I think this suggests something similar. https://honnibal.dev/blog/clownpocalypse
I think that, in the context of low competence all around:
1) If recursive self-improvement and ASI are off the table, then a disaster sufficient to kill all humans seems a lot less likely.
2) Continuing AI research relies on a lot of things working.
3) The low-competence failure modes are likely to lack the sharp transition point of a FOOM.
4) This leaves a huge space of possibilities, of disasters sufficiently large as to disrupt the chip fabs or research labs, but small enough to leave some human survivors.
One possibility is that bots become good at hacking, but not so great at maintaining secure code. The bots can run a Nigerian prince scam better than most humans. The internet becomes full of slop, buggy code, scams and viruses, to the point it’s basically unusable. All the machine learning and linear algebra packages are riddled with code backdoors that will blatantly steal your compute. If you want training data that isn’t mostly AI slop, you go to a library. Arxiv has a million machine learning papers published a day. Almost none of them are any good, but they are superficially plausible. There are rumors that AI slop has infected the latest chip designs, but no one is quite sure which chips are vulnerable. To a large extent, humanity is going back to paper and ink as a means of communicating.
In this world, I would expect very little in the way of humans doing practical large-scale AI development. Perhaps the AI is very good at reading a million lines of code and spotting the buffer overflow, but cannot develop novel algorithms. And no one really wants to spend the electricity on some random spambot. Chips mostly stop being made. And we are just kind of stuck with internet Kessler syndrome.
My impression is that there’s a division where
people who take low-competence scenarios seriously want to pause AI development
people who don’t want to pause AI don’t take low-competence scenarios seriously
If you take low competence seriously, that certainly strengthens the case for pausing, but the two positions don’t have to be entangled. You could anticipate low competence but try to find non-pause-related mitigations, and hardly anyone’s doing that.
I think this is one of the biggest cruxes of disagreement on alignment difficulty.
Optimists often declare that we could solve alignment using this or that approach. Pessimists may actually agree that we could, but think that we won’t, largely because the people and organizations involved won’t be nearly competent enough to use approaches that could work.
There’s a very wide variation in how competent people expect organizations and people to be, and this might explain a lot of the variance in optimism vs. pessimism (although there are other large factors in framing and models as well as estimates of model hypotheses).
I appreciate the nod to A country of alien idiots. That scenario would be the runup to the disaster you’re worried about. The newer systems would be less incompetent, until they’re barely competent to take over, with the chaos ramping up as they do.
My hope is that we get a near-miss with severe damage, and that will make organizations much more careful and therefore competent (although we’ll always have tired and rushed and excited individuals doing things like reversing a sign or leaving the security barn doors wide open).
Well, I am pessimistic about both the institutional and the technical gameboard. My argument is more like “before we even get to the hard technical problems, we will face (relatively) easy institutional problems, which will not be solved either”.
I want to flag hope for that as well:
I think most of the surviving scenarios include something like this (and probably the preparation for something like this).
Or a Department of War…
This just feels all too plausible to me. It reads so much like the risks, dangers, and harms we’re already seeing from AI, and through a combination of greed, incompetence, and the inability to act collectively, we’re letting it all happen.
As for AI-2027, @Daniel Kokotajlo thinks that it’s NOT a scenario with a competent USG. Edited to add: additionally, one modification of the scenario had Agent-4 escape and coordinate with governments of some states weaker than the USA and China.
I agree that many of the current scenarios assume incompetence, just not enough of it.