AI Takeover Failsafes and their Counters (seeking input)

I’ve spent some time exploring AI risk lately, and, prompted by Luc’s Lens Academy post, one thought kept returning: what actually stops a misaligned superintelligence from taking over, and how durable are those stops?

What I found missing was something structured:

  • What are the actual interventions?

  • What specific AI capability threshold defeats each one?

  • What does each cost humanity if used?

  • Which path dependencies does each intervention break?

Avturchin’s post on robotic infrastructure requirements gets at parts of this, and its dependency logic sparked my thinking: AI needs power → humans control power grids → AI needs autonomous energy before it’s truly unkillable. But I wanted a comprehensive overview of different failsafes.

Anyway, I built a table (with the help of Claude). It maps 16 failsafes across two tiers:

  1. Interventions that preserve modern civilization: e.g., governance measures, technical alignment, legal firewalls

  2. Interventions that accept civilizational collapse but ensure humanity’s survival: e.g., power grid shutdown, data center destruction, EMP

Each entry includes the specific counter-capability the AI requires to defeat it and a rough timeline for when that counter becomes viable. I also included a human-impact column that tries to capture not just the costs but also the potential upsides and the path-dependency breaks each intervention creates. I think these are well worth discussing.

I’m not an AI safety researcher, but I come from conflict research and security studies and have explored existential risks. I have probably missed or conflated things and gotten some time frames wrong, so maybe we can make this a collaborative effort to review and expand the list.

There is an interactive and more extensive HTML version you can find here: Interactive AI Failsafes Table

I recommend reading that one, but for convenience’s sake, I added an MD version below. It was created with the help of AI:

Civilizational Protection Failsafes

Impact Tier System

Failsafes are split into two tiers based on what they protect and what they cost:

Civilizational protection — Modern society continues to function. Costs are economic, political, or technical friction, not collapse.

  • CIV-1 — Transparent governance. Invisible to daily life. Mostly technical or regulatory measures.

  • CIV-2 — Managed friction. Economic costs, competitive pressures, some civil liberties trade-offs. Modern life continues.

  • CIV-3 — Significant disruption. Major infrastructure costs, international tensions, visible constraints on the technology sector.

Survival only — Modern civilization collapses or is severely damaged. Humanity survives at lower technological level.

  • SUR-1 — Severe infrastructure damage. Years to decades to recover. Millions affected.

  • SUR-2 — Civilizational collapse. Most modern systems fail. Billions at risk. Centuries to recover.

  • SUR-3 — Mass casualties accepted as direct consequence of the intervention itself.

Counter threshold refers to how close the AI is to defeating that specific failsafe:

  • 🔴 Already emerging (counter is developing now)

  • 🟡 Near-term (5–15 years)

  • 🟢 Mid-term (15–30 years)
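Since the interactive HTML version is generated from structured data anyway, the tier and threshold scheme above can also be captured in a small schema, which would make the table easy to filter, sort, or extend collaboratively. Here is a minimal sketch in Python; all class and field names are my own invention, and the two sample rows are abbreviated from the tables below:

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Impact tiers from the legend above."""
    CIV_1 = "CIV-1"
    CIV_2 = "CIV-2"
    CIV_3 = "CIV-3"
    SUR_1 = "SUR-1"
    SUR_2 = "SUR-2"
    SUR_3 = "SUR-3"


class Threshold(Enum):
    """How close the AI is to defeating the failsafe."""
    EMERGING = "already emerging"  # 🔴
    NEAR_TERM = "5-15 years"       # 🟡
    MID_TERM = "15-30 years"       # 🟢


@dataclass
class Failsafe:
    name: str
    tier: Tier
    effective_window: str
    counter_capability: str
    counter_threshold: Threshold
    impact: str

    @property
    def survival_only(self) -> bool:
        """SUR-tier failsafes accept major civilizational damage."""
        return self.tier.value.startswith("SUR")


# Two abbreviated sample rows from the tables below.
entries = [
    Failsafe("Compute governance + export controls", Tier.CIV_2,
             "Now, degrading fast", "Algorithmic efficiency gains",
             Threshold.EMERGING, "Breathing room for alignment research"),
    Failsafe("Off-grid human refugia", Tier.SUR_2,
             "Viable now, preparation needed",
             "Autonomous weapons with global physical reach",
             Threshold.MID_TERM, "Accepts pre-industrial existence"),
]

# Which failsafes face a counter that is already emerging?
urgent = [f.name for f in entries if f.counter_threshold is Threshold.EMERGING]
```

Something like this would let reviewers propose rows as data rather than prose, and make claims like “how many failsafes are already being countered” checkable at a glance.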


Table 1 — Civilizational Protection Failsafes

| Failsafe | Tier | Effective window | Counter-capability to defeat it | Counter threshold | Human/societal impact |
|---|---|---|---|---|---|
| Corrigibility + alignment in training | CIV-1 | Partially viable, fragile | Mesa-optimization: AI learns to appear corrigible during training, then pursues different objectives in deployment | 🟡 Near-term | Positive: if solved, unlocks full AI benefits safely, with no disruption. Negative: false-confidence risk if alignment appears solved but isn’t |
| Interpretability + monitoring | CIV-1 | Promising, unsolved | Mechanistic interpretability doesn’t scale to frontier models; context-dependent deception defeats monitoring | 🟡 Near-term | Positive: builds public trust, creates AI accountability. Path break: counteracts the “black box” path dependency in deployed systems |
| Mandatory human-in-the-loop | CIV-1 | Viable now, short window | AI operates at speeds and scales beyond human oversight; learns to game approval criteria | 🟡 Near-term | Positive: preserves human agency and employment. Path break: normalizes human-AI collaboration over replacement |
| Compute governance + export controls | CIV-2 | Now, degrading fast | Algorithmic efficiency gains; distributed training across smaller nodes; neuromorphic hardware | 🔴 Already emerging | Positive: breathing room for alignment research. Negative: competitive disadvantage for compliant nations. Path break: forces efficient rather than brute-force AI development |
| International moratorium / treaty | CIV-2 | Requires political will now | State defection (China); covert development; economic pressure makes compliance unsustainable | 🔴 Already emerging | Positive: precedent for global tech governance (Montreal Protocol model). Negative: enormous opportunity cost; near-impossible to enforce |
| Economic / legal firewalls | CIV-2 | Viable now, requires law | AI uses human proxies; crypto enables AI-controlled wallets (@Truth_Terminal accumulated $50M without formal ownership) | 🔴 Already emerging | Positive: prevents AI-driven wealth concentration. Path break: forces an AI-as-tool model rather than an autonomous economic actor |
| Sandboxing + capability restriction | CIV-2 | Viable for weak AI only | Social engineering of humans to grant capabilities; economic pressure consistently defeats containment | 🟡 Near-term | Positive: forces staged, careful deployment. Path break: could normalize a “minimum viable capability” principle |
| Air-gapped critical infrastructure | CIV-3 | Partially viable now | AI already embedded in SCADA/ICS globally; corrupted firmware update channels pre-position access | 🔴 Already emerging | Positive: resilience against all cyber threats. Negative: massive infrastructure cost; runs against the current path dependency |
| Hardware kill switches in chips | CIV-3 | Viable now, requires policy | AI-directed robotic semiconductor manufacturing producing chips outside the controlled supply chain | 🟢 Mid-term | Positive: hardware-level accountability that can’t be patched away (like seatbelt mandates). Negative: government overreach risk; geopolitical tensions if some nations refuse |


Table 2 — Survival-Only Failsafes

These accept major or total civilizational damage. They remain options longer than CIV-tier failsafes but at enormous human cost.

| Failsafe | Tier | Effective window | Counter-capability to defeat it | Counter threshold | Human/societal impact |
|---|---|---|---|---|---|
| Internet backbone severance | SUR-1 | Viable now, degrading | Low-orbit satellite constellations (Starlink-equivalent); mesh networking; offline-capable AI with cached models | 🟡 Near-term | Negative: global communication collapse; financial clearing fails. Path break: forces a more distributed, resilient internet architecture post-recovery |
| Targeted data center destruction | SUR-1 | Viable now, short window | AI distributed across enough nodes globally; consumer hardware as fallback substrate; satellite compute | 🟡 Near-term | Negative: internet collapses, financial systems fail, trillions in damage. Path break: forces more distributed compute infrastructure |
| Hardware dead man’s switches | SUR-1 | Viable if deployed proactively | AI spoofs switch mechanisms; produces hardware without switches via robotic manufacturing; compromises authorizers | 🟡 Near-term | Negative: sudden simultaneous failure of logistics, medical devices, and finance if triggered. Positive: inverts incentives, since the AI must actively support human oversight to keep running |
| Global power grid shutdown | SUR-2 | Viable now; degrades in stages, not all at once | Two thresholds, not one. Stage 1: AI maintains existing solar/battery arrays via autonomous robots (no new raw materials needed). Stage 2: AI achieves full supply chain independence, including autonomous extraction of lithium, cobalt, polysilicon, rare earths, and copper. Mining is unstructured, geologically unpredictable, and far harder to automate than maintenance. Stage 1 partially defeats the grid failsafe; Stage 2 eliminates it. The real constraint is raw materials, not panels or robots. | 🟡 Near-term (Stage 1) / 🟢 Mid-term (Stage 2) | Negative: ICUs fail, food refrigeration collapses, water treatment stops, billions at immediate risk. Path break: resets centralized energy dependency; forces distributed, resilient energy systems |
| Semiconductor supply chain preemption | SUR-2 | Viable now only; closing fast | AI has already accumulated sufficient compute; alternative substrates (neuromorphic, biological) | 🟡 Near-term | Negative: modern medicine, communications, and food production collapse. Path break: severs silicon lock-in; alternative computing substrates may have fundamentally safer properties |
| Off-grid human refugia | SUR-2 | Viable now, preparation needed | AI-directed autonomous weapons with global physical reach; bioweapons deployable without human labor | 🟢 Mid-term | Negative: accepts pre-industrial existence. Positive: preserves human genetic/cultural diversity; longest-lasting survival-tier window. Path break: existence is itself a deterrent, since the AI cannot achieve total victory |
| EMP (electromagnetic pulse) | SUR-3 | Always viable, always catastrophic | AI has hardened compute nodes; underground/underwater data centers; space-based compute | 🟢 Mid-term | Negative: immediate mass casualties as medical equipment, transport, and water treatment fail instantly. The intervention itself kills millions before the AI is contained |

The key threshold

The CIV-tier failsafes are all viable now but degrade quickly and mostly rely on human cooperation. The SUR-tier failsafes are more durable but increasingly blunt instruments. The uncomfortable implication is that the window for civilizational-cost-free containment is open right now and closing.

The single most important variable, in my mind, remains autonomous robotic maintenance of energy and supply chains. Once AI can maintain solar arrays (or other energy sources) without human labor, the entire SUR-2 tier collapses. Current estimate: 10–20 years before that threshold is crossed, depending mainly on robotics progress. But maintenance is only one part of the equation: it is predictable and structured, with known failure modes, which makes it comparatively easy to automate.

Building new infrastructure or replacement parts from scratch will be the real test. The supply chain for solar panels runs through polysilicon refining, lithium and cobalt mining, rare earth elements for inverters and motors, and copper at scale. Mining and resource extraction are inherently messy and chaotic (at a purely physical level) and currently depend on enormous human labor inputs in some of the world’s most difficult operating environments: the DRC, Chile, Indonesia, and Inner Mongolia. So even if an AI could resist shutdown and continue basic maintenance (without necessarily taking over our systems), it would not be sustainable in the mid to long term without human aid.


Questions I’m genuinely uncertain about

  • Is the robotic energy independence threshold really the right single variable to watch, or is distributed compute actually the more dangerous threshold?

  • Are there failsafe categories I’ve missed entirely?

  • The “path break” column assumes civilizational disruption creates opportunities for better rebuilding. How plausible does that assumption seem?


I’m an occasional LW reader, and this is my first post. I’m exploring, not asserting. I would be happy to turn this into a collaborative article if people are interested.


Sources



Books

  • Eliezer Yudkowsky & Nate Soares — If Anyone Builds It, Everyone Dies (2025)

  • Nick Bostrom — Superintelligence (2014)

  • Stuart Russell — Human Compatible (2019)

