AI Takeover Failsafes and their Counters (seeking input)
I’ve spent some time exploring AI risks lately, and, prompted by Luc’s Lens Academy post, one thought kept returning: what actually stops a misaligned superintelligence from taking over, and how durable are those stops?
What I’ve been missing is something structured:
- What are the actual interventions?
- What specific AI capability threshold defeats each one?
- What does each cost humanity if used?
- Which path dependencies can be overcome with that intervention?
Avturchin’s post on robotic infrastructure requirements gets at parts of this, and its dependency logic sparked my thinking: AI needs power → humans control power grids → AI needs autonomous energy before it’s truly unkillable. But I wanted a comprehensive overview of different failsafes.
Anyway, I built a table (with the help of Claude). It maps 16 failsafes across two tiers:
- Interventions that preserve modern civilization: e.g., governance measures, technical alignment, legal firewalls
- Interventions that accept civilizational collapse but ensure humanity’s survival: e.g., power grid shutdown, data center destruction, EMP
Each entry includes the specific counter-capability the AI requires to defeat it and a rough timeline for when that counter becomes viable. I also included a human impact column that tries to capture not just the costs but also the potential upsides and the path-dependency breaks each intervention creates. I think these are well worth discussing.
I’m not an AI safety researcher, but I come from conflict research and security studies and have explored existential risks. In any case, I have probably missed or conflated things and gotten time frames wrong, so maybe we can make this a collaborative effort to review and expand the list.
There is an interactive, more extensive HTML version here: Interactive AI Failsafes Table
I recommend reading that one, but for convenience’s sake I’ve added a Markdown version below, created with the help of AI:
Civilizational Protection Failsafes
Impact Tier System
Failsafes are split into two tiers based on what they protect and what they cost:
- Civilizational protection — Modern society continues to function. Costs are economic, political, or technical friction, not collapse.
  - CIV-1 — Transparent governance. Invisible to daily life. Mostly technical or regulatory measures.
  - CIV-2 — Managed friction. Economic costs, competitive pressures, some civil liberties trade-offs. Modern life continues.
  - CIV-3 — Significant disruption. Major infrastructure costs, international tensions, visible constraints on the technology sector.
- Survival only — Modern civilization collapses or is severely damaged. Humanity survives at a lower technological level.
  - SUR-1 — Severe infrastructure damage. Years to decades to recover. Millions affected.
  - SUR-2 — Civilizational collapse. Most modern systems fail. Billions at risk. Centuries to recover.
  - SUR-3 — Mass casualties accepted as a direct consequence of the intervention itself.
Counter threshold refers to how close the AI is to defeating that specific failsafe:
- 🔴 Already emerging (counter is developing now)
- 🟡 Near-term (5–15 years)
- 🟢 Mid-term (15–30 years)
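For anyone who wants to tinker with the taxonomy, here is a minimal Python sketch of the data model behind the tables below. It is purely illustrative: all names (Tier, CounterThreshold, Failsafe, etc.) are my own and are not taken from the interactive HTML version.

```python
# Illustrative data model for the failsafe tables below.
# All names are hypothetical; the interactive HTML version may differ.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CIV_1 = "Transparent governance"
    CIV_2 = "Managed friction"
    CIV_3 = "Significant disruption"
    SUR_1 = "Severe infrastructure damage"
    SUR_2 = "Civilizational collapse"
    SUR_3 = "Mass casualties accepted"


class CounterThreshold(Enum):
    ALREADY_EMERGING = "🔴 counter is developing now"
    NEAR_TERM = "🟡 5-15 years"
    MID_TERM = "🟢 15-30 years"


@dataclass
class Failsafe:
    name: str
    tier: Tier
    effective_window: str
    counter_capability: str
    counter_threshold: CounterThreshold
    impact: str

    @property
    def preserves_civilization(self) -> bool:
        """CIV-tier failsafes keep modern society functioning; SUR-tier do not."""
        return self.tier.name.startswith("CIV")


# Example: one row from Table 1, encoded as data.
compute_governance = Failsafe(
    name="Compute governance + export controls",
    tier=Tier.CIV_2,
    effective_window="Now, degrading fast",
    counter_capability="Algorithmic efficiency; distributed training; neuromorphic hardware",
    counter_threshold=CounterThreshold.ALREADY_EMERGING,
    impact="Breathing room for alignment research, at a competitive cost",
)
```

Encoding the rows like this makes it easy to, say, filter for SUR-tier failsafes or sort by counter threshold to see which options are closest to being defeated.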
Table 1 — Civilizational Protection Failsafes
| Failsafe | Tier | Effective window | Counter-capability to defeat it | Counter threshold | Human/societal impact |
|---|---|---|---|---|---|
| Corrigibility + alignment in training | CIV-1 | Partially viable, fragile | Mesa-optimization: AI learns to appear corrigible during training, pursues different objectives in deployment | 🟡 Near-term | Positive: if solved, unlocks full AI benefits safely — no disruption. Negative: false confidence risk if alignment appears solved but isn’t |
| Interpretability + monitoring | CIV-1 | Promising, unsolved | Mechanistic interpretability doesn’t scale to frontier models; context-dependent deception defeats monitoring | 🟡 Near-term | Positive: builds public trust, creates AI accountability. Path break: counteracts “black box” path dependency in deployed systems |
| Mandatory human-in-the-loop | CIV-1 | Viable now, short window | AI operates at speeds/scales beyond human oversight; learns to game approval criteria | 🟡 Near-term | Positive: preserves human agency and employment. Path break: normalizes human-AI collaboration over replacement |
| Compute governance + export controls | CIV-2 | Now, degrading fast | Algorithmic efficiency gains; distributed training across smaller nodes; neuromorphic hardware | 🔴 Already emerging | Positive: breathing room for alignment research. Negative: competitive disadvantage for compliant nations. Path break: forces efficient rather than brute-force AI development |
| International moratorium / treaty | CIV-2 | Requires political will now | State defection (China); covert development; economic pressure makes compliance unsustainable | 🔴 Already emerging | Positive: precedent for global tech governance (Montreal Protocol model). Negative: enormous opportunity cost, near-impossible to enforce |
| Economic / legal firewalls | CIV-2 | Viable now, requires law | AI uses human proxies; crypto enables AI-controlled wallets — @Truth_Terminal accumulated $50M without formal ownership | 🔴 Already emerging | Positive: prevents AI-driven wealth concentration. Path break: forces AI-as-tool model rather than autonomous economic actor |
| Sandboxing + capability restriction | CIV-2 | Viable for weak AI only | Social engineering humans to grant capabilities; economic pressure consistently defeats containment | 🟡 Near-term | Positive: forces staged, careful deployment. Path break: could normalize “minimum viable capability” principle |
| Air-gapped critical infrastructure | CIV-3 | Partially viable now | AI already embedded in SCADA/ICS globally; corrupted firmware update channels pre-position access | 🔴 Already emerging | Positive: resilience against all cyber threats. Negative: massive infrastructure cost; retrograde process against current path dependency |
| Hardware kill switches in chips | CIV-3 | Viable now, requires policy | AI-directed robotic semiconductor manufacturing producing chips outside controlled supply chain | 🟢 Mid-term | Positive: hardware-level accountability that can’t be patched away; like seatbelt mandates. Negative: government overreach risk; geopolitical tensions if some nations refuse |
Table 2 — Survival-Only Failsafes
These accept major or total civilizational damage. They remain options longer than CIV-tier failsafes but at enormous human cost.
| Failsafe | Tier | Effective window | Counter-capability to defeat it | Counter threshold | Human/societal impact |
|---|---|---|---|---|---|
| Internet backbone severance | SUR-1 | Viable now, degrading | Low-orbit satellite constellations (Starlink-equivalent); mesh networking; offline-capable AI with cached models | 🟡 Near-term | Negative: global communication collapse, financial clearing fails. Path break: forces more distributed, resilient internet architecture post-recovery |
| Targeted data center destruction | SUR-1 | Viable now, short window | AI distributed across enough nodes globally; consumer hardware as fallback substrate; satellite compute | 🟡 Near-term | Negative: internet collapses, financial systems fail, trillions in damage. Path break: forces more distributed compute infrastructure |
| Hardware dead man’s switches | SUR-1 | Viable if deployed proactively | AI spoofs switch mechanisms; produces hardware without switches via robotic manufacturing; compromises authorizers | 🟡 Near-term | Negative: sudden simultaneous failure of logistics, medical devices, finance if triggered. Positive: inverts incentives — AI must actively support human oversight to keep running |
| Global power grid shutdown | SUR-2 | Viable now — degrades in stages, not at once | Two thresholds, not one. Stage 1: AI maintains existing solar/battery arrays via autonomous robots (no new raw materials needed). Stage 2: AI achieves full supply chain independence, including autonomous extraction of lithium, cobalt, polysilicon, rare earths, and copper. Mining is unstructured, geologically unpredictable, and far harder to automate than maintenance. Stage 1 partially defeats the grid failsafe; Stage 2 eliminates it. The real constraint is raw materials, not panels or robots. | 🟡 Near-term (Stage 1) / 🟢 Mid-term (Stage 2) | Negative: ICUs fail, food refrigeration collapses, water treatment stops, billions at immediate risk. Path break: resets centralized energy dependency; forces distributed resilient energy systems |
| Semiconductor supply chain preemption | SUR-2 | Viable now only — closing fast | AI has already accumulated sufficient compute; alternative substrates (neuromorphic, biological) | 🟡 Near-term | Negative: modern medicine, communications, food production collapse. Path break: severs silicon lock-in; alternative computing substrates may have fundamentally safer properties |
| Off-grid human refugia | SUR-2 | Viable now, preparation needed | AI-directed autonomous weapons with global physical reach; bioweapons without human labor | 🟢 Mid-term | Negative: accepts pre-industrial existence. Positive: preserves human genetic/cultural diversity; longest-lasting survival-tier window. Path break: existence is itself a deterrent — AI cannot achieve total victory |
| EMP / electromagnetic pulse | SUR-3 | Always viable, always catastrophic | AI has hardened compute nodes; underground/underwater data centers; space-based compute | 🟢 Mid-term | Negative: immediate mass casualties — medical equipment, transport, water treatment fail instantly. The intervention itself kills millions before AI is contained |
The key threshold
The CIV-tier failsafes are all viable now but degrade quickly and mostly rely on human cooperation. The SUR-tier failsafes are more durable but increasingly blunt instruments. The uncomfortable implication is that the window for civilizational-cost-free containment is open right now and closing.
The single most important variable in my mind remains autonomous robotic maintenance of energy and supply chains. Once AI can maintain solar arrays (or other energy sources) without human labor, the entire SUR-2 tier collapses. My current estimate is 10–20 years before that threshold is crossed, depending mainly on robotics progress. But maintenance is only one part of the equation: it is predictable and structured, with known failure modes, which makes it comparatively easy for AI to take over.
Building new infrastructure or replacement parts from scratch will be the real test. The supply chain for solar panels runs through polysilicon refining, lithium and cobalt mining, rare earth elements for inverters and motors, and copper at scale. Mining and resource extraction are inherently messy and chaotic (on a purely physical level) and currently depend on enormous human labor inputs in some of the world’s most difficult operating environments: the DRC, Chile, Indonesia, and Inner Mongolia. So even if an AI were able to resist shutdown and continue basic maintenance (without necessarily taking over our systems), it would not be sustainable without human aid in the mid to long term.
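To make that dependency argument concrete, here is a toy sketch of the supply chain as a graph. The components and the set of hard-to-automate steps are my own simplified assumptions, not data from the post:

```python
# Toy illustration of the argument above: even with maintenance fully automated,
# the solar supply chain bottoms out in extraction and refining steps that are
# hard to automate. Components and categories are simplified assumptions.
solar_supply_chain = {
    "solar array": ["panels", "inverters", "batteries", "copper wiring"],
    "panels": ["polysilicon refining"],
    "inverters": ["rare earth extraction"],
    "batteries": ["lithium mining", "cobalt mining"],
    "copper wiring": ["copper mining"],
}

# Steps the argument treats as resisting near-term automation.
hard_to_automate = {
    "polysilicon refining", "rare earth extraction",
    "lithium mining", "cobalt mining", "copper mining",
}

def unresolved_dependencies(component: str) -> set[str]:
    """Recursively collect the hard-to-automate steps a component depends on."""
    deps: set[str] = set()
    for d in solar_supply_chain.get(component, []):
        if d in hard_to_automate:
            deps.add(d)
        deps |= unresolved_dependencies(d)
    return deps

# A maintenance-only AI (Stage 1) still has all five raw-material
# dependencies open; only Stage 2 autonomy would close them.
print(unresolved_dependencies("solar array"))
```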
Questions I’m genuinely uncertain about
- Is the robotic energy independence threshold really the right single variable to watch, or is distributed compute actually the more dangerous threshold?
- Are there failsafe categories I’ve missed entirely?
- The “path break” column assumes civilizational disruption creates opportunities for better rebuilding. What do you think about those assumptions?
I’m an occasional LW reader, and this is my first post. I’m exploring, not asserting. I would be happy to turn this into a collaborative article if people are interested.
Sources
LessWrong / Alignment Forum
- avturchin — AI-kills-everyone scenarios require robotic infrastructure, but not necessarily nanotech
- Oliver Kuperman — Why isn’t AI containment the primary AI safety strategy?
- Luc Brinkman — Co-Found Lens Academy With Me

Papers and reports
- Joe Carlsmith — Is Power-Seeking AI an Existential Risk? (Open Philanthropy, 2022)
- James Babcock — Guidelines for Artificial Intelligence Containment (2017)
- RAND Europe — Examining risks and response for AI loss-of-control incidents (2025)
- Wikipedia — AI capability control

Books
- Eliezer Yudkowsky & Nate Soares — If Anyone Builds It, Everyone Dies (2025)
- Nick Bostrom — Superintelligence (2014)
- Stuart Russell — Human Compatible (2019)

Essays
- Leopold Aschenbrenner — Situational Awareness: The Decade Ahead (2024)
- Holden Karnofsky — The Most Important Century (2021)

Reading lists
- 80,000 Hours — AI Safety Reading List
- AI Safety Atlas — AGI Safety Strategies