Dead-switches as AI safety tools

Time estimate: 1-minute read
+ about 6 more minutes for the additional arguments

Some obvious technical ideas for improving AI safety are largely overlooked, or at least not discussed enough. I claim that the concept of dead-switches for AI control is one of them.

A dead man’s switch, a dead-switch, is originally a technical device that triggers if its human operators become incapacitated. It is a fail-safe. A famous example is the Soviet “Dead Hand” nuclear deterrent.

In AI terms, it is a safety mechanism that works like a kill switch, but triggers automatically by default if system integrity is compromised. You don’t have to detect and react to a threat first.

Viewed broadly, beyond its original meaning, the concept of a dead-switch overlaps with other ideas, like tripwires. But tripwires monitor for specific red flags and may never trigger unless those conditions are met. A dead-switch, on the other hand, should always trigger unless everything goes as planned: if nobody checks in and verifies that all is well, it fires.

This idea is simple and straightforward. And yet, it also allows for flexibility and modification. It’s a concept that can be baked into AI safety work in various ways.

I think that serious AI safety experts should have serious discussions about dead-switches.

That’s the case I am making. My arguments are found further below (optional reading).

Expanding the idea: I further believe that this concept can be integrated in practice, both on a technical and on a governance level. On the governance level, it is less important to keep the original, technical meaning intact; the point is to have safety measures to fall back on if things go wrong.

If you want, you can stop reading here. The main point is made.

“Technical detail”

Potential targets for technical AI dead-switches: Power, memory, APIs, data encoding functionalities, central compute.

How to: How do you build a technical dead-switch?

In many ways, depending on what you target. Hardware implementations are safer than software ones, but you can stack and layer these things into a more comprehensive safety design.

Here is one simple idea: hook up a mechanical timer that controls an electric switch, and route power for a critical function (like a GPU cluster) through that switch. If the timer is not reset by a human operator, the switch shuts down the power circuit.
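
To make the default-trigger logic concrete, here is a minimal software sketch of the same timer-and-switch idea. The `cut_power()` hook, the daily check-in window, and the other names are illustrative assumptions (a real hook might be a command to a networked power distribution unit); the mechanical version described above is still preferable, since software can be tampered with.

```python
import threading
import time

# Sketch only: cut_power() is a hypothetical hook, e.g. a command to a
# networked power distribution unit feeding the GPU cluster.

RESET_WINDOW_SECONDS = 24 * 60 * 60  # operator must check in at least daily


def cut_power() -> None:
    print("Dead-switch fired: opening the power circuit.")


class DeadSwitch:
    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self._last_reset = time.monotonic()
        self._lock = threading.Lock()

    def reset(self) -> None:
        """Called by a human operator to confirm that all is well."""
        with self._lock:
            self._last_reset = time.monotonic()

    def watch(self) -> None:
        """Fires by default: triggers unless reset() was called in time."""
        while True:
            time.sleep(60)
            with self._lock:
                expired = time.monotonic() - self._last_reset > self.window
            if expired:
                cut_power()
                return


switch = DeadSwitch(RESET_WINDOW_SECONDS)
threading.Thread(target=switch.watch, daemon=True).start()
```

Note the asymmetry with a kill switch: nobody has to notice anything wrong for this to fire; someone has to actively confirm that nothing is wrong for it not to.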

Real talk: Technical considerations and cost-benefit tables for the various solutions are not in scope of this post. I’m aiming for awareness and discussion.

I will make my case below though.

General arguments for taking dead-switch dynamics seriously (in the context of AI)

A few people occasionally point out this idea, especially those who have realized that proper AI governance should be complemented with technical measures. See for example this LW post by Mau, AI Governance Needs Technical Work.

I also make the case for more technical work early on in my post Proactive AI Control: A Case for Battery-Dependent Systems, where I tie this dead-switch idea to the notion of relying on battery power.

Honestly, I am sure many here can make really strong cases for (and against) this idea under various circumstances. But I don’t really see any serious cases against it, not on LW or on the Alignment Forum.

If any veteran control/safety experts read this, please let us know what you think, especially why you think this is not discussed more.

My top-four arguments

Here are my top four arguments for why I think dead-switches would be an improvement to AI safety.

  1. We are already discussing kill-switches anyway.

    Eliezer, for one, has recently said in connection with his book-promoting tour* that one thing we can do while others are building superintelligence is to build an off-switch.

    *If Anyone Builds It, Everyone Dies, by Nate Soares and Eliezer Yudkowsky

    Skeptics may say that technical measures like these, if implemented carelessly, are false friends that can inspire a false sense of security, and that it is therefore better to actually solve alignment problems.

    I agree that a false sense of security is a risk, and I agree that implementation, as always, is the tricky part. I still think the idea is highly valuable.

    First of all, alignment may well be impossible for superintelligences, whereas control of their progenitors is not. (A powerful SI, unlike its early progenitors, could realistically circumvent any human-made dead-switch.)

    Secondly, control work does not prevent alignment work. Do both.

  2. Dead-switches are common in other kinds of critical technology, like nuclear power and transportation.

    Enough said.

  3. Some dead-switch solutions may be easier to promote to key figures than many other safety investments.

    Why do I think this?

    Because the investors, regulators, and company developers who believe the AI models they are backing will succeed (or who are just not afraid of AI in general) are not expecting the dead-switch(es) to be triggered anyway. And if one is triggered carelessly, it can simply be reset.

    In fact, putting dead-switches on AIs could even help companies regain control of rogue agents, early on, and without other drastic measures.

    Timed switches can be used for this purpose, or switches that require several parameters to coincide in order to be triggered.

    Examples:

    a. Say that you have a hardware-enforced, encryption-locked dead-switch, which will shut down all processing/memory functions for your AI model in training, apart from, say, the bare minimum that avoids permanently harming the model weights.

      But it will only trigger if it is not reset within one week.

      Then the only reason it would not be reset is that the people with access are unable or unwilling to reset it.

      For an AI that suddenly hijacks control, this measure will, in theory, put an end to the rebellion after a given time. (A minimal sketch of this reset logic appears after the list.)

    b. A subtly misaligned AI in deployment gradually deteriorates toward a tipping point. It is now reaching criticality: about to go haywire and start spewing out shocking and dangerous content, including info-hazards.

      After enough early indicators flag what is coming, an independent dead-switch triggers and automatically cuts off all API access to regular users. The company then does damage control on the backend, after realizing what happened.

  4. You can apply this concept broadly in political governance.

    Consider the suggestion that all leading AI laboratories in the US must pause their operations and undergo targeted state auditing every year. How likely is it that OpenAI, DeepMind, xAI, Anthropic, Meta, and anyone else will agree to this policy upfront?

    We know the answer. They won’t.

    But how likely is it that they will all agree that they all must undergo auditing within a month, on the condition that a major incident does take place in one of the companies?

    Let’s say that if major incident X happens, and it is on the pre-approved list Y of contingencies, then this rule triggers. The incident company must furthermore halt operations until the audit is completed. (A toy sketch of such a trigger rule is given further below.)

    This setup is a more viable policy push. And an auditing process that only applies to the sinner is of course even more likely to be approved.

    It is also much more likely that list Y can contain items that the companies would not agree to work proactively on mitigating, but which they can be convinced to include as triggers, should those events happen. Especially if they consider the events unlikely to happen anyway.
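
To illustrate example (a), here is a minimal sketch of the reset logic. The shared OPERATOR_KEY, the one-week window, and the suspend_training() hook are all hypothetical, and a real implementation would enforce this in hardware so that software-level compromise cannot simply disable it. The point is that the switch fires by default, and only a valid, signed human check-in postpones it.

```python
import hashlib
import hmac
import time

# Sketch of example (a). OPERATOR_KEY, the one-week window and
# suspend_training() are illustrative assumptions; a real version would be
# hardware-enforced rather than a Python process.

RESET_WINDOW = 7 * 24 * 60 * 60  # one week
OPERATOR_KEY = b"replace-with-a-real-secret"

last_valid_checkin = time.monotonic()


def suspend_training() -> None:
    print("No valid check-in for a week: suspending compute, preserving weights.")


def valid_checkin(message: bytes, signature: str) -> bool:
    """A reset only counts if the check-in is signed with the operator key."""
    expected = hmac.new(OPERATOR_KEY, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


def process_checkin(message: bytes, signature: str) -> None:
    global last_valid_checkin
    if valid_checkin(message, signature):
        last_valid_checkin = time.monotonic()


def tick() -> None:
    """Run periodically; fires by default unless a valid check-in arrived."""
    if time.monotonic() - last_valid_checkin > RESET_WINDOW:
        suspend_training()
```

An AI that hijacks the surrounding software stack but not the operator key cannot produce a valid check-in, which is exactly why the timer-plus-signature combination, and the hardware enforcement behind it, does the work here.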

I will make one more point here. Adding a small threat of punishment to any agreement makes it more likely to be upheld. This is classic game theory.

A dead-switch, whether in physical form or as a legal mechanism, can be a powerful regulatory tool here, because it also works as a stick: if you don’t watch out, it will fire. But if you behave, it won’t, and people will praise you for taking safety seriously.
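
As a toy illustration of the contingency-list mechanism from argument 4, here is one way the trigger rule could be written down. The incident categories, the 30-day deadline, and the halt condition are illustrative placeholders, not a concrete policy proposal.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Toy sketch of the trigger rule from argument 4. The incident categories,
# the 30-day deadline and the halt condition are placeholders.

PRE_APPROVED_LIST_Y = {
    "loss_of_control_incident",
    "model_self_exfiltration",
    "major_security_breach",
}


@dataclass
class AuditObligation:
    lab: str
    deadline: date
    halt_operations: bool  # only the incident lab must halt until audited


def obligations(incident_lab: str, incident_type: str,
                all_labs: list[str], today: date) -> list[AuditObligation]:
    """If a listed incident occurs at one lab, every lab owes an audit within
    30 days, and the incident lab must also halt until its audit completes."""
    if incident_type not in PRE_APPROVED_LIST_Y:
        return []
    deadline = today + timedelta(days=30)
    return [
        AuditObligation(lab, deadline, halt_operations=(lab == incident_lab))
        for lab in all_labs
    ]
```

Writing the rule down this explicitly is the whole point: everyone can see in advance exactly what fires, when, and at whom.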

We need to discuss built-in safety more

My stance on AI safety right now is quite simple: reactive solutions are inferior, and the most reactive solutions are those that never get implemented in time to make any difference. In general, any intrinsic safety thinking is good, if for nothing else then because it sets an example of how we should think.

What do you think about the concept of dead-switches? Do you think technical dead-switches should be applied to existing systems? Why, or why not? Do you think the idea has a place in governance? If you are working on AI safety, let us know.

Oftentimes, obvious solutions are completely valid, even if the problem is complex. We won’t know unless we consider them carefully.