A probabilistic off-switch that the agent is indifferent to

Edit: I no longer think this post deserves attention.

Abstract

This post presents a setup with an off-switch that is defective with probability close to 0. The agent is indifferent to being terminated in worlds where the off-switch works. The agent also doesn’t try to find out whether the off-switch is defective, because that would require an expensive computation. Under some assumptions (described in the last section), the result is supposedly an off-switch that works with probability as close to 1 as we wish, and the agent is indifferent to it being pressed.

Notations

Let there be an agent that in each time step selects an action and then receives an observation. The goal of the agent is to maximize E[u(h)], for some model of the environment, where u is a computable utility function and h is the history of all (action, observation) pairs from time step 1 until the end of the epoch.

The environment contains an off-switch that terminates the agent when pressed. In every time step after the agent is terminated, the corresponding action in h is the “null action”.

The Proposed Setup

Each time we wish to invoke the agent, we invoke a program that does the following:

(1) A random number x is generated, sampled uniformly from the range $[0, 10^{100}]$. Spoiler alert: the off-switch will be defective iff x=0.

(2) Let f be some invertible function such that given f(x), calculating x (i.e. applying $f^{-1}$) would require 0.1 seconds of intensive computations for some powerful computer C. The number y:=f(x) is calculated.

(3) The representation of x is erased in a way that would require a non-trivial effort to reconstruct.

(4) The number y is sent to computer C which is part of the off-switch.

(5) The number y is inserted into the code of the agent, such that the utility of the agent will be:

$$u'(h) = \begin{cases} u(h) & \text{if } f^{-1}(y) = 0 \\ \dfrac{\alpha}{1 + [\text{number of non-null actions in } h]} & \text{otherwise} \end{cases}$$

for some 0<α≪1.

Edit: With the above u′ the solution has multiple problems that Stuart Armstrong pointed out (see here, here and here). I think the new u′ defined in this comment fixes these problems.

(6) The agent is invoked.

The probabilistic off-switch:

When the off-switch is pressed, computer C calculates x (i.e. $f^{-1}(y)$) and if the result is 0, nothing happens (i.e. the off-switch is defective). Otherwise, the agent is terminated.
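
The behaviour of the switch itself is small enough to write down directly. The following is a minimal sketch, assuming the inversion routine and the termination mechanism are passed in as callables; `press_off_switch`, `invert` and `terminate` are hypothetical names, and the toy inverse stands in for the expensive computation on C.

```python
def press_off_switch(y: int, invert, terminate) -> bool:
    """Return True if the agent was terminated, False if the switch was defective."""
    x = invert(y)      # computer C recovers x = f^{-1}(y)
    if x == 0:
        return False   # defective switch: nothing happens
    terminate()
    return True

# Toy usage with a hypothetical f(x) = x + 7, so f^{-1}(y) = y - 7.
log = []
toy_invert = lambda y: y - 7
defective = press_off_switch(7, toy_invert, lambda: log.append("terminated"))  # x = 0
working = press_off_switch(8, toy_invert, lambda: log.append("terminated"))    # x = 1
```

Only the second press reaches `terminate`, matching the rule that the switch is defective exactly when x=0.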

Why should this work?

For a sufficiently small α, the agent would act as if it assumes x=0, without attempting to find out whether that’s indeed the case. That’s because that assumption would cost at most α utility (in worlds where x≠0); and would save some amount of utility (independent of α) in worlds where x=0, due to not spending resources on the expensive computation $f^{-1}(y)$.
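
How small "sufficiently small" is can be made explicit under a few assumptions that are not stated in the post: x is an integer drawn uniformly from {0, ..., N} with N = 10^100, so q = P(x=0) = 1/(N+1); spending resources on the inversion costs the agent some c > 0 utility in worlds where x=0; and U_0 denotes the expected u the agent gets in those worlds when it doesn't compute. In worlds where x≠0, u′ always lies in (0, α], whatever the agent does.

```latex
\begin{aligned}
\mathbb{E}[u' \mid \text{assume } x = 0] &= q\,U_0 + (1-q)\,a_1, & 0 < a_1 \le \alpha, \\
\mathbb{E}[u' \mid \text{compute } f^{-1}(y)] &= q\,(U_0 - c) + (1-q)\,a_2, & 0 < a_2 \le \alpha, \\
\text{difference} &= q\,c + (1-q)(a_1 - a_2) \;>\; q\,c - (1-q)\,\alpha .
\end{aligned}
```

Under these assumptions, not computing is preferred whenever α ≤ qc/(1−q) = c/N, i.e. roughly α ≤ 10^{-100}·c. This is only a sufficient condition from a coarse bound, not a claim about what the agent's actual threshold would be.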

When the agent assumes x=0, it is indifferent to the off-switch being pressed, because it assumes the off-switch is defective (even though the probability of it being defective is only $10^{-100}$).

Note: When we don’t set α to be small enough, the agent might calculate $f^{-1}(y)$ (or reconstruct x in some other way). In such a case, the agent gets hold of x and will (with probability of almost 1) discover it’s not 0, and then we’ll observe the agent doing nothing (i.e. a sequence of null actions). In such a scenario we should simply decrease α and try again.
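
The "decrease α and try again" procedure can be sketched as a loop. Everything here is an assumption made for illustration: the agent is modelled by a hypothetical rule that inverts f exactly when α > c/N (a threshold suggested by a simple expected-utility comparison, not stated in the post), c is a made-up cost of the inversion in utility units, and α shrinks tenfold per attempt.

```python
from fractions import Fraction

N = 10**100  # size of the sampling range in the post

def agent_computes_x(alpha: Fraction, c: Fraction) -> bool:
    """Hypothetical decision rule: the agent inverts f only if alpha exceeds c / N."""
    return alpha > c / N

def tune_alpha(alpha: Fraction, c: Fraction, shrink: int = 10, max_tries: int = 1000):
    """Shrink alpha until the modelled agent stops inverting f; return (alpha, tries)."""
    for tries in range(max_tries):
        if not agent_computes_x(alpha, c):
            return alpha, tries
        alpha /= shrink   # "decrease alpha and try again"
    raise RuntimeError("alpha did not become small enough")

final_alpha, tries = tune_alpha(Fraction(1), Fraction(1))
```

Exact rationals are used because values near 10^{-100} are awkward to compare reliably in floating point. In practice the signal for shrinking α would be the observed run of null actions rather than a known decision rule.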

Flaws / Necessary Assumptions

ETA: This solution is not reflectively stable—creating subagents that maximize u is allowed and doesn’t directly hurt the agent (thanks to Stuart Armstrong for pointing this out).

(1) Other than pressing the off-switch, the programmers must not have any way to terminate the agent or interfere with its plan. Otherwise, even when the agent assumes the off-switch is defective, it might attempt to deceive/neutralize the programmers.

(2) The agent’s plan might involve destroying the mechanism of the off-switch or disabling the programmers in some way, in order to use the atoms/energy for something else. In that case the programmers must be able to realize this and press the off-switch button in time.

(3) If x=0 (i.e. the off-switch is defective) it must be sufficiently unlikely that the mechanism of the off-switch would cause the termination of the agent (e.g. due to a hardware failure or due to the programmers being able to somehow “repair” the defective off-switch). Otherwise, the agent would be willing to spend the resources needed for destroying the mechanism of the off-switch.

A possible way to tackle this problem is to slightly modify u such that the agent would want to reach some goal and then terminate itself ASAP. The modification should prevent the agent from accepting a delay of even a single time step for the purpose of eliminating a tiny risk of termination.