One weird trick to turn maximisers into minimisers

A putative new idea for AI control; index here.

A simple and easy design for a $u$ -maximising agent that turns into a $u$ -minimising one.

Let $X$ be some boolean random variable outside the agent’s control, that will be determined at some future time $t$ (based on a cosmic event, maybe?). Set it up so that $P (X = 1) = ϵ$ , and for a given utility $u$ consider the utility:

$u^{#} = (2 / ϵ) X u - u$ .

Before $t$ , the expected value of $(2 / ϵ) X$ is $2$ , so $u^{#} = u$ . Hence the agent is a $u$ -maximiser. After $t$ , the most likely option is $X = 0$ , hence a little bit of evidence to that effect is enough to make $u^{#}$ into a $u$ -minimiser.

This isn’t perfect corrigibility—the agent would be willing to sacrifice a bit of $u$ -value (before $t$ ) in order to maintain its flexibility after $t$ . To combat this effect, we could instead use:

$u^{#} = Ω (2 / ϵ) X u - u$ .

If $Ω$ is large, then the agent is willing to pay very little $u$ -value to maintain flexibility. However, the amount of evidence of $X = 0$ that it needs to become a $u$ -minimiser is equally proportional to $Ω$ , so $X$ better be a clear and convincing event.