One weird trick to turn maximisers into minimisers

A putative new idea for AI control; index here.

A simple and easy design for a u-maximising agent that turns into a u-minimising one.

Let X be some boolean random variable outside the agent’s control, which will be determined at some future time t (based on a cosmic event, maybe?). Set it up so that P(X=1)=ε, and for a given utility function u, consider the utility:

  • u# = (2/ε)Xu − u.

Before t, the expected value of (2/ε)X is 2, so in expectation u# = 2u − u = u. Hence the agent is a u-maximiser. After t, the most likely outcome is X=0, and the coefficient on u turns negative as soon as the agent’s credence in X=1 drops below ε/2; so even a little bit of evidence to that effect is enough to make u# into a u-minimiser.
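To make the sign flip concrete, here is a minimal Python sketch (not from the original post) of the expected value of u# as a function of the agent’s credence p = P(X=1); the value of ε and the function name are illustrative choices.

```python
# Minimal sketch (illustrative, not from the post): expected value of
# u# = (2/eps) * X * u - u as a function of the agent's credence p = P(X = 1).

EPS = 0.01  # designer-chosen prior P(X = 1); the value is an arbitrary example


def expected_u_sharp(u: float, p: float, eps: float = EPS) -> float:
    """E[u#] = (2/eps) * p * u - u, since E[X] = p."""
    return (2.0 / eps) * p * u - u


u = 1.0  # one unit of u-value

# Before t the agent's credence is the prior, p = eps, so E[u#] = 2u - u = u:
print(expected_u_sharp(u, p=EPS))        # ~1.0  (positive: u-maximiser)

# The coefficient on u flips sign as soon as p falls below eps/2,
# i.e. after only a factor-of-two update towards X = 0:
print(expected_u_sharp(u, p=EPS / 2))    # ~0.0  (indifferent)
print(expected_u_sharp(u, p=EPS / 4))    # ~-0.5 (u-minimiser)
print(expected_u_sharp(u, p=0.0))        # -1.0  (pure u-minimiser)
```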

This isn’t perfect corrigibility—the agent would be willing to sacrifice a bit of u-value (before t) in order to maintain its flexibility after t. To combat this effect, we could instead use:

  • u# = Ω(2/ε)Xu − u.

If Ω is large, then the agent is willing to pay very little u-value to maintain flexibility. However, the amount of evidence for X=0 that it needs before it becomes a u-minimiser is likewise proportional to Ω, so X had better be a clear and convincing event.
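The same kind of sketch shows the tradeoff for the Ω-scaled variant: the pre-t value of u grows with Ω, so buying post-t flexibility gets relatively more expensive, but the credence in X=1 now has to fall below ε/(2Ω) before the agent flips. Again, Ω, ε and the names below are illustrative, not from the post.

```python
# Same sketch for the Omega-scaled variant u# = Omega * (2/eps) * X * u - u.
# Omega, eps, and all names here are illustrative choices, not from the post.

EPS = 0.01       # designer-chosen prior P(X = 1)
OMEGA = 1000.0   # large scaling factor on the X-dependent term


def expected_u_sharp_scaled(u: float, p: float,
                            eps: float = EPS, omega: float = OMEGA) -> float:
    """E[u#] = omega * (2/eps) * p * u - u, since E[X] = p."""
    return omega * (2.0 / eps) * p * u - u


u = 1.0

# Before t (p = eps) the agent values u at a factor of 2*Omega - 1, so trading
# away u-value now for post-t flexibility (worth at most ~u) is a bad deal:
print(expected_u_sharp_scaled(u, p=EPS))            # ~1999.0

# But the sign only flips once p falls below eps / (2*Omega), i.e. after a
# 2*Omega-fold update towards X = 0:
threshold = EPS / (2 * OMEGA)
print(threshold)                                    # ~5e-06
print(expected_u_sharp_scaled(u, p=threshold))      # ~0.0  (indifferent)
print(expected_u_sharp_scaled(u, p=threshold / 2))  # ~-0.5 (u-minimiser)
```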