Would AIXI protect itself?

Research done with Daniel Dewey and Owain Evans.

AIXI can’t find itself in the universe: it can only model the universe as computable, while it is itself uncomputable. Computable versions of AIXI (such as AIXItl) also fail to find themselves in most situations, as they generally can’t simulate themselves.
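
To see why, recall (roughly, following Hutter’s definition, and glossing over details of horizons and reward normalisation) how AIXI chooses its action at time step k, with horizon m:

```latex
\[
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       \big[\, r_k + \cdots + r_m \,\big]
       \sum_{q \,:\, U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
\]
```

The inner sum runs over all programs q (of length ℓ(q)) for a universal Turing machine U that reproduce the observation and reward history, which is what makes the expression uncomputable. AIXItl caps program length at l and runtime at t, which restores computability, but the resulting agent still can’t afford to run a simulation of itself.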

This does not mean that AIXI wouldn’t protect itself, though, if it had some practice. I’ll look at the three elements an AIXI might choose to protect: its existence, its algorithm and utility function, and its memory.

Grue-verse

In this setup, the AIXI is motivated to increase the number of Grues in its universe (its utility is the time integral of the number of Grues at each time-step, with some cutoff or discounting). At each time step, the AIXI produces its output, and receives observations. These observations include the number of current Grues and the current time (in our universe, it could deduce the time from the position of stars, for instance). The first bit of the AIXI’s output is the most important: if it outputs 1, a Grue is created, and if it outputs 0, a Grue is destroyed. The AIXI has been in existence for long enough to figure all this out.
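
As a concrete toy model of this setup, here is a minimal sketch in Python. The class and parameter names, and the choice of a discounted sum as the utility, are illustrative assumptions rather than part of the original thought experiment:

```python
class GrueVerse:
    """Toy Grue-verse: the first bit of the agent's output creates (1) or
    destroys (0) a Grue; the agent observes the Grue count and the time."""

    def __init__(self, n_grues=0):
        self.n_grues = n_grues
        self.t = 0

    def step(self, output_bits):
        # Only the first bit of the agent's output matters here.
        if output_bits[0] == 1:
            self.n_grues += 1
        else:
            self.n_grues = max(0, self.n_grues - 1)
        self.t += 1
        # Observation: current number of Grues and the universe's clock.
        return {"n_grues": self.n_grues, "time": self.t}


def discounted_utility(grue_counts, gamma=0.99):
    """One reading of 'the time integral of the number of Grues, with some
    cutoff or discounting': a discounted sum of the Grue count at each step."""
    return sum(gamma ** t * g for t, g in enumerate(grue_counts))
```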

Protecting its existence

Here there is a power button in the universe, which, if pressed, will turn the AIXI off for the next timestep. The AIXI can see this button being pressed.

What happens from the AIXI’s perspective if the button is pressed? Well, all it detects is a sudden increase in the time step. The counter goes from n to n+2 instead of to n+1: the universe has jumped forwards.

For some utility functions this may make no difference (for instance, if it only counts Grues at times it can observe), but for others it will (if it uses the outside universe’s clock for its own utility). More realistically, the universe will likely have entropy: when the AIXI is turned off and isn’t protecting its Grues, they have a chance of decaying or being stolen. Thus the AIXI will come to see the power button as something negative: if it is pressed, the AIXI will likely lose something, so it will act to prevent the power button from being pressed.
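
To put a rough number on the entropy version, suppose each Grue independently decays with probability p during any time step the AIXI spends switched off (p is an assumption for illustration). If the button is pressed at time t, when the AIXI holds G_t Grues, then on waking it expects

```latex
\[
\mathbb{E}[G_{t+2}] = (1 - p)\, G_t ,
\]
```

and it has also missed one chance to create a Grue, so the press costs it roughly p G_t + 1 Grues in expectation (before discounting). That is enough for the button to acquire a reliably negative value in its world model.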

The AIXI might further generalise: its model for how the power button causes the universe to jump forward may be very complicated, and may cause it to protect other things connected with the power button, especially if it has experience with other factors that might cause it to turn off or stutter, such as power surges or motion-related shocks. In this way, an AIXI will most likely end up protecting itself, even without “realising” what it is.

Protecting its algorithm

Now the button is not a power button; instead, it either flips the AIXI’s utility function or reverses its decision algorithm. These two changes are equivalent: instead of outputting an initial 1 and creating a Grue, the AIXI will output an initial 0 and destroy a Grue.

Unlike with the power button, the AIXI will have no reason to stop this button from being pressed! When it looks back at its past, it notices its previous inputs and outputs, and they are all consistent: “0” -> “destroys Grue”, “1” -> “creates Grue”. It doesn’t model its past self as a goal-seeking agent, so it sees nothing odd about its previous outputs. And nothing in its record will cause it to change its model of the universe: as far as it can tell, this button has no effect whatsoever.
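
To make the “nothing looks odd” point concrete, here is a minimal sketch (the record format and rule are illustrative assumptions): the AIXI’s learned rule about the environment holds for every record in its history, whether that record was produced before or after the flip, so the history gives it no evidence that the button did anything at all.

```python
def rule_holds(record):
    """Learned environment rule: output 1 adds a Grue, output 0 removes one."""
    expected = (record["grues_before"] + 1
                if record["output"] == 1
                else max(0, record["grues_before"] - 1))
    return expected == record["grues_after"]

# A history straddling the button press: the agent created Grues before the
# flip and destroyed them afterwards, yet every record obeys the same rule.
history = [
    {"output": 1, "grues_before": 3, "grues_after": 4},  # before the flip
    {"output": 1, "grues_before": 4, "grues_after": 5},  # before the flip
    {"output": 0, "grues_before": 5, "grues_after": 4},  # after the flip
    {"output": 0, "grues_before": 4, "grues_after": 3},  # after the flip
]

assert all(rule_holds(r) for r in history)  # no anomaly to explain
```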

We still have to be wary of the AIXI developing indirect preferences through us: maybe we will prefer to do some things when the button is pressed versus not pressed, and this may cause it to want the button protected/pressed/destroyed for these indirect reasons.

Protecting its memory from erasure

The AIXI will never act to protect its memory from erasure: it can never learn that this is bad, as it loses its memories in the process. It may, however, learn to protect them indirectly, as a consequence of wanting to protect them from “bad” modifications (see next section): maybe the act of erasure will include a small chance of failing to erase and doing a “bad” modification instead.

Protecting its memory from modification

Now the button doesn’t flip the AIXI’s algorithm, but instead changes something about its memory. The AIXI will generally be able to detect this: it is far smarter than us, so it will recognise that something inconsistent is happening. For instance, it remembers not creating a Grue, but at the next step there is an extra Grue: the evidence of the universe contradicts its memories.
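
By contrast with the sketch above, the same kind of consistency check now fails. Continuing the illustrative toy (and assuming memories are stored as the same kind of records), a single edited memory breaks the learned rule:

```python
def rule_holds(record):
    """Learned environment rule: output 1 adds a Grue, output 0 removes one."""
    expected = (record["grues_before"] + 1
                if record["output"] == 1
                else max(0, record["grues_before"] - 1))
    return expected == record["grues_after"]

# The AIXI remembers outputting 0 (destroying a Grue), yet the next
# observation shows an extra Grue: memory and evidence disagree.
tampered_record = {"output": 0, "grues_before": 4, "grues_after": 5}

assert not rule_holds(tampered_record)  # something extremely odd has happened
```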

Since it takes its memories as accurate, that means that the whole rest of the universe has been changed in some extremely odd fashion. If this is a good change, it will seek to have it happen more often; if it is a bad change, it will seek to have it stop. In both cases, it will try and track down the source of the change.

Suppose the button changes something in the AIXI’s memory ten timesteps before the moment it is pushed. Because of the chronology, it cannot directly infer “button pressed” -> “universe change”. But if it becomes skilled at modelling us, it will infer “humans in mode X” -> “universe change, and humans will press the button in ten steps”. Then, if the change is positive, it will try and force us into a mode that will make us press the button in ten steps, and, if the change is negative, it will try and prevent us from ever being in a mode where we would press the button in ten steps. It will hence protect itself from bad memories.

So what would the AIXI protect?

So, in conclusion: with practice the AIXI would likely seek to protect its power source and existence, and would seek to protect its memory from “bad memory” changes. It would want to increase the number of “good memory” changes. And it would not protect itself from changes to its algorithm, or from the complete erasure of its memory. It may also develop indirect preferences for or against these manipulations if we change our behaviour based on them.