AI ontology crises: an informal typology

(with thanks to Owain Evans)

An ontological crisis happens when an agent’s underlying model of reality changes, such as a Newtonian agent realising it was living in a relativistic world all along. These crises are dangerous if they scramble the agent’s preferences: in the example above, an agent dedicated to maximising pleasure over time could behave completely differently once it transitions to relativistic time. Depending on how the transition goes, it may react by accelerating happy humans to near light speed, or inversely banning them from moving, or doing something considerably weirder.

Peter de Blanc has a sensible approach to minimising the disruption ontological crises can cause to an AI, but this post is concerned with analysing what happens when such approaches fail. How bad could it be? Well, this is AI, so the default is of course: unbelievably, hideously bad (i.e. situation normal). But in what ways exactly?

If the ontological crisis is too severe, the AI may lose the ability to do anything at all, as the world becomes completely incomprehensible to it. This is very unlikely; the ontological crisis was most likely triggered by the AI’s own observations and deductions, so it is improbable that it will lose the plot completely in the transition.

A level below that is when the AI can still understand and predict the world, but the crisis completely scrambles its utility function. Depending on how the scrambling happens, this can be safe: the AI may lose all ability to influence the value of its utility function. If, for instance, the new utility function assigns wildly different values to distinct states in a chaotic system, the AI’s actions become irrelevant. This could happen if worlds with different microstates but the same macrostates end up spread evenly across the utility values: unless the AI is an entropy genie, it cannot influence expected utility through its decisions, and will most likely become catatonic.
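To make the “unable to influence” case concrete, here is a minimal toy sketch (my own illustration, not from the post): utility is assigned to microstates and spread evenly regardless of which macrostate (i.e. which of the AI’s actions) they fall under, so every action ends up with the same expected utility. The action names, the number of microstates and the uniform distribution are all arbitrary assumptions.

```python
# Toy model: a scrambled utility the AI cannot influence.
# Each microstate gets an independent uniform utility, with no dependence on
# the macrostate the AI chooses, so expected utility is the same for every action.
import random

random.seed(0)

N_MICROSTATES = 100_000                                  # microstates sampled per action
ACTIONS = ["heat_room", "cool_room", "do_nothing"]       # hypothetical macrostate choices

def scrambled_utility(macrostate: str, microstate: int) -> float:
    # The value ignores the macrostate entirely: utility is spread evenly
    # across microstates, whichever action produced them.
    return random.random()

for action in ACTIONS:
    expected = sum(scrambled_utility(action, m) for m in range(N_MICROSTATES)) / N_MICROSTATES
    print(f"{action:>10}: expected utility ~ {expected:.4f}")

# All three expectations come out near 0.5: since the AI picks macrostates,
# not microstates, no decision changes its expected utility, and a pure
# expected-utility maximiser has nothing left to optimise.
```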

More likely, however, is that the utility function is scrambled into something alien, but still AI-influenceable. Then the AI will most likely still have the convergent instrumental goals of gathering power and influence, and pretending to be nice, before taking over when needed. The only saving grace is that its utility function is so bizarre that we may be able to detect the change in some way.

The most dangerous possibility is if the AI’s new utility function resembles the old one, plus a lot of noise (noise from our perspective; from the AI’s point of view, it all makes perfect sense). Human values are complex, so this would be the usual unfriendly AI scenario, but with the added problem that the change is hard for us to notice.

A step below this is when the AI’s new utility function resembles the old one, plus a little bit of noise. Human values remain complex, so this is still most likely a UFAI, but safety precautions built into its utility function (such as AI utility indifference, value learning, or similar ideas) may not be completely neutered.
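To see why the amount of noise matters, here is a minimal sketch (my own toy model, not from the post): the AI picks whichever option maximises its new, noisy utility, and we check how well that option scores under the old utility. The number of options, the Gaussian noise and the noise scales are all arbitrary assumptions.

```python
# Toy model: "old utility plus noise". The AI optimises the noisy utility;
# we measure what that choice is worth under the old utility.
import random

random.seed(1)

N_OPTIONS = 10_000                                        # candidate world-states
old_utility = [random.gauss(0.0, 1.0) for _ in range(N_OPTIONS)]

def value_of_ai_choice(noise_scale: float) -> float:
    """Old-utility score of the option maximising old utility + Gaussian noise."""
    noisy = [u + random.gauss(0.0, noise_scale) for u in old_utility]
    best = max(range(N_OPTIONS), key=lambda i: noisy[i])
    return old_utility[best]

best_possible = max(old_utility)
for scale in (0.0, 0.1, 1.0, 10.0):
    print(f"noise scale {scale:>4}: old utility achieved {value_of_ai_choice(scale):+.2f} "
          f"(best possible {best_possible:+.2f})")

# With a little noise, the AI's choice still scores well on the old utility;
# with a lot of noise, its choice is barely better than random, even though
# it looks optimal from the AI's own (new) perspective.
```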

In summary:

Type of crisis | Notes | Danger
World incomprehensible to the AI | Very unlikely | None
Utility completely scrambled, AI unable to influence it | Uncertain how likely this is | Low
Utility scrambled, AI able to influence it | We may be able to detect the change | Very high
Lots of noise added to utility | Difficult to detect the change | Maximal
Some noise added to utility | Small chance of not being so bad; some precautions may remain useful | High