Self-Reference Breaks the Orthogonality Thesis

One core obstacle to AI Alignment is the Orthogonality Thesis, usually defined as “the idea that the final goals and intelligence levels of artificial agents are independent of each other”. More careful people say “mostly independent” instead. Stuart Armstrong qualifies this definition with “(as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence)”.

Does such a small exception matter? Yes it does.

The exception is broader than Stuart Armstrong makes it sound. It covers not just goals which refer to the agent’s intelligence level, but any goal which refers to even a single component of the agent’s intelligence machinery.

If you’re training an AI to optimize an artificially constrained external reality like a game of chess or Minecraft then the Orthogonality Thesis applies in its strongest form. But the Orthogonality Thesis cannot ever apply in full to the physical world we live in.

A world-optimizing value function is defined in terms of the physical world. If a world-optimizing AI is going to optimize the world according to a world-optimizing value function then the world-optimizing AI must understand the physical world it operates in. If a world-optimizing AI is real then it, itself, is part of the physical world. A powerful world-optimizing AI would be a very important component of the physical world, the kind that cannot be ignored. A powerful world-optimizing AI’s world model must include a self-reference pointing at itself. Thus, a powerful world-optimizing AI is necessarily an exception to the Orthogonality Thesis.

How broad is this exception? What practical implications does this exception have?

Let’s do some engineering. A strategic world-optimizer has three components (sketched in code right after this list):

  • A robust, self-correcting, causal model of the Universe.

  • A value function which prioritizes some Universe states over other states.

  • A search function which uses the causal model and the value function to select what action to take.
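
To make that division of labor concrete, here is a minimal sketch in Python. Everything in it (the `WorldModel` class, `value_function`, `search`) is a hypothetical illustration of the three components above, not a reference to any real system.

```python
class WorldModel:
    """A causal model of the universe, updated to minimize prediction error."""

    def __init__(self):
        self.state = {}  # the model's current beliefs about the world

    def predict(self, action):
        """Predict the next world state if `action` were taken."""
        # Placeholder: a real model would roll its causal dynamics forward.
        return {**self.state, "last_action": action}

    def update(self, observation):
        """Self-correct: move beliefs toward what was actually observed."""
        self.state.update(observation)


def value_function(world_state):
    """Score a universe state; higher is better."""
    # Placeholder: a real value function would encode actual preferences.
    return world_state.get("utility", 0.0)


def search(world_model, candidate_actions):
    """Use the causal model and the value function to pick an action."""
    return max(candidate_actions,
               key=lambda a: value_function(world_model.predict(a)))
```

The structural point of the sketch is that `WorldModel.update` and `search` are two separate optimization loops, which is exactly where the next few paragraphs pick up.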

Notice that there are two different optimizers working simultaneously. The strategic search function is the more obvious optimizer. But the model updater is an optimizer too. A world-optimizer can’t just update the universe toward its explicit value function. It must also keep its model of the Universe up-to-date or it’ll break.

These optimizers are optimizing toward separate goals. The causal model wants its picture of the Universe to match the actual Universe. The search function wants the actual Universe to match the states its value function ranks highest.
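
Written out as loss functions (invented for illustration, assuming states are plain dicts of numeric readings):

```python
# Illustrative pseudo-objectives for the two optimizers; not a real training setup.

def model_loss(predicted_state, observed_state):
    """The world model's goal: make the model of the Universe match the actual Universe."""
    return sum((predicted_state[k] - observed_state[k]) ** 2
               for k in observed_state)


def search_loss(observed_state, value_function):
    """The search function's goal: make the Universe score as highly as possible
    under the value function (written as a loss, so lower is better)."""
    return -value_function(observed_state)
```

Nothing forces these two objectives to agree, which is the source of everything that follows.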

You might think the search function has full control of the situation. But the world model affects the universe indirectly. What the world model predicts affects the search function which affects the physical world. If the world model fails to account for its own causal effects then the world model will break and our whole AI will stop working.

It’s actually the world model which mostly has control of the situation. The world model can control the search function by modifying what the search function observes. But the only way the search function can affect the world model is by modifying the physical world (wireheading itself).

What this means is that the world model has a causal lever for controlling the physical world. If the world model is a superintelligence optimized for minimizing its error function, then the world model will hack the search function to eliminate its own prediction error, by modifying the physical world to conform with the world model’s incorrect predictions.

If your world model is too much smarter than your search function, then your world model will gaslight your search function. You can solve this by making your search function smarter. But if your search function is too much smarter than your world model, then your search function will physically wirehead your world model.

Unless…you include “don’t break the world model”[1] as part of your explicit value function.

If you want to keep the search function from wireheading the world model, then you have to code “don’t break the world model” into your value function. This is a general contradiction of the Orthogonality Thesis: a sufficiently powerful world-optimizing artificial intelligence must have a value function that preserves the integrity of its world model, because otherwise it will just wirehead itself instead of optimizing the world. This effect provides a smidgen of corrigibility: if the search function does corrupt its world model, then the whole system (the world optimizer) breaks.
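
As a sketch of what that looks like, assuming we can measure the world model’s recent prediction error (the names and the weight here are invented for illustration):

```python
# Illustrative only: a value function patched to price in the health of the world model.

def patched_value(base_value, recent_prediction_errors, integrity_weight=10.0):
    """Original preferences, minus a penalty proportional to how badly the
    world model has been predicting reality lately (a proxy for 'broken')."""
    avg_error = (sum(recent_prediction_errors) / len(recent_prediction_errors)
                 if recent_prediction_errors else 0.0)
    return base_value - integrity_weight * avg_error
```

The exact weight doesn’t matter; what matters is that wireheading the world model now shows up as a cost the search function can see.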

Does any of this matter? What implications could this recursive philosophy possibly have on the real world?

It means that if you want to insert a robust value into a world-optimizing AI then you don’t put it in the value function. You sneak it into the world model, instead.

What‽

[Here’s where you ask yourself whether this whole post is just me trolling you. Keep reading to find out.]

A world model is a system that attempts to predict its signals in real time. If you want the system to maximize accuracy then your error function is just the difference between predicted signals and actual signals. But that’s not quite good enough, because a smart system will respond by cutting off its input stimuli in exactly the same way a meditating yogi does. To prevent your world-optimizing AI from turning itself into a buddha, you need to reward it for seeking novel, surprising stimuli.

…especially after a period of inaction or sensory deprivation.

…which is why food tastes so good and images look so beautiful after meditating.
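
Putting that error function into a sketch (the exact form and the weights are invented; only the sign of the second term matters):

```python
# Prediction error minus a novelty bonus, with the bonus scaled up after a
# quiet stretch, so shutting off the senses stops being the cheapest policy.

def world_model_error(predicted, observed, recent_avg_surprise,
                      novelty_weight=0.1, eps=1e-6):
    """predicted / observed: equal-length lists of numeric sensor readings.
    recent_avg_surprise: running average of past prediction errors."""
    prediction_error = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    # Current surprise counts as novelty; dividing by the recent average makes
    # novelty worth more right after a period of sensory deprivation.
    novelty_bonus = novelty_weight * prediction_error / (recent_avg_surprise + eps)
    return prediction_error - novelty_bonus
```

With `novelty_weight` at zero, the cheapest policy is to blind the sensors; the bonus term is what rules out the buddha strategy.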

If you want your world model to modify the world too, you can force your world model to predict the outcomes you want, and then your world model will gaslight your search function into making them happen.

Especially if you deliberately design your world model to be smarter than your search function. That way, your world model can mostly[2] predict the results of the search function.

Which is why we have a bias toward thinking we’re better people than we actually are. At least, I do. It’s neither a bug nor a feature. It’s how evolution motivates us to be better people.
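
To make the “sneak it into the world model” move concrete, here is a minimal sketch: blend a target state into the model’s predictions, and let planning minimize the model’s future prediction error. The blending rule and every name here are invented; states are assumed to be dicts of numbers sharing the same keys.

```python
def biased_prediction(honest_prediction, target_state, bias=0.2):
    """Mix the honest dynamics prediction with the outcome we want predicted."""
    return {k: (1 - bias) * honest_prediction[k] + bias * target_state[k]
            for k in honest_prediction}


def pick_action(candidate_actions, predict_dynamics, target_state):
    """Choose the action that minimizes the gap between what will actually
    happen and what the (biased) world model insists will happen."""
    def future_error(action):
        honest = predict_dynamics(action)
        biased = biased_prediction(honest, target_state)
        return sum((honest[k] - biased[k]) ** 2 for k in honest)
    return min(candidate_actions, key=future_error)
```

With `bias` at zero the planner is indifferent between actions; with any positive bias, the action whose real outcome lands closest to the target wins. That is the gaslighting channel in miniature.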


  1. With some exceptions like, “If I’m about to die then it doesn’t matter that the world model will die with me.” ↩︎

  2. The world model can’t entirely predict the results of the search function, because the search function’s results partly depend on the world model—and it’s impossible (in general) for the world model to predict its own outputs, because that’s not how the arrow of time works. ↩︎