We cannot directly choose an AGI’s utility function

This is my first post on this forum. I am going to address a confusion that I believe some people have, one that may keep beginners from contributing helpful ideas to the AI safety debate.

Namely, the confusion is the idea that we can choose an AGI’s utility function and that, as a consequence, the biggest problem in AI alignment is figuring out what utility function an AGI should have.

Because I have not seen anyone argue for this explicitly, I’m going to quote some documents that beginners are expected to read, and I’ll argue why someone reading these quotes might become confused as a result.

Is it easy to give an intelligent machine a goal?

Here is a passage from MIRI’s Four Background Claims:

Regardless of their intelligence level, and regardless of your intentions, computers do exactly what you programmed them to do. If you program an extremely intelligent machine to execute plans that it predicts lead to futures where cancer is cured, then it may be that the shortest path it can find to a cancer-free future entails kidnapping humans for experimentation (and resisting your attempts to alter it, as those would slow it down).

Computers do indeed do exactly what you program them to do, but only in the very narrow sense of executing the exact machine instructions you give them.

There is clearly no set of machine instructions that directly corresponds to executing “plans that the machine predicts will lead to futures where some statement X is true” (where X may be “cancer is cured”, or anything else).

Unless we have fundamental advances in alignment, it will likely not be possible to give orders like that to intelligent machines. I will go even further and question whether, among order-following AGIs, the ones that follow orders literally and dangerously are any easier to build. As far as I am aware, both may be equally difficult to come up with.

If this is so, then cautionary tales like these sound a little off. Yes, they might help us recognize how hard it is to know precisely what we want. And yes, they might serve to impress people with easy-to-imagine examples of AGIs causing a catastrophe.

But they may also enshrine a mindset in which the AI alignment problem is mostly about finding the ‘right’ utility function for an AGI to have (one that takes into account corrigibility, impact measures and so on), when we actually have no evidence that this will help.

Here is a similar example from Nick Bostrom’s Superintelligence:

There is nothing paradoxical about an AI whose sole final goal is to count the grains of sand on Boracay, or to calculate the decimal expansion of pi, or to maximize the total number of paperclips that will exist in its future light cone. In fact, it would be easier to create an AI with simple goals like these than to build one that had a human-like set of values and dispositions.

Again, I understand the claim that it might be easier to create an AGI with one of these “simple” goals, but it is nowhere near obvious to me. How exactly is it easier to create a very intelligent machine that really wants to calculate the decimal expansion of pi?

What would be easy is to create an algorithm that calculates the digits of pi without being intelligent or goal-oriented, or perhaps an optimization process that has the calculation of the decimal expansion of pi as its base objective. But do we get any kind of control over what the resulting mesa-optimizer itself really wants?
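To make the first half of that contrast concrete, here is the easy version of the task: a short program that emits digits of pi one at a time, using Gibbons’ unbounded spigot algorithm. This is only an illustrative sketch; nothing about it is intelligent or goal-directed, it simply executes fixed arithmetic.

```python
def pi_digits():
    """Yield decimal digits of pi one at a time (Gibbons' unbounded spigot algorithm)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n  # the next digit has been settled, so emit it
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

# Print the first ten digits: 3141592653
digits = pi_digits()
print("".join(str(next(digits)) for _ in range(10)))
```

Building something like this is trivial. Building a very intelligent agent that genuinely wants the digits of pi is a different problem entirely.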

For example, in reinforcement learning we can assign a “goal” to an agent in a simulated environment. This goal, together with a learning rule, acts as an optimizer that searches for agents that achieve it. However, the agents themselves often either remain dumb (they just follow fixed policies) or end up optimizing proxies of the goals we set.
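As an illustration, here is a minimal tabular Q-learning sketch on a toy corridor environment (the environment, reward values and hyperparameters are made up for the example). The reward function is the base objective we write down; the learning rule then searches over policies, and the policy we end up with is whatever that search happens to produce, not something we specify directly.

```python
import random

# Toy corridor: the agent starts at cell 0 and is rewarded only for reaching the last cell.
N_STATES = 5
ACTIONS = (+1, -1)            # step right or step left along the corridor
GOAL = N_STATES - 1

def reward(state):
    """The base objective: the quantity we, the designers, choose to reinforce."""
    return 1.0 if state == GOAL else 0.0

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate

for _ in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection.
        a = (random.choice(ACTIONS) if random.random() < epsilon
             else max(ACTIONS, key=lambda act: q[(s, act)]))
        s_next = min(max(s + a, 0), N_STATES - 1)
        # Q-learning update: the learning rule that does the actual optimizing.
        target = reward(s_next) + gamma * max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])
        s = s_next

# The learned greedy policy: whatever the search produced, not something we chose directly.
print({s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES - 1)})
```

Even in this tiny example, what we control is the reward signal and the training procedure; the resulting behavior is an indirect consequence of both.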

Designing an agent with a specific utility function can be even harder if efficient training requires some sort of intrinsic motivation system to enable deeper exploration. Such systems provide extra reward information, which may be especially useful when the external signals are too sparse to allow efficient learning on their own. These mechanisms may facilitate learning in the desired direction, but at the cost of creating goals that are independent of the external reinforcement, and thus possibly unaligned with the base objective.
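As a hedged illustration of what such a mechanism might look like, here is a count-based novelty bonus, one common intrinsic-motivation scheme (the function name and bonus scale are made up for the example). The quantity actually being optimized during training is the sum of the two terms, so the effective training objective already differs from the external reward we specified.

```python
from collections import Counter

visit_counts = Counter()

def shaped_reward(state, extrinsic_reward, bonus_scale=0.1):
    """External reward plus a count-based novelty bonus (illustrative intrinsic motivation)."""
    visit_counts[state] += 1
    novelty_bonus = bonus_scale / (visit_counts[state] ** 0.5)  # decays as the state becomes familiar
    return extrinsic_reward + novelty_bonus
```

An agent trained on shaped_reward is reinforced for visiting novel states whether or not doing so serves the external goal.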

Where will an AGI’s utility function come from?

Most people think intuitively that an AGI will be fully rational, without biases such as the ones we ourselves have.

According to the Von Neumann-Morgenstern utility theorem, any decision procedure satisfying a few weak rationality axioms (completeness, transitivity, continuity and independence) is equivalent to maximizing the expected value of some utility function. This result implies that goal-oriented superintelligent machines will likely maximize utility functions. But if we, the AGI’s creators, cannot easily decide what the AGI will want, where will this utility function come from?
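Stated compactly: if an agent’s preference relation $\succeq$ over lotteries satisfies those axioms, then there exists a utility function $u$, unique up to positive affine transformation, such that

$$L_1 \succeq L_2 \iff \mathbb{E}_{L_1}[u(x)] \ge \mathbb{E}_{L_2}[u(x)].$$

Note that this is an existence result about an agent whose preferences are already coherent; it says nothing about how we, from the outside, would get to write that $u$ down.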

I argued here that AGIs will actually be created with several biases, and that they will only gradually remove them (become rational) after several cycles of self-modification.

If this is true, then AGIs will likely be created initially with no consistent utility function at all. Rather, they might have a set of conflicting goals and desires that will only eventually become crystallized or enshrined in a utility function.

We might therefore focus on which conflicting goals and desires the initial version of a friendly AGI is likely to have, and on how to make sure that properties important to our well-being are preserved when such a system modifies itself. We might be able to do this without going as far as trying to predict what a friendly utility function would be.