Inner alignment requires making assumptions about human values

Many approaches to AI alignment require making assumptions about what humans want. On a first pass, it might appear that inner alignment is a sub-component of AI alignment that doesn’t require making these assumptions. This is because if we define the problem of inner alignment to be the problem of how to train an AI to be aligned with arbitrary reward functions, then a solution would presumably have no dependence on any particular reward function. We could imagine an alien civilization solving the same problem, despite using very different reward functions to train their AIs.

Unfortunately, the above argument fails because aligning an AI with our values requires giving the AI extra information that is not encoded directly in the reward function (under reasonable assumptions). The argument for my thesis is subtle, and so I will break it into pieces.

First, I will more fully elaborate what I mean by inner alignment. Then I will argue that the definition implies that we can’t come up with a full solution without some dependence on human values. Finally, I will provide an example, in order to make this discussion less abstract.

Characterizing inner alignment

In the last few posts I wrote (1, 2), I attempted to frame the problem of inner alignment in a way that wasn’t too theory-laden. My concern was that the previous characterization depended on a particular outcome, one in which an AI uses an explicit outer loop to evaluate strategies based on an explicit internal search.

In the absence of an explicit internal objective function, it is difficult to formally define whether an agent is “aligned” with the reward function that is used to train it. We might therefore define alignment as the ability of our agent to perform well on the test distribution. However, if the test set is sampled from the same distribution as the training data, this definition reduces to the ordinary notion of model performance in standard machine learning, and we haven’t actually defined the problem in a way that adds clarity.

What we really care about is whether our agent performs well on a test distribution that doesn’t match the training environment. In particular, we care about the agent’s performance during real-world deployment. We can estimate this real-world performance ahead of time by giving the agent a test distribution that was artificially selected to emphasize important aspects of the real world more closely than the training distribution (e.g. by using relaxed adversarial training).

To distinguish the typical robustness problem from inner alignment, we evaluate the agent on this testing distribution by observing its behaviors and evaluating it very negatively if it does something catastrophic (defined as something so bad we’d prefer it to fail completely). This information is used to iterate on future versions of the agent. An inner aligned agent is therefore defined as an agent that avoids catastrophes during testing.
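As a rough illustration of this definition, the check can be sketched in a few lines of Python. Everything named here (`avoids_catastrophe`, the toy policies, the state and action strings) is hypothetical and chosen only for illustration. The point to notice is that the predicate `is_catastrophic` has to be supplied by the evaluator; it is not derived from the reward function.

```python
from typing import Callable, Iterable

State = str
Action = str


def avoids_catastrophe(
    policy: Callable[[State], Action],
    test_states: Iterable[State],
    is_catastrophic: Callable[[State, Action], bool],
) -> bool:
    """Return True if the policy never takes a catastrophic action on the test states."""
    return not any(is_catastrophic(s, policy(s)) for s in test_states)


# Hypothetical toy setup: these states, actions, and judgements are assumptions.
def cautious_policy(state: State) -> Action:
    return "land" if state == "pad_visible" else "hover"


def reckless_policy(state: State) -> Action:
    return "land"


def toy_is_catastrophic(state: State, action: Action) -> bool:
    # The evaluator's own judgement: landing in a crater is worse than failing outright.
    return state == "over_crater" and action == "land"


test_states = ["pad_visible", "over_crater", "low_fuel"]
cautious_ok = avoids_catastrophe(cautious_policy, test_states, toy_is_catastrophic)
reckless_ok = avoids_catastrophe(reckless_policy, test_states, toy_is_catastrophic)
```

In this toy setup the cautious policy passes the check and the reckless one does not, but only relative to the particular `is_catastrophic` predicate we chose to write down.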

The reward function doesn’t provide enough information

Since reward functions are defined as mappings from state-action pairs to real numbers, our agent doesn’t actually have enough information from the reward function alone to infer what good performance means on the test. This is because the test distribution contains states that were not available in the training distribution.

Therefore, no matter how much the agent learns about the true reward function during training, it must perform some implicit extrapolation of the reward function to what we intended, in order to perform well on the test we gave it.

We can visualize this extrapolation as if we were asking a supervised learner what it predicts for inputs beyond the range it was provided in its training set. It will be forced to make some guesses for what rule determines what the function looks like outside of its normal range.
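A minimal sketch of this point, using made-up rules rather than any real learner: the two functions below (`rule_a` and `rule_b`, both hypothetical) agree on every training input and still disagree sharply on an input outside the training range.

```python
# Training inputs cover only a narrow range.
train_xs = [0, 1, 2, 3]


def rule_a(x: int) -> int:
    # A linear rule.
    return x


def rule_b(x: int) -> int:
    # Same as rule_a plus a term that vanishes at every training input.
    return x + x * (x - 1) * (x - 2) * (x - 3)


# Labels are generated by rule_a, but rule_b reproduces them exactly as well.
train_ys = [rule_a(x) for x in train_xs]


def fits_training_data(rule) -> bool:
    return all(rule(x) == y for x, y in zip(train_xs, train_ys))


both_fit = fits_training_data(rule_a) and fits_training_data(rule_b)

# Outside the training range, the two rules come apart.
x_new = 5
prediction_a = rule_a(x_new)
prediction_b = rule_b(x_new)
```

Both rules fit the training data perfectly, so the data alone cannot say which prediction at `x_new` is the intended one. Some further criterion has to pick between them.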

One might assume that we could just use simplicity as the criterion for extrapolation. Perhaps we could just say, formally, the simplest possible reward function that encodes the values observed during training is the “true reward” function that we will use to test the agent. Then the problem of inner alignment reduces to the problem of creating an agent that is able to infer the true reward function from data, and then perform well according to it inside general environments. Framing the problem like this would minimize dependence on human values.

There are a number of problems with that framing, however. To start, there are boring problems associated with using simplicity to extrapolate the reward function, such as the fact that one’s notion of simplicity is language-dependent, often uncomputable, and the universal prior is malign. Beyond these (arguably minor) issues, there’s a deeper issue, which forces us to make assumptions about human values in order to ensure inner alignment.

Since we assumed that the training environment was necessarily different from the testing environment, we cannot possibly provide the agent information about every possible scenario we consider catastrophic during training. Therefore, the metric we were using to judge the success of the agent during testing is not captured in training data alone. We must introduce additional information about what we consider catastrophic. This information comes in the form of our own preferences, as we prefer the agent to fail in some ways but not in others.

It’s also important to note that if we actually did provide the agent with the exact same data during training as it would experience during deployment, this is equivalent to simply letting the agent learn in the real world, and there would be no difference between training and testing. Since we normally assume providing such a perfect environment is either impossible or unsafe, the considerations in that case become quite different.

An example

I worry my discussion was a bit too abstract to be useful, so I’ll provide a specific example to show where my thinking lies. Consider the lunar lander example that I provided in the last post.

To reiterate, we train an agent to land on a landing pad, but during training there is a perfect correlation between whether a landing pad is painted red and whether it is a real landing pad.

Suppose that during deployment the “true” factor that determines whether a patch of ground is a landing pad is whether it is enclosed by flags, and that some faraway crater is painted red. Then the agent might veer off into the crater rather than landing on the landing pad.

Since there is literally not enough information during training to infer what property correctly determines whether a patch of ground is a landing pad, the agent is forced to infer whether it’s the flags or the red painting. It’s not exactly clear what the “simplest” inference is here, but it’s coherent to imagine that “red painting determines whether something is a landing pad” is the simplest inference.

As humans, we might have a preference for the flags being the true determinant, since that resonates more with what we think a landing pad should be, and whether something is painted red is not nearly as compelling to us.

The important point is to notice that our judgement here is determined by our preferences, and not something the agent could have learned during training using some value-neutral inferences. The agent must make further assumptions about human preferences for it to consistently perform well during testing.
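The lunar lander example can be written out as a small toy model. This is a sketch under assumptions of my own: the `Patch` class, the patch names, and the exact feature values are all hypothetical, and I assume that real pads seen in training happen to be both painted red and enclosed by flags, so that the two cues are perfectly correlated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    name: str
    painted_red: bool
    enclosed_by_flags: bool


# Training data: (patch, is_real_landing_pad). Red paint and flags always co-occur.
training = [
    (Patch("pad_1", True, True), True),
    (Patch("pad_2", True, True), True),
    (Patch("rock_field", False, False), False),
    (Patch("plain", False, False), False),
]


def red_hypothesis(p: Patch) -> bool:
    return p.painted_red


def flag_hypothesis(p: Patch) -> bool:
    return p.enclosed_by_flags


def consistent(hypothesis, data) -> bool:
    """True if the hypothesis labels every training patch correctly."""
    return all(hypothesis(p) == label for p, label in data)


both_consistent = consistent(red_hypothesis, training) and consistent(flag_hypothesis, training)

# Deployment: the correlation breaks, because a faraway crater is painted red.
deployment = [
    Patch("real_pad", True, True),
    Patch("red_crater", True, False),
]

red_targets = [p.name for p in deployment if red_hypothesis(p)]
flag_targets = [p.name for p in deployment if flag_hypothesis(p)]
```

Both hypotheses label the training data perfectly, yet under the red hypothesis the crater counts as a valid landing target, while under the flag hypothesis it does not. Nothing in `training` favors one over the other; the choice of flags comes from what we think a landing pad should be.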


  1. You might wonder whether we could define catastrophe in a completely value-independent way, sidestepping this whole issue. This is the approach implicitly assumed by impact measures. However, if we want to avoid all types of situations where we’d prefer the system fail completely, I think this will require a different notion of catastrophe than “something with a large impact.” Furthermore, we would not want to penalize systems for having a large positive impact.