General alignment plus human values, or alignment via human values?

Thanks to Rebecca Gorman for discussions that led to these insights.

How can you get a superintelligent AI aligned with human values? There are two pathways that I often hear discussed. The first sees a general alignment problem (how to get a powerful AI to safely do anything) which, once solved, lets us point the AI towards human values. The second perspective is that we can only get alignment by targeting human values: these values must be aimed at from the start of the process.

I’m of the second perspective, but I think it’s very important to sort this out. So I’ll lay out some of the arguments in its favour, to see what others think of it, and so we can best figure out the approach to prioritise.

More strawberry, less trouble

As an example of the first perspective, I’ll take Eliezer’s AI task, described here:

  • “Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level.” A ‘safely’ aligned powerful AI is one that doesn’t kill everyone on Earth as a side effect of its operation.

If an AI accomplishes this limited task without going crazy, this shows several things:

  1. It is superpowered; the task described is beyond current human capabilities.

  2. It is aligned (or at least alignable), in that it can accomplish the task in the way intended, without wireheading the definitions of “strawberry” or “cellular”.

  3. It is safe, in that it has not dramatically reconfigured the universe to accomplish this one goal.

Then we can add human values to the AI, maybe via “consider what these moral human philosophers would conclude if they thought for a thousand years, and do that”.

I would agree that, in most cases, an AI that accomplished that limited task safely would be aligned. One might quibble that it’s only pretending to be aligned, and preparing a treacherous turn. Or maybe the AI was boxed in some way and accomplished the task with the materials at hand within the box.

So we might call an AI “superpowered and aligned” if it accomplished the strawberry copying task (or a similar one) and if it could dramatically reconfigure the world but chose not to.

Values are needed

I think that an AI cannot be “superpowered and aligned” unless it is also aligned with human values.

The reason is that the AI can and must interact with the world. It has the capability to do so, by assumption: it is not contained or boxed. It must do so because any agent affects the world, through chaotic effects if nothing else. A superintelligence is likely to have an impact on the world simply through its existence being known, and if the AI finds it efficient to interact with the world (e.g. ordering some extra resources), then it will do so.

So the AI can and must have an impact on the world. We want it to not have a large or dangerous impact. But, crucially, “dangerous” and “large” are defined by human values.

Suppose that the AI realises that its actions have slightly unbalanced the Earth in one direction, and that, within a billion years, this will cause significant deviations in the orbits of the planets, deviations it can estimate. Compared with that amount of mass displaced, the impact of killing all humans everywhere is a trivial one indeed. We certainly wouldn’t want it to kill all humans in order to be able to carefully balance out its impact on the orbits of the planets!

There are very “large” impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people’s personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.). If the AI accomplishes its task by building a universal constructor or by unleashing hordes of nanobots that gather resources from the world (without disrupting human civilization), it still has to decide whether to allow humans access to the constructors or nanobots after it has finished copying the strawberry, and which humans to allow this access to.
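
To make this concrete, here is a toy, purely illustrative sketch (in Python, with invented outcomes, magnitudes, and weights) of why an impact measure has to be value-weighted: a value-agnostic measure such as “mass displaced” ranks killing all humans as far less impactful than slowly perturbing the planetary orbits, which is exactly the wrong way round.

```python
# Toy illustration only: the outcomes, magnitudes, and weights are invented.
# "physical" is a value-agnostic magnitude (say, kilograms of mass displaced);
# "value_weight" is how much humans actually care per unit of that magnitude.
outcomes = {
    "perturb planetary orbits over a billion years": {"physical": 1e24, "value_weight": 1e-30},
    "kill all humans":                               {"physical": 4e11, "value_weight": 1e6},
    "order some extra resources":                    {"physical": 1e3,  "value_weight": 1e-3},
}

def value_agnostic_impact(outcome):
    """Impact measured purely by physical magnitude."""
    return outcome["physical"]

def value_weighted_impact(outcome):
    """Impact measured by how much humans actually care about the change."""
    return outcome["physical"] * outcome["value_weight"]

for name, o in outcomes.items():
    print(f"{name:48s} agnostic={value_agnostic_impact(o):.1e}  "
          f"value-weighted={value_weighted_impact(o):.1e}")

# The value-agnostic measure makes "kill all humans" look trivial next to the
# orbital perturbation; only the value-weighted measure gets the ranking right.
```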

So every decision the AI makes is a tradeoff in terms of its impact on the world. Navigating these tradeoffs requires a good understanding of our values. The AI will also need to estimate the value of situations beyond the human training distribution, if only to avoid those situations. Thus a “superpowered and aligned” AI needs to solve the problem of model splintering, and to establish a reasonable extrapolation of human values.

Model splintering sufficient?

The previous sections argue that learning human values (including handling model splintering) is necessary for instantiating an aligned AI; thus the “define alignment first, then add human values” approach will not work.

So, if you give this argument much weight, learning human values is necessary for alignment. I personally feel that it’s also (almost) sufficient: the skill of navigating model splintering, combined with some basic human value information (as given, for example, by the approach here), is enough to get alignment even at high AI power.

Which path to pursue for alignment

It’s important to resolve this argument, as the paths for alignment that the two approaches suggest are different. I’d also like to know if I’m wasting my time on an unnecessary diversion.