ALBA: can you be “aligned” at increased “capacity”?

I think that Paul Christiano’s ALBA proposal is good in practice, but has conceptual problems in principle.

Specifically, I don’t think it makes sense to talk about bootstrapping an “aligned” agent to one that is still “aligned” but that has an increased capacity.


The main reason is that I don’t see “aligned” as a definition that makes sense independently of capacity.

These are not the lands of your forefathers

Here’s a simple example: let $R$ be a reward function that is perfectly aligned with human happiness within ordinary circumstances (and within a few un-ordinary circumstances that humans can think up).
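
To make this concrete, here is a minimal toy sketch (the names and numbers are purely illustrative, not part of ALBA): a reward that agrees with “human happiness” on every ordinary state, and is simply unconstrained outside that region.

```python
# Toy illustration (all numbers hypothetical): a reward R that tracks
# "human happiness" on ordinary states, but is unconstrained on the
# extreme states that ordinary testing never reaches.

def human_happiness(state: float) -> float:
    # Stand-in for what we actually care about: peaks at a moderate
    # state and collapses completely in extreme circumstances.
    return -abs(state - 1.0) if abs(state) <= 10 else -1e9

def reward_R(state: float) -> float:
    # R: matches happiness on ordinary states (|state| <= 10), but keeps
    # growing on the extreme states where it was never checked.
    return -abs(state - 1.0) if abs(state) <= 10 else abs(state)

# Within ordinary circumstances, R and human happiness agree exactly:
ordinary_states = [0.0, 1.0, 5.0, -9.0]
assert all(reward_R(s) == human_happiness(s) for s in ordinary_states)
```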

Then the initial agent - $B_0$, a human - trains a reward $R_1$ for an agent $A_1$. This agent is limited in some way - maybe it doesn’t have much speed or time - but the aim is for $R_1$ to ensure that $A_1$ is aligned with $B_0$.

Then the capacity of $A_1$ is increased to give $B_1$, a slow but powerful agent. It computes the reward $R_2$ to ensure the alignment of $A_2$, and so on.

The nature of the agents is not defined - they might be algorithms calling the $A_i$ or $B_i$ as subroutines, humans may be involved, and so on.
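
As a schematic of the bootstrapping sequence just described (everything here is a stand-in chosen to show the shape of the iteration, not ALBA’s actual interface):

```python
# Schematic of the bootstrapping loop: B_0 trains R_1 for a limited A_1,
# A_1 is amplified into B_1, which trains R_2 for A_2, and so on.
# All classes and functions are illustrative stand-ins.

from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    capacity: int  # crude stand-in for how powerful the agent is

def train_reward(overseer: Agent, step: int) -> str:
    # B_{i-1} produces the reward R_i meant to keep the next agent aligned.
    return f"R_{step} (trained by {overseer.name})"

def train_agent(reward: str, capacity: int, step: int) -> Agent:
    # A_i: an agent, limited in some way, trained against R_i.
    return Agent(name=f"A_{step}", capacity=capacity)

def amplify(agent: Agent) -> Agent:
    # A_i -> B_i: the same agent with its capacity increased
    # (e.g. given far more time or compute, making it slow but powerful).
    return Agent(name=agent.name.replace("A_", "B_"), capacity=agent.capacity * 10)

def bootstrap(n_rounds: int) -> Agent:
    overseer = Agent(name="B_0 (human)", capacity=1)
    for i in range(1, n_rounds + 1):
        reward = train_reward(overseer, i)                  # R_i
        agent = train_agent(reward, overseer.capacity, i)   # A_i (limited)
        overseer = amplify(agent)                           # B_i (slow, powerful)
    return overseer

print(bootstrap(3))  # Agent(name='B_3', capacity=1000): far beyond the human B_0
```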

If the humans are unimaginative and don’t deliberately seek out more extreme and exotic test cases, the best-case scenario is for $R_i \to R$ as $i \to \infty$.
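
One way to write out that best case (with $S_{\text{ord}}$ introduced here, purely for illustration, as the set of ordinary circumstances the humans actually test):

$$
R_i \;\longrightarrow\; R \quad \text{as } i \to \infty, \qquad \text{while } R \approx \text{(human happiness)} \ \text{only on } S_{\text{ord}}.
$$

So even perfect convergence of the trained rewards buys nothing outside the ordinary region.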

And eventually there will be an agent $A_n$ that is powerful enough to overwhelm the whole system and take over. It will do this in full agreement with $B_{n-1}$, because they share the same objective. And then $A_n$ will push the world into extra-ordinary circumstances and proceed to maximise $R$, with likely disastrous results for us humans.
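
Continuing the toy sketch from earlier (again purely illustrative): the same reward $R$, maximised first by a weak search and then by a strong one, produces aligned-looking behaviour in the first case and a disastrous outcome in the second, with no change of objective anywhere.

```python
# The same toy reward R as before; the only difference between the two
# optimisers is how much of the state space they are capable of reaching.

def reward_R(state: float) -> float:
    if abs(state) <= 10:
        return -abs(state - 1.0)   # matches human happiness here
    return abs(state)              # unconstrained, and huge, out here

def best_reachable_state(candidates):
    # A pure R-maximiser: no goal drift, just an argmax over what it can reach.
    return max(candidates, key=reward_R)

ordinary_reach = [i / 10 for i in range(-100, 101)]   # a weak agent's options
extended_reach = ordinary_reach + [1e6, -1e6]         # a powerful agent's options

print(best_reachable_state(ordinary_reach))   # 1.0       -> genuinely good outcome
print(best_reachable_state(extended_reach))   # 1000000.0 -> maximises R, terrible for humans
```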

The nature of the problem

So what went wrong? At what point did the agents go out of alignment?

In one sense, at $A_n$. In another sense, at $A_1$ (and, in another interesting sense, at $B_0$, the human). The reward $R$ was aligned as long as the agent stayed near the bounds of the ordinary. As soon as it was no longer restricted to that, it went out of alignment, not because of goal drift, but because of a capacity increase.