If you’re very optimistic about ELK then you should be optimistic about outer alignment

[The content of this short, nontechnical post is entirely unoriginal – it’s a reframing which I’ve personally found helpful for understanding the thrust of ELK. More specifically, all of the ideas here can be found in the ELK document and its appendices.

I came to this reframing while in conversation with Eric Neyman. Thanks also to Ben Edelman for feedback.]

Historically, the AI alignment problem was first posed along the following lines:

Suppose you have a superintelligent AI which is optimizing some utility function. Unless that utility function is well-aligned with human values, the AI will, in the course of optimizing its own utility function, probably destroy everything we care about. Note that the issue isn’t that the AI doesn’t understand human values – after all, the AI must be able to model human behavior very well in order to achieve its own objectives, and modeling human behavior seems hard to do without understanding human values. The issue is that the AI’s model of human values lives in its world model, not in its utility function.

[ETA: to be clear, the problem described above – finding a utility function that actually reflects our values – is what we might nowadays call outer alignment, and excludes concerns like safe exploration, robustness to distributional shift, mesa-optimizers, etc. For the rest of this post, when I write “alignment,” assume I’m talking about outer alignment.]

Some of the first ideas for alignment were along the lines of “maybe we can (1) get the AI to learn human values, and then (2) plug the learned model of human values in as a reinforcement learner’s utility function.” Broadly speaking, let’s lump this class of approaches together as “value learning.”

One funny thing about aligning a superintelligence via value learning is that it actually leaves you with two redundant copies of human values: one copy, the one learned via value learning, lives in the utility function; and the other copy, the one learned by the RL agent, lives in the world model.[1]

Another funny thing about value learning is that getting an AI to selectively learn human values (and nothing else) seems harder than getting an AI to just form an accurate world model (which necessarily contains a model of human values inside of it). Part of the issue is that it’s easy to incentivize correct predictions: just train an ML model to make predictions, using a loss that compares its predictions against what actually happened. It’s much harder to rig up a parameterization of a utility function which is rich enough to encode human values and then train an ML model to learn the parameters (which is what inverse reinforcement learning, or IRL, tries to do).
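
To make the “easy to incentivize correct predictions” half of this concrete, here is a minimal sketch in Python. Everything in it (the hidden rule, the one-parameter model, the function names) is a toy assumption for illustration and not something from the ELK document. The point is just that the training signal comes for free from comparing predictions against observed outcomes; nothing comparably direct is available when the thing you want to learn is a utility function.

```python
# Toy predictor: learn y ≈ w * x by minimizing squared prediction error.
# The data, model, and names are illustrative assumptions only.

def mean_squared_error(w, observations):
    """Average squared gap between predictions w * x and observed outcomes y."""
    return sum((w * x - y) ** 2 for x, y in observations) / len(observations)

def train_predictor(observations, steps=200, lr=0.01):
    """Fit the single parameter w with plain gradient descent on the MSE."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in observations) / len(observations)
        w -= lr * grad
    return w

# "What actually happened": outcomes generated by a hidden rule y = 3x.
observations = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

loss_before = mean_squared_error(0.0, observations)
w = train_predictor(observations)
loss_after = mean_squared_error(w, observations)
```

After training, `w` ends up close to the hidden value 3 and `loss_after` is far below `loss_before`, with no one ever having to write down what the "right" parameter was.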

So, you might wonder, why don’t we first train an ML model to be a really good predictor – good enough that it must have a model of human values somewhere inside it[2] – and then try to “extract” that copy of human values and plug it into an RL agent as a utility function? Or in other words, why don’t we:

Step 1: elicit the model’s latent knowledge of what humans really want, …

Step 2: repackage that knowledge into a utility function, …

Step 3: and plug that utility function into an RL agent?

If you’re very optimistic about ELK, then you should feel pretty good about this approach.

That said, this story is kinda insane. It involves being able to elicit a piece of a predictor’s world model as complex and fuzzily-delimited as “human values.” If you’re so optimistic about ELK that you don’t bat an eye at this … well, I have a bridge to sell you.

That’s why the ELK document focuses on “narrow questions”: things like “Have any of the sensors been tampered with?” and “Are there any nanobots inside of my brain?” Plausibly, being able to do ELK to extract honest answers to narrow questions like these could be sufficient to at least keep humans safe for a while. And it seems that for now, ELK is being marketed as just this: not a solution to alignment, but a stopgap measure to keep us safe until we’re able to solve alignment some other way.

But note that the more questions you’re able to get ELK to work for, and the better your ideas for turning this knowledge into something resembling a human’s utility function, the closer you might be to getting something like the 3-step plan above to work. The Indirect Normativity appendix to the ELK document gives a speculative proposal for bootstrapping answers to narrow questions into a utility function (i.e. completing step 2 above). The idea is pretty crazy but not obviously impossible, and it seems reasonable to hope that we’ll be able to come up with some less crazy ideas, at least one of which might work.

So, if you’re optimistic that:

  1. ELK is possible, i.e. we’ll be able to elicit a predictor’s knowledge on some set of “narrow” questions, …

  2. and this set of questions will be large enough that we can use some creative techniques to turn their answers into a utility function, …

then you should be reasonably optimistic about alignment.

  1. ^

    Even if you’re doing model-free RL, I’d expect there to be an implicit world-model somewhere – e.g. implicitly encoded in the learned Q-function if you’re doing Q-learning – otherwise your superintelligent agent wouldn’t be able to reliably select actions which got the results it wanted.

  2. ^

    To be clear, “a world model which is able to make good predictions must contain a model of human values” is an assumption. Some beliefs that might lead you to reject it: (1) Human values don’t really exist; people just operate off of short-term heuristics, and if you explained a state-of-the-world to me and asked whether it was good or bad, my answers would generally be incoherent and inconsistent. (2) Human values can’t actually be inferred from human behavior; rather, understanding human values requires a detailed understanding of the human brain and its operation, and a superintelligent AI interested in predicting human behavior will never have cause to form this detailed an understanding of the brain. (3) Probably other stuff.