Take 7: You should talk about “the human’s utility function” less.

As a writing exercise, I’m writing an AI Alignment Hot Take Advent Calendar—one new hot take, written every day (more or less—I’m getting back on track!) for 25 days. Or until I run out of hot takes.

When considering AI alignment, you might be tempted to talk about “the human’s utility function,” or “the correct utility function.” Resist the temptation when at all practical. That abstraction is junk food for alignment research.

As you may already know, humans are made of atoms. Collections of atoms don’t have utility functions glued to them a priori—instead, we assign preferences to humans (including ourselves!) when we model the world, because it’s a convenient abstraction. But because there are multiple ways to model the world, there are multiple ways to assign these preferences; there’s no “the correct utility function.”
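
To make that underdetermination concrete, here’s a toy sketch (my illustration, not anything from this post): it assumes a made-up Boltzmann-rational choice model and invented observation data, and shows that the observations only pin down the product of “how rational the human is” and “how much they prefer one option,” so very different utility assignments fit the same behavior equally well.

```python
# Toy illustration (assumptions: Boltzmann-rational choice model, made-up data).
# We watch a human pick option A over option B 70% of the time, and model them
# as choosing A with probability sigmoid(beta * (u_A - u_B)).
# The data only determines the product beta * (u_A - u_B), so the "inferred
# utility" depends entirely on how noisy we assume the human is.

import math

def p_choose_A(beta: float, u_A: float, u_B: float) -> float:
    """Probability of choosing A under a Boltzmann-rational choice model."""
    return 1.0 / (1.0 + math.exp(-beta * (u_A - u_B)))

observed_rate = 0.7
target_logit = math.log(observed_rate / (1 - observed_rate))  # about 0.847

# Three very different utility assignments that all predict the data exactly,
# because each is paired with a different assumption about the human's noise.
for beta in (0.1, 1.0, 10.0):
    utility_gap = target_logit / beta
    print(f"beta={beta:5.1f}  inferred u_A - u_B = {utility_gap:6.3f}  "
          f"predicted P(A) = {p_choose_A(beta, utility_gap, 0.0):.3f}")
```

The broader point holds for richer models too: until you fix extra assumptions about how behavior relates to preferences, observation alone doesn’t single out a correct utility function.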

Maybe you understand all that, and still talk about an AI “learning the human’s utility function” sometimes. I get it. It makes things way easier to assume there’s some correct utility function when analyzing the human-AI system. Maybe you’re writing about inner alignment and want to show that some learning procedure is flawed because it wouldn’t learn the correct utility function even if humans had one. Or that some learning procedure would learn that correct utility function. It might seem like this utility function thing is a handy simplifying assumption, and once you have the core argument you can generalize it to the real world with a little more work.

That appearance is deceptive. You have likely just shot yourself in the foot.

Because the human-atoms don’t have a utility function glued to them, building aligned AI requires doing something materially different from learning “the human’s utility function.” Something more like learning a trustworthy process. If you’re not tracking that difference and you use “the human’s utility function” as a target of convenience, you can all too easily end up with AI designs that aren’t trying to solve the problems we actually face in reality; instead they’re navigating their own strange, quasi-moral-realist problems.

Another way of framing that last thought might be that wrapper-minds are atypical. They’re not something that you actually get in reality when trying to learn human values from observations in a sensible way, and they have alignment difficulties that are idiosyncratic to them (though I don’t endorse the extent to which nostalgebraist takes this).

What to do instead? When you want to talk about getting human values into an AI, tie the discussion of those values to the process the AI uses to infer them. Take the AI’s perspective, maybe: it has a hard and interesting job trying to model the world in all its complexity, so long as you don’t short-circuit that job by insisting that actually it should just be trying to learn one thing (that doesn’t exist). Or take the humans’ perspective: what options do they have for communicating what they want to the AI, and how can they gain trust in the AI’s process?

Of course, maybe you’ll consider the AI’s value-inference process and find that its details make no difference whatsoever to the point you were trying to make. But in that case, the abstraction of “the human’s utility function” probably wasn’t doing any work anyhow. Either way, you lose nothing by dropping it.