The evaluation function of an AI is not its aim

I think this is a relatively straightforward point, but it’s one that has definitely led to confused thinking on my part about AI alignment, and presumably on others’ part too. After writing this post Ruby pointed out there’s an entire sequence on this. However, I think this post still provides value by being shorter and less formal.

The way most machine learning systems learn involves performing a task using some algorithm (e.g. translate this text into Japanese), evaluating the outcome using some evaluation function (e.g. how close is this to the human translation), and then adapting the original algorithm so that it would do better on that task according to the given evaluation function.
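As a concrete (and entirely toy) sketch of that loop, here is the skeleton in plain Python, with a one-parameter linear model standing in for the translation algorithm; the function names here are mine for illustration, not any real library's API:

```python
def perform_task(params, x):
    # The "algorithm": here just a linear guess standing in for translation.
    return params["w"] * x

def evaluation_function(output, target):
    # The "evaluation function": squared distance from the reference answer.
    return (output - target) ** 2

def update(params, x, target, lr=0.01):
    # Adapt the algorithm so it scores better next time (a gradient step on w).
    output = perform_task(params, x)
    grad_w = 2 * (output - target) * x  # d(loss)/dw for this toy model
    params["w"] -= lr * grad_w
    return evaluation_function(output, target)

params = {"w": 0.0}
for step in range(100):
    loss = update(params, x=2.0, target=6.0)  # the "correct" w here would be 3.0
print(params["w"], loss)
```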

I think there’s a tendency to conflate this evaluation function with the aims of the AI itself, which leads to some invalid conclusions.

I started thinking about this when I thought about GPT3. The task GPT3 performs is to predict the continuation of a piece of text. It is evaluated on how close that prediction is to the actual continuation. It then uses backpropagation to adjust the parameters of the model to do better next time.
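For concreteness, here is roughly what that evaluation step looks like for next-token prediction, with a made-up four-word vocabulary and made-up model scores; only the loss is shown, not the backpropagation step that then adjusts the parameters:

```python
import math

# Made-up vocabulary and model scores, purely for illustration.
vocab_scores = {"cat": 0.1, "sat": 2.3, "mat": 0.4, "dog": -1.0}
actual_next_token = "sat"

# Softmax turns the scores into a probability distribution over the vocabulary.
total = sum(math.exp(s) for s in vocab_scores.values())
probs = {tok: math.exp(s) / total for tok, s in vocab_scores.items()}

# The quantity training pushes down (via backpropagation) is the negative log
# probability the model assigned to the token that actually came next.
loss = -math.log(probs[actual_next_token])
print(probs[actual_next_token], loss)
```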

I asked myself: if we scaled up GPT3 until it had superhuman performance on this task, might we end up with an alignment problem?

My first thought was yes: GPT3 might work out that it was being fed articles from the internet, for example, and hack the internet so that all articles have identical text, giving it 100% accuracy on its evaluation function.

I think that the chances of this happening are actually quite slim. It’s easier to realize this if we compare machine learning to evolution, where we have the benefit of hindsight to see what actually occurred.

Evolution works similarly to machine learning. An organism performs a task (reproducing) and is evaluated on how well it replicates itself. Evolution uses a hill-climbing algorithm (random mutation + natural selection) to improve at that task.
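Spelled out as code, the analogy is just a hill climber. The following is a purely illustrative sketch, with a made-up fitness function standing in for reproductive success:

```python
import random

def fitness(genome):
    # Made-up stand-in for reproductive success: more 1s, more offspring.
    return sum(genome)

genome = [0] * 20
for generation in range(500):
    mutant = list(genome)
    i = random.randrange(len(mutant))
    mutant[i] = 1 - mutant[i]      # random mutation: flip one bit
    if fitness(mutant) >= fitness(genome):
        genome = mutant            # selection: keep whichever replicates better
print(genome, fitness(genome))
```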

People have often wondered why humans don’t intrinsically want to reproduce. For example, why do humans want to have sex even when birth control is being used? Eliezer has a great sequence on that, but the key point is that humans are adaptation executers, not fitness maximizers. The pressure of evolution causes adaptations to persist which have proven useful for maximizing reproduction in the past. It doesn’t make lifeforms actually care about reproduction.

Similarly the pressure of the evaluation function causes features to emerge in a machine learning system which have proven useful in the past at maximizing the evaluation function. But that doesn’t mean the machine learning system will necessarily care about the evaluation function. It might, but there’s no reason to assume it will.

What this means is that I find it unlikely that even a superhuman GPT3 would try to hack the internet. The pressure of its evaluation function is likely to cause it to gain an ever deeper understanding of human thought and how people write things, but it seems unlikely that it will ever try to actively change the world to improve its predictions. The reason is that it’s too difficult for the evaluation function to ever promote that: hacking the world wouldn’t improve prediction until it was done very well, so there’s no obvious pathway for such abilities to develop.

To be clear, I think it very likely that GPT3 will be intelligent enough to know exactly how to hack the internet to improve its predictions; after all, it needs to be able to predict how an essay about how the internet was hacked would continue. It might also have consciousness and conscious desires to do things. But I find it unlikely that it would have a conscious desire to hack the internet, since its evaluation function would have no straightforward way to cause that desire to appear.

Now I’m not saying this couldn’t happen. I’m saying that it’s far less likely to happen than if GPT3 directly had the aim of minimizing predictive error—you have to add in an extra step where the evaluation function causes GPT3 to itself be an optimizer which minimizes predictive error.

This is both good news and bad.

Firstly, it means that the instrumental convergence thesis seems less powerful. The instrumental convergence thesis states that agents with wildly varying goals are likely to pursue similar power-seeking behavior, regardless of their goal, as a means to achieving their ultimate goal. But if superintelligences are adaptation executers, not fitness maximizers, then they may not necessarily have goals at all.

Unfortunately, they might well have goals, just as humans do. And those goals might not be to maximize their evaluation function, but some other goals which happened to be effective in the past at maximizing their evaluation function.

I’ve seen people try to solve alignment by coming up with an evaluation function that perfectly reflects what humanity wants. This approach seems doomed to fail—the superintelligence is unlikely to have the aligned evaluation function as a goal.