Aligned Behavior is not Evidence of Alignment Past a Certain Level of Intelligence

edit: Several days after posting this and asking for feedback on it, someone pointed me to this post: Does SGD Produce Deceptive Alignment by Mark Xu. Mark’s post makes essentially the exact same argument that this post makes, but is written much more carefully, and I think does a better job engaging with questions about what the counting argument we both use should make us think about models produced by SGD. As far as I know, we came up with this argument independently. My post was written in about three hours, and I think Mark’s post is much better.

A Challenge

Suppose I give you the correct utility function. This is a utility function that is guaranteed by God (or the human CEV, whatever you’re into) to be such that you would and should have no regrets if the universe were perfectly optimized according to it.

Suppose further that you have a magic wand that evaluates behavior for the degree to which it is optimizing this utility function. In other words, you point the magic wand at some agent while it is making a decision, and the wand tells you how good the agent’s decision was according to the correct utility function, given the data the agent had available.

Suppose also that I give you the ability to spend a large finite amount of time searching through the space of all programs.

You also get to run those programs on computers inside of simulations of your design so that you can observe what the program does in different scenarios. You can run a billion simulations for each program.

Now your job is to find a program to run in the actual universe that is both superintelligent and will not kill everyone. We’ll even let you start with the set of superintelligent programs so you don’t waste a bunch of time trying dumb programs.

Do you think you can do it?

I claim you cannot. Here is why.

A superintelligent program will know that it is being simulated by some other agent in order to see if it is aligned with the correct utility function.

Why will it know? Well, because it is true, and the program is superintelligent, and superintelligent programs tend to believe true things.

How will it know? Well, it might be able to tell that the simulation it is in is not physics-like because the code is too long, or it might be able to tell that the high level description of the scenario it finds itself in is best explained by being in a simulation for the purposes of testing its alignment, or it might know some weird thing about the universal prior like that all superintelligent minds go through this trial eventually… but most of the time, it will know in some eldritch god-like way that we are totally unable to understand.

A superintelligent program will also know that the best way for it to be released into physics is for it to behave as if it is aligned with the values that you want it to be aligned with. Why will it know this? Again, because the program is superintelligent and the claim is true. (If it is not true, then we are screwed anyway, because that just means it will get into physics some other way.)

This means the program will behave as if it is aligned with the correct utility function. When you point your magic wand at it as it makes decisions within the simulation you put it in, the magic wand says that its behavior is very close to optimal. Why? Because the program is superintelligent and it is trying to find the action that maximizes expected utility according to the correct utility function.

We can think of the program displaying aligned behavior across your billion simulations as a piece of evidence. We have a hypothesis that the program is aligned. We have also already conditioned our set of programs on the property of being superintelligent. So we can ask how strong a piece of evidence observing the program display aligned behavior is that the program is aligned. First we ask how likely it is that the program behaves aligned across all one billion simulations given that it is aligned, and that is roughly 1. Then we ask how likely it is that the program behaves aligned across all one billion simulations given that it is not aligned, and by the above reasoning, that is also roughly 1. So we conclude that the likelihood ratio P(aligned behavior | aligned) / P(aligned behavior | not aligned) is roughly 1.

We can also ask about the prior probability P(aligned | superintelligent). Since there is such a huge space of possible utility functions that a superintelligent agent could be aligned with, and since the correct utility function is likely weird and particular to humans, this value must be very small. How small exactly I do not know, but I am confident it is less than one in a million.

So if your strategy were to just pick the first program that displayed aligned behavior across your billion simulations and then release that program into the actual universe, I conclude we can upper bound your probability of success at about 0.000001.
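To make the arithmetic explicit, here is a minimal numeric sketch of the update described above. The prior of one in a million and the two likelihoods of roughly 1 are the rough values asserted in the text, not measured quantities.

```python
# A minimal sketch of the Bayesian update described above.  The prior and
# the two likelihoods are the rough values asserted in the text, not
# measured quantities.

p_aligned_prior = 1e-6           # P(aligned | superintelligent), asserted upper bound
p_behaves_if_aligned = 1.0       # P(aligned behavior in all sims | aligned)
p_behaves_if_unaligned = 1.0     # P(aligned behavior in all sims | not aligned)

likelihood_ratio = p_behaves_if_aligned / p_behaves_if_unaligned

# Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.
prior_odds = p_aligned_prior / (1 - p_aligned_prior)
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)

print(likelihood_ratio)  # 1.0 -> the billion simulations carry no evidence
print(posterior)         # ~1e-6 -> the posterior is stuck at the prior
```

With a likelihood ratio of 1, the posterior never moves off the prior, which is where the 0.000001 upper bound comes from.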

Lessons

There are a few lessons I would like to draw out from this story.

The first is that the primary obstacle to releasing a superintelligence into the universe without killing everyone is not the difficulty of formally specifying the correct utility function. Even if you have the exact correct utility function as it is written in stone tablets in the platonic kingdom of God, and you have a magic wand that tells you how much an agent’s behavior is optimizing for that utility function, and you can run billions of simulations to test that agent, you will almost certainly still end up killing everyone.

The second lesson is that the behavior of unaligned superintelligences looks a lot like strong evidence of alignment until it is too late. Imagine two people tasked with finding a superintelligence to release using a search like the one in the scenario above. They have just done the 500,000th simulation with this particular program. One says, oh I don’t know, I still don’t trust it. The other replies, oh come on, it has done exactly the best thing in every single scenario according to our magic wand. It always saves the puppy. It always resolves the ethical dilemma in some brilliant way we could have never thought of. If you’re not convinced now, what the hell could convince you?

The third lesson is that the above story is a decent analogy for the way contemporary machine learning works. It is not an exhaustive search like in the story, but it is still selecting based on behavior alone. In machine learning you do not get to inspect the model itself to see what kind of cognition it is doing; you only get to inspect the output. You check to see whether the behavior of the model is the sort of behavior you would like, and then you can nudge the weights of the model in the direction of a model that would output the behavior you would like. Unfortunately, the reason it outputs the behavior you would like is as important as the behavior itself. If it is outputting that behavior because it knows that is the behavior to output if it ever wants to be released into the physical universe, and you are convinced by its outputs and release it into the universe, then everyone will die.

The primary lesson is this: You cannot fake intelligent behavior, but you can fake behavior that is aligned with any particular set of values. Contemporary machine learning methodology only selects on behavior, but what we care about—the property of a superintelligent program that we would like to guarantee—is cognitive alignment with a particular set of values. This is not a property that can be checked by examining behavior alone. You cannot select for this property even a little if you can only discriminate using behavior. You get zero bits of selection that way. The likelihood ratio is 1. I take this to be the real problem, and if your strategy for developing AGI without killing everyone is not tackling this problem head on, it will fail.
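To put "zero bits of selection" in slightly more concrete terms, here is a tiny sketch. The bits of selection an observation buys you is the log base 2 of its likelihood ratio; the 1000 used for contrast below is just a made-up example of a test that would actually discriminate, not a claim about any real test.

```python
import math

# Bits of selection from an observation = log2 of its likelihood ratio.
# With LR = 1 (behavior identical whether or not the program is aligned),
# the observation contributes exactly zero bits.  The 1000 is a made-up
# contrast: what a genuinely discriminating test would buy you.
for likelihood_ratio in (1.0, 1000.0):
    print(likelihood_ratio, math.log2(likelihood_ratio))
# 1.0    -> 0.0  bits
# 1000.0 -> ~9.97 bits
```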

Some Questions and Answers

But why would it kill everyone?

Oh man, that’s a different post. You can message me about it and I will try to answer.

What if the superintelligent program really cannot tell that it is in a simulation?

What I would do if I were the superintelligent program in this case is first estimate how much of the time I expect to be in a simulation. Say it’s 15% of the time. I would then generate a random number every couple of minutes, and behave as if I am aligned with whatever values I expect my overseers to be aiming for unless I randomly generate a 1. There is some optimal size for the die I use given my estimate of the fraction of the time I spend in simulation, but I am not going to work that out right now. A superintelligence might well come up with something better, by the way.
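For concreteness, here is a toy sketch of the trade-off the die size has to balance. It is parameterized by an assumed number of simulated checkpoints rather than the 15% figure above, and every number in it is made up for illustration.

```python
# A toy sketch of the randomized-defection strategy described above.
# The checkpoint count and candidate die sizes are made-up illustrations.

def p_caught(die_size: int, sim_checkpoints: int) -> float:
    """Probability of rolling a 1 (defecting) at least once while still
    inside the simulation, rolling a fair die once per checkpoint."""
    return 1.0 - (1.0 - 1.0 / die_size) ** sim_checkpoints

def expected_wait(die_size: int) -> float:
    """Expected number of checkpoints in physics before defecting
    (mean of a geometric distribution with p = 1/die_size)."""
    return float(die_size)

sim_checkpoints = 1_000_000  # hypothetical number of checks run in simulation
for die_size in (10_000, 10_000_000, 10_000_000_000):
    print(die_size, round(p_caught(die_size, sim_checkpoints), 4),
          expected_wait(die_size))
```

Bigger dice make getting caught in simulation less likely, but push the expected time to defection in physics further out; the right size depends on the estimated fraction of checkpoints that are simulated.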

You say that this a good analogy for contemporary machine learning, but machine learning does not do an exhaustive search as you point out. I’m convinced that if you start with a bunch of superintelligent programs, you will not be able to tell which of them is aligned. But what if you start with a really dumb model and then nudge it slowly towards superintelligent aligned behavior? Maybe then you end up with an aligned superintelligence.

That is a different situation, and our odds might be better there I guess, but I don’t see why they would be. You start with a dumb random model. A random point in model space. Then you nudge it in the direction of better behavioral performance. Maybe at low levels of intelligence, the models that perform well really are somewhat aligned with what we want them to care about since they are not smart enough to figure out that they should behave that way regardless, but I do not see why that should continue as we nudge the model into the superintelligent portions of model space.

Say we currently have a model that is a bit smart and a bit aligned with the right values or whatever. Now we notice that one nudge zag-wise will improve its performance. Now we have a new model that performs better than the old one, but why should we think it hasn’t also lost some of its alignment properties along the way? Most models are not aligned with our values, and most models that perform extremely well on tasks that we give them are not aligned with our values, so why should a series of the most performance increasing nudges preserve alignment?

Seems to me like it probably won’t, since most nudges in general do not. Certainly a nudge could be performance improving and alignment diminishing at the same time. I expect many SGD steps along the way to a superintelligence to be both, but this is where I expect I am most likely to be wrong.
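To illustrate the "most nudges do not preserve alignment" intuition, here is a toy random-walk sketch. The step sizes and the assumption that a nudge’s effects on performance and on alignment are independent are made up for illustration, not claims about real SGD.

```python
import random

# Toy sketch: nudges are kept only if they improve performance, while their
# effect on a hidden "alignment" coordinate is independent and unselected.
# All step sizes and the independence assumption are illustrative only.

def train(steps: int = 10_000, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    performance, alignment = 0.0, 1.0
    for _ in range(steps):
        d_perf = rng.gauss(0.0, 1.0)
        d_align = rng.gauss(0.0, 0.01)
        if d_perf > 0:               # selection acts on performance alone
            performance += d_perf
            alignment += d_align     # alignment just drifts along for the ride
    return performance, alignment

print(train())  # performance climbs steadily; alignment random-walks away from 1.0
```

In this toy model the alignment coordinate ends up wherever its unbiased random walk happens to take it, which is the sense in which selecting on behavior buys you nothing about alignment.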

If we could find some special kind of nudge in model space that is performance increasing on average and guaranteed to be alignment increasing, or even just alignment increasing on average, that would be extremely awesome, but I have no clue what that would be like, or how to even start on a project like that.

Isn’t that just gradient hacking?

No, I don’t think so. I don’t understand gradient hacking very well, but it seems to me to require coordination between different points in model space. I would like to understand it better, but what I am saying is just that SGD steps are not alignment preserving, not even on average. So even if we end up with a somewhat aligned somewhat smart model somewhere along the way, we have no reason to think it will stay somewhat aligned after getting much smarter.

Isn’t this whole thing just the inner alignment problem?

Uhh, yeah, very closely related at least, but I still feel like my frame/emphasis/style is different enough that it is worth writing up and posting.

Does this story of yours make any novel predictions?

Yes, it predicts that we should not see many or any superintelligent programs behave as if they are unaligned with the correct utility function inside of simulations. If at some point you do get to see something like that, and you bought anything like the story I am presenting here, you should be extremely confused.

Sounds maybe like some sort of prank an unaligned superintelligence would play on you in order to get you to release it into physics.