Thoughts and problems with Eliezer’s measure of optimization power

Back in the day, Eliezer proposed a method for measuring the optimization power (OP) of a system S. The idea is to get a measure of how small a target the system can hit:

You can quantify this, at least in theory, supposing you have (A) the agent or optimization process’s preference ordering, and (B) a measure of the space of outcomes—which, for discrete outcomes in a finite space of possibilities, could just consist of counting them—then you can quantify how small a target is being hit, within how large a greater region.

Then we count the total number of states with equal or greater rank in the preference ordering to the outcome achieved, or integrate over the measure of states with equal or greater rank. Dividing this by the total size of the space gives you the relative smallness of the target—did you hit an outcome that was one in a million? One in a trillion?

Actually, most optimization processes produce “surprises” that are exponentially more improbable than this—you’d need to try far more than a trillion random reorderings of the letters in a book, to produce a play of quality equalling or exceeding Shakespeare. So we take the log base two of the reciprocal of the improbability, and that gives us optimization power in bits.

For example, assume there were eight equally likely possible states {X0, X1, … , X7}, and S gives them utilities {0, 1, … , 7}. Then if S can make X6 happen, there are two states better or equal to its achievement (X6 and X7), hence it has hit a target filling 1/4 of the total space. Hence its OP is log2 4 = 2. If the best S could manage is X4, then it has only hit half the total space, and has an OP of only log2 2 = 1. Conversely, if S reached the perfect X7, 1/8 of the total space, then it would have an OP of log2 8 = 3.
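As a sanity check, the example above can be computed directly. This is a minimal Python sketch (the function name `op_bits` is my own, not from Eliezer's post):

```python
import math

def op_bits(utilities, achieved_index):
    """Optimization power in bits, over a finite set of equally likely states.

    utilities: one utility value per possible state.
    achieved_index: index of the state the system actually achieved.
    """
    achieved_u = utilities[achieved_index]
    # Count the states ranked equal to or better than the achieved one.
    at_least_as_good = sum(1 for u in utilities if u >= achieved_u)
    return math.log2(len(utilities) / at_least_as_good)

utilities = list(range(8))       # X0..X7 with utilities 0..7
print(op_bits(utilities, 6))     # X6: log2(8/2) = 2.0
print(op_bits(utilities, 4))     # X4: log2(8/4) = 1.0
print(op_bits(utilities, 7))     # X7: log2(8/1) = 3.0
```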

The system, the whole system, and everything else in the universe

Notice that OP is defined in terms of the state that S achieved (for the moment this will be a pure world, but later we’ll allow probabilistically mixed worlds to be S’s “achievement”). So it gives us a measure of how powerful S is in practice in our model, not some platonic measure of how good S is in general situations. So an idiot king has more OP than a brilliant peasant; a naive search algorithm distributed across the internet has more OP than a much better program running on Colossus. This does not seem a drawback to OP: after all, we want to measure how powerful a system actually is, not how powerful it could be in other circumstances.

Similarly, OP measures the system’s ability to achieve its very top goals, not how hard these goals are. A system that wants to compose a brilliant sonnet has more OP than exactly the same system that wants to compose a brilliant sonnet while embodied in the Andromeda galaxy. Even though the second is plausibly more dangerous. So OP is a very imperfect measure of how powerful a system is.

We could maybe extend this to some sort of “opposed OP”: what is the optimization power of S, given that humans want to stop it from achieving its goals? But even there, a highly powerful system with nearly un-achievable goals will still have a very low opposed OP. Maybe the difference between the opposed OP and the standard OP is a better measure of power.

As pointed out by Tim Tyler, OP can also increase if we change the size of the solution space. Imagine an agent that has to print out a non-negative integer N, and whose utility is -N. The agent will obviously print 0, but if the printer is limited to ten digit numbers, its OP is smaller than if the printer is limited to twenty digit numbers: though the solution is just as easy and obvious, the number of ways it “could have been worse” is increased, increasing OP.
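Tim Tyler's point can be made concrete with a quick sketch of the printer example (numbers are mine; the agent always prints the unique best outcome, 0, yet its OP grows with the digit limit):

```python
import math

# Hypothetical printer agent: it prints a non-negative integer N and has
# utility -N, so it always prints 0, the unique best outcome.
for digits in (10, 20):
    n_outcomes = 10 ** digits           # integers 0 .. 10^digits - 1
    op = math.log2(n_outcomes / 1)      # only one state (N = 0) is at least as good
    print(digits, "digits:", round(op, 1), "bits")   # ~33.2 bits vs ~66.4 bits
```

Same trivial decision, double the bits, purely because the space of ways it "could have been worse" doubled in log-size.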

Is OP an entropy? Is it defined for mixed states?

In his post Eliezer makes a comparison between OP and entropy. And OP does have some of the properties of entropy: for instance, if S is optimizing two separate independent processes (and its own utility treats them as independent), then its OP is the sum of the OPs for each process. If for instance S hit an area of 1/4 in the first process (OP 2) and 1/8 in the second (OP 3), then it hits an area of 1/(4*8) = 1/32 for the joint processes, for an OP of 5. This property, incidentally, is what allows us to talk about “the” entropy of an isolated system, without worrying about the rest of the universe.
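The additivity claim is just the fact that logs turn products into sums; a one-line check (here `op_bits` takes the fraction of the space hit, a helper of my own):

```python
import math

def op_bits(fraction_hit):
    # OP from the fraction of outcome space at least as good as S's achievement.
    return math.log2(1 / fraction_hit)

# Independent processes: the fractions multiply, so the OPs (logs) add.
assert op_bits((1/4) * (1/8)) == op_bits(1/4) + op_bits(1/8)  # 2 + 3 = 5
```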

But now imagine that our S in the first example can’t be sure to hit a pure state, but has a 50% chance of hitting X7 and a 50% chance of hitting X4. If OP were an entropy, then we’d simply do a weighted sum 1/2 (OP(X4) + OP(X7)) = 1/2 (1+3) = 2, and then add one extra bit of entropy to represent our (binary) uncertainty as to what state we were in, giving a total OP of 3. But this is the same OP as X7 itself! And obviously a 50% chance of X7 and a 50% chance of something inferior cannot be as good as a certainty of the best possible state. So unlike entropy, mere uncertainty cannot increase OP.
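The failed entropy-style calculation, spelled out with the numbers from the example:

```python
op_x4, op_x7 = 1.0, 3.0                 # OPs of the two pure outcomes
weighted_sum = 0.5 * (op_x4 + op_x7)    # 2.0
mixing_entropy = 1.0                    # one bit of 50/50 uncertainty
naive_mixed_op = weighted_sum + mixing_entropy
print(naive_mixed_op)                   # 3.0 -- the same as OP(X7), absurdly
```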

So how should OP extend to mixed states? Can we write a simple distributive law:

OP(1/2 X4 + 1/2 X7) = 1/2 (OP(X4) + OP(X7)) = 2?

It turns out we can’t. Imagine that, without changing anything else, the utility of X7 is suddenly set to ten trillion, rather than 7. The OP of X7 is still 3: it’s still the best option, still with probability 1/8. And yet 1/2 X4 + 1/2 X7 is now obviously much, much better than X6, which has an OP of 2. But now let’s reset the utility of X6 to ten trillion minus 1. Then X6 still has an OP of 2, and yet is now much, much better than 1/2 X4 + 1/2 X7.
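The root cause is that OP depends only on the preference ordering, never on the utility values themselves; a sketch (reusing the `op_bits` helper from earlier, my own name):

```python
import math

def op_bits(utilities, achieved_index):
    # OP in bits over equally likely states; only the ordering matters.
    achieved_u = utilities[achieved_index]
    at_least_as_good = sum(1 for u in utilities if u >= achieved_u)
    return math.log2(len(utilities) / at_least_as_good)

# Rescaling X7's utility from 7 to ten trillion changes its OP not at all:
assert op_bits([0, 1, 2, 3, 4, 5, 6, 7], 7) == \
       op_bits([0, 1, 2, 3, 4, 5, 6, 10**13], 7) == 3.0
```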

But I may have been unfair in those examples. After all, we’re looking at mixed states, and X6 need not have a fixed OP of 2 in the space of mixed states. Maybe if we looked at the simplex formed by all mixed states made up of {X0, X1, … , X7}, we could get these results to work? Since all Xi are equally likely, we’d simply put a uniform measure on that simplex. But now we run into another problem: the OP of X7 has suddenly shot up to infinity! After all, X7 is now an event of probability zero, better than any other outcome; the log2 of the inverse of its probability is infinity. Even if we just restrict to a tiny non-zero area around X7, we get arbitrarily high OP—it’s not a fluke or a calculation error. Which means that if we followed the distributive law, Q = (1-10^-1000) X0 + 10^-1000 X7 must have a much larger OP than X6, despite the fact that nearly every possible outcome is better than Q.
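A Monte Carlo sketch of the simplex problem, assuming a uniform measure on the simplex (sampled as Dirichlet(1, …, 1) via normalized exponentials): mixtures anywhere near pure X7 are hit so rarely that the estimated OP blows up.

```python
import math
import random

random.seed(0)
utilities = list(range(8))   # X0..X7 with utilities 0..7

def sample_mixture():
    # Uniform point on the 7-simplex: normalized i.i.d. exponentials.
    g = [random.expovariate(1) for _ in utilities]
    total = sum(g)
    return [x / total for x in g]

def expected_utility(p):
    return sum(pi * ui for pi, ui in zip(p, utilities))

# Fraction of random mixtures at least as good as a mixture very close to pure X7.
n = 200_000
hits = sum(1 for _ in range(n) if expected_utility(sample_mixture()) >= 6.9)
print(hits)  # essentially always 0: the measured fraction vanishes, so OP diverges
```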

So it seems that, unlike entropy, OP cannot have anything resembling a distributive law. The set of possible outcomes that you started with—including any possible mixed outcomes that S could cause—is what you’re going to have to use. This sits uncomfortably with the whole Bayesian philosophy: there, mixed states shouldn’t represent anything but uncertainty between pure states, and they shouldn’t be listed as separate outcomes.

Measures and coarse-graining

In the previous section, we moved from using a finite set of equally likely outcomes, to a measure over a simplex of mixed outcomes. This is the natural generalisation of OP: simply compute the probability measure of the states better than what S achieves, and use the log2 of the inverse of this measure as OP.
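In the continuous setting this generalisation reads very simply; for example, assuming outcomes uniform on [0, 1] with utility u(x) = x (my own toy setup):

```python
import math

# Measure-based OP: the states at least as good as x have measure (1 - x).
def op_bits_continuous(x):
    return math.log2(1 / (1 - x))

print(round(op_bits_continuous(0.9), 2))   # about 3.32 bits: a one-in-ten target
```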

Some of you may have spotted the massive elephant in the room, whose mass twists space and undermines the very definition of OP. What does this probability measure actually represent? Eliezer saw it in his original post:

The quantity we’re measuring tells us how improbable this event is, in the absence of optimization, relative to some prior measure that describes the unoptimized probabilities.

For how else could I write “there were eight equally likely possible states” and “S can make X6 happen”? Well, obviously, what I meant was that if S didn’t exist, then it would be equally likely that X7 and X6 and X5 and X4 and...

But wait! These Xi’s are final states of the world—so they include the information as to whether S existed in them or not. So what I’m actually saying is that {X0(¬S), X1(¬S), … , X7(¬S)} (the worlds with no S) are equally likely, whereas Xi(S) (the worlds with S) are impossible for i≠6. But what has allowed me to identify X0(¬S) with X0(S)? I’m claiming they’re the same world “apart from S”—but what does this mean? After all, S can have huge impacts, and X0(S) is actually an impossible world! So I’m saying “these two worlds are strictly the same, apart from the fact that S exists in one of them—but then again, S would never allow that world to happen if it did exist, so, hum...”

Thus it seems that we need to use some sort of coarse-graining to identify Xi(¬S) with Xi(S), similar to those I speculated on in the reduced impact post.