Operationalizing compatibility with strategy-stealing

Thanks to Noa Nabeshima and Kate Woolverton for helpful comments and feedback.

Defining optimization power

One of Eliezer’s old posts which I think has stood the test of time the best is his “Measuring Optimization Power.” In it, Eliezer defines optimization power as follows.[1] Let $A$ be some action space and $\mu$ be some probability measure over actions. Then, for some utility function $U$ and particular action $a^*$, Eliezer defines the bits of optimization power in $a^*$ as

$$OP_\mu(U, a^*) = -\log_2 \mu\left(\{a \in A \mid U(a) \geq U(a^*)\}\right)$$

which, intuitively, is the number of times that you have to cut the space $A$ in half before you get an action as good according to $U$ as $a^*$.
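To make the definition concrete, here is a minimal sketch for the special case of a finite action space, where the measure of "actions at least as good as $a^*$" is just a sum of probabilities. The function name `optimization_power`, the dictionary representation of $\mu$, and the toy eight-action example are all assumptions made for illustration, not anything from Eliezer's post.

```python
import math

def optimization_power(actions, mu, utility, a_star):
    """Bits of optimization power in a_star, for a finite action space.

    actions: list of actions (a finite stand-in for the action space A)
    mu:      dict mapping each action to its probability under the base measure
    utility: function from an action to a real number (U)
    """
    threshold = utility(a_star)
    # Total probability of the actions that are at least as good as a_star.
    mass = sum(mu[a] for a in actions if utility(a) >= threshold)
    return -math.log2(mass)

# Toy example (hypothetical): eight equally likely actions, utility = index.
actions = list(range(8))
mu = {a: 1 / 8 for a in actions}
utility = lambda a: a

best_bits = optimization_power(actions, mu, utility, 7)    # top 1/8 of the space
median_bits = optimization_power(actions, mu, utility, 4)  # top 1/2 of the space
```

In this toy case the best action sits in the top eighth of the space, so it takes three halvings to reach it, while an action in the top half takes one.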

In my opinion, however, a better, more intuitive version of the above definition can be obtained by using quantilizers. A $q$-quantilizer relative to some utility function $U$ and base distribution over actions $\mu$ is a system which randomly selects an action from the top $q$ fraction of actions from $\mu$ sorted by $U$. Thus, a $0.1$-quantilizer selects actions randomly from the top 10% of actions according to $U$. Intuitively, you can think about this procedure as being basically equivalent to randomly sampling $1/q$ actions from $\mu$ and picking the best according to $U$.
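As a rough sketch of what a quantilizer does, again assuming a finite action space, the code below ranks actions by utility, keeps the top $q$ of the probability mass, and samples from the base distribution restricted to that slice. It also includes the "sample several actions and keep the best" procedure for comparison. The names `quantilize` and `best_of_n` and the toy setup are hypothetical.

```python
import random

def quantilize(actions, mu, utility, q, rng=random):
    """Sample from mu restricted to the top-q fraction (by mass) of actions."""
    ranked = sorted(actions, key=utility, reverse=True)
    top, weights, remaining = [], [], q
    for a in ranked:
        if remaining <= 0:
            break
        # The boundary action may fall only partly inside the top-q slice.
        w = min(mu[a], remaining)
        top.append(a)
        weights.append(w)
        remaining -= w
    return rng.choices(top, weights=weights, k=1)[0]

def best_of_n(actions, mu, utility, n, rng=random):
    """Draw n actions from mu and keep the best one according to utility."""
    samples = rng.choices(actions, weights=[mu[a] for a in actions], k=n)
    return max(samples, key=utility)

# Toy example (hypothetical): eight equally likely actions, utility = index.
actions = list(range(8))
mu = {a: 1 / 8 for a in actions}
utility = lambda a: a

rng = random.Random(0)
picked = quantilize(actions, mu, utility, 0.25, rng)   # always 6 or 7 here
alternative = best_of_n(actions, mu, utility, 4, rng)  # best of 1/q = 4 draws
```

The two procedures are not identical in distribution: best-of-$n$ can occasionally return a poor action, whereas the quantilizer never leaves the top slice. They are only roughly comparable in how strongly they optimize.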

Now, using quantilizers, we can give a nice definition of optimization power for an entire model. That is, given a model $M$, let $q^*$ be the smallest fraction[2] such that a $q^*$-quantilizer with base distribution $\mu$ is at least as good[3] at satisfying $U$ as $M$. Then, let $OP_\mu(U, M) = -\log_2 q^*$. What’s nice about this is that it gives us a measure of optimization power for a whole model and a nice intuitive picture of what it would look like for a model to have that much optimization power: it would look like a $q^*$-quantilizer.
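The following is a hedged sketch of how the model-level definition could be computed in a toy one-shot setting. It makes several assumptions that are not in the text above: the action space is finite, the model is summarized by a single number (its expected utility, `model_value`), and "at least as good" means "at least as much expected utility". Because a quantilizer only gets better as $q$ shrinks, the sketch reads $q^*$ as the threshold fraction at which the quantilizer's expected utility drops to the model's, and finds it by bisection. That reading is itself an assumption.

```python
import math

def quantilizer_value(actions, mu, utility, q):
    """Expected utility of a q-quantilizer over a finite action space."""
    ranked = sorted(actions, key=utility, reverse=True)
    total, remaining = 0.0, q
    for a in ranked:
        if remaining <= 0:
            break
        w = min(mu[a], remaining)  # boundary action may be partly included
        total += w * utility(a)
        remaining -= w
    return total / q

def model_optimization_power(actions, mu, utility, model_value, steps=60):
    """Return -log2(q*), where q* is the threshold fraction matching the model."""
    if quantilizer_value(actions, mu, utility, 1.0) >= model_value:
        return 0.0  # sampling from mu directly already matches the model
    lo, hi = 0.0, 1.0  # the quantilizer at hi is strictly worse than the model
    for _ in range(steps):
        mid = (lo + hi) / 2
        if quantilizer_value(actions, mu, utility, mid) >= model_value:
            lo = mid
        else:
            hi = mid
    # lo == 0 means no quantilizer matched the model in this toy setting.
    return math.inf if lo == 0.0 else -math.log2(lo)

# Toy example (hypothetical): eight equally likely actions, utility = index.
actions = list(range(8))
mu = {a: 1 / 8 for a in actions}
utility = lambda a: a

bits = model_optimization_power(actions, mu, utility, model_value=5.5)
```

Here a model with expected utility 5.5 performs like a quantilizer that samples from the top half of the actions, so it comes out at roughly one bit.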

Both of these definitions do still leave the distribution $\mu$ unspecified, but if we want a very general notion of optimization power then I would say that $\mu$ should probably be some sort of universal prior such that simple policies are weighted more heavily than their more complex counterparts. If we use the universal prior, we get the nice property that the more complex the policy needed to optimize some utility function, the more optimization power is needed. Thus, we can replace $OP_\mu$ with just $OP$ where $\mu$ is assumed to be some universal prior.
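A true universal prior is uncomputable, so the sketch below substitutes a toy stand-in: each policy is represented by a bit-string description, and its weight is proportional to $2^{-\text{length}}$. The function `simplicity_prior` and the four example descriptions are hypothetical, chosen only to show how longer (more complex) policies end up costing more bits to single out.

```python
import math

def simplicity_prior(descriptions):
    """Toy simplicity-weighted prior: weight 2^-length, then normalize."""
    raw = {d: 2.0 ** -len(d) for d in descriptions}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}

# Hypothetical policy descriptions; shorter strings stand for simpler policies.
policies = ["0", "10", "110", "111"]
mu = simplicity_prior(policies)

# Bits needed to single out each policy if it were the unique best action.
bits = {p: -math.log2(mu[p]) for p in policies}
```

Under this toy prior, picking out the one-bit policy costs one bit, while picking out either three-bit policy costs three, which matches the intuition that more complex policies require more optimization power.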

Compatibility with strategy-stealing

Now, given such a definition of optimization power, I think we can give a nice definition of what it would mean for an AI system/training procedure to be compatible with the strategy-stealing assumption. Intuitively, we will say that an AI system/training procedure $F$ which maps utility functions onto models is compatible with strategy-stealing if $OP(U, F(U))$ doesn’t vary much over some set of utility functions $S$; that is, if $F$ isn’t better at optimizing for (or producing models which optimize for) some objectives in $S$ than others. We can make this definition more precise for a set of utility functions $S$ if we ask for $\mathrm{Var}_{U \in S}\left[OP(U, F(U))\right]$ to be small.[4] This definition is very similar to my definition of value-neutrality, as they are both essentially pointing at the same concept. What’s nice about using $OP$ here, though, is that it lets us compare very difficult-to-satisfy utility functions with much easier-to-satisfy ones on equal footing, as we’re just asking for $F$ to produce actions which always score in the top whatever percent, which should be equally easy to achieve regardless of how inherently difficult $U$ is to satisfy.[5]
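One way to make "doesn't vary much" concrete is to compute the optimization power achieved for each utility function in the set and then summarize the spread. The sketch below reports both the population variance and the range. The choice of these two summaries, the function name `op_spread`, and the example numbers are assumptions for illustration and are not measurements of any real system.

```python
import statistics

def op_spread(op_by_utility):
    """Summarize how much optimization power varies across utility functions.

    op_by_utility: dict mapping a label for each utility function U in the set
                   to the bits of optimization power achieved on U.
    """
    values = list(op_by_utility.values())
    return {
        "variance": statistics.pvariance(values),
        "range": max(values) - min(values),
    }

# Hypothetical numbers: a system that optimizes an easy-to-measure objective
# much harder than a hard-to-measure one.
ops = {"easy_to_measure": 12.0, "hard_to_measure": 4.0}
spread = op_spread(ops)
```

A system that is compatible with strategy-stealing in this sense would show both numbers close to zero, while the hypothetical system above shows a large gap between the two objectives.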

Notably, this definition of compatibility with strategy-stealing is somewhat different than others’ notions in that it is about a property of a single AI system/training procedure rather than a property of a deployment scenario or a collection of AI systems.[6] As a result, however, I think this notion of compatibility with strategy-stealing is much more meaningful in more homogeneous and/or unipolar takeoff scenarios.

In particular, consider a situation in which we manage to build a relatively powerful and relatively aligned AI system such that you can tell it what you want and it will try to do that. However, suppose it’s not compatible with strategy-stealing, so that it’s much better at achieving some of your values than others (because, for example, it’s much better at achieving the easy-to-measure ones, as our current training procedures are). Such a world could easily end up going quite poorly simply because the easy-to-measure values end up getting all of the resources at the expense of the hard-to-measure ones.

As a specific example, suppose we manage to build such a system and align it with Sundar Pichai. Sundar, as Google’s CEO, wants Google to make money, but presumably he also cares about lots of other things, like wanting other people to be happy, him and his family to be safe, the world to be in a generally good spot, and so on. Now, even if we manage to build an AI system which is aligned with Sundar in the sense that it tries to do whatever Sundar tells it to do, if it’s much better at satisfying some of Sundar’s values than others (much better at making Google money than putting the world in a generally good spot, for example), then I think that could end quite poorly. Sundar might choose to give this AI lots of flexible power and influence so that it can make Google a bunch of money, for example, and not realize that doing so will cause his other values to lose out in the long run. Or Sundar could be effectively forced to do so because Google needs to compete with other companies using similar systems.

In such a situation, compatibility with strategy-stealing is a pretty important desideratum and also one which is relatively independent of (at least a naive version of) intent alignment. The Sundar example also suggests what $S$ should be, which is the set of possible values that we might actually want our AI systems to pursue. As long as we produce systems that are compatible with strategy-stealing under such an $S$, that should ensure that scenarios like the above don’t happen.

  1. Note that I’m taking some liberties in making Eliezer’s definition somewhat more formal. ↩︎

  2. It is worth noting that there is a possibility for the smallest fraction to be undefined here if $U$ is flat past some point (if all actions past the top 0.1% perform equally according to $U$, for example). ↩︎

  3. If we want to be precise, we can define the quantilizer being “at least as good” as $M$ according to $U$ to mean that the quantilizer gets at least as much reward in expectation as $M$ on some given POMDP with reward function $U$. Alternatively, if we’re okay with a more intuitive definition, we can just say that the quantilizer is at least as good as $M$ if a $U$-maximizer would choose to instantiate the quantilizer over $M$. ↩︎

  4. We can also let $S$ be a distribution instead of a set. ↩︎

  5. There is a difficulty here if some of the $U \in S$ are flat over large regions of the space such that actions look much more optimized than they actually are, such as the case that was mentioned previously in which $U$ is flat past some point such that $q^*$ is undefined past that point. This sort of difficulty can be ruled out, however, if we have the condition that . ↩︎

  6. I am currently mentoring Noa Nabeshima in writing up a better summary/analysis of these different notions of strategy-stealing which should hopefully help resolve some of these confusions. ↩︎