Power-seeking for successive choices

From single choice to successive choices

Alex Turner (and collaborators) have done great work deconfusing instrumental convergence for a single choice. As a quick summary (read the paper and sequence if you want more details), they show that for any distribution over reward functions, if there are more “options” available after action 1 than after action 2, then most of the orbit of the distribution (the set of distributions induced by applying any permutation on the MDP, which thus permutes the initial distribution) has optimal policies that take action 1.

(A subtlety that matters is the meaning of optimal here: the result applies to all distributions only for average-step reward, which never cares about transient rewards and only values what happens in the loops. That being said, the distributions for which this doesn’t generalize to Blackwell optimality form a set of measure zero, where Blackwell optimality is the “classic” RL notion of optimality with discount factor close to 1, which thus takes transient rewards into account.)

So it’s not so much telling us something about a specific distribution over reward functions (which may incentivize the action with fewer options) but about its orbit. It’s giving teeth to the fragility of value with regard to instrumental convergence: we have to hit a target in the orbit that represents less than half of it.

Alex’s recent combinatorial generalizations considerably strengthen this point: if action 1 allows n times more options than action 2, at least n/(n+1) of the orbit has optimal policies taking action 1.

For big enough n, the target to hit for no instrumental convergence shrinks to a vanishingly small fraction of the orbit.
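
To make the scaling concrete, here is a minimal sketch (my own illustration, not code from the paper or the sequence) of this n/(n+1) lower bound; the function name is mine.

```python
# Minimal sketch of the combinatorial lower bound: if the power-seeking action
# offers n times more options than the alternative, at least n/(n+1) of the
# orbit has optimal policies taking it.

def ps_orbit_lower_bound(n: float) -> float:
    """Lower bound n / (n + 1) on the power-seeking fraction of the orbit."""
    return n / (n + 1)

if __name__ == "__main__":
    for n in [1, 2, 10, 100, 1000]:
        print(f"n = {n:>4}: at least {ps_orbit_lower_bound(n):.4f} of the orbit is power-seeking")
    # The non-power-seeking "target" is at most 1/(n+1), which vanishes as n grows.
```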

In this post, I adapt these results to the case of successive choices: what is the proportion of the orbit for which it is optimal to take the first instrumentally convergent choices? Beyond symmetry, additional information is required, in the form of the intersection of the sets of options for successive non-instrumentally-convergent choices.

(Note: technically, power-seeking and instrumental convergence are slightly different, but I’ll use “power-seeking”, the more common term in this context, for simplicity.)

Thanks to Alex Turner for answering my many questions, pointing out a problem in a previous draft, and for all the work that gave me the tools for this little contribution.

Two examples

Can we directly apply the power-seeking theorems for every single choice, and then combine them somehow to get power-seeking results for successive choices? Phrased differently, are the ratios for each single choice sufficient statistics to compute the ratios for successive choices?

In some cases, we need more. To see that, let’s start with a simple example.

Green nodes are nodes with a power-seeking choice, and red nodes are non-power-seeking options

This is the easy case, where no additional information is needed. Applying the combinatorial results to the first choice gives at least 3/4 of the orbit with optimal policies (for average-step reward; see the Subtleties section for the case of Blackwell optimality) going right, and applying them to the second choice (after going right initially) gives at least 2/3 of the orbit with optimal policies going right.

Note that this second number would be the same if the MDP started at this second choice directly: only the “possible futures” matter here. This tells us that, of the at least 3/4 of the orbit that goes right at the first choice, at least 2/3 then goes right again. Hence the successive ratio is simply the product of the single-choice ratios (3/4 × 2/3 = 1/2); and this holds for as many successive power-seeking choices as we want in this context.

Notably, this means that there are at least as many orbit elements taking the two power-seeking choices as there are orbit elements that don’t (taking either none or just the first one). And that despite the ratio of 2/3 for the second power-seeking choice!
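
Here is a small sketch of the bookkeeping in this easy case, assuming (as in the figure) that the non-power-seeking options of the two choices are disjoint, so that the single-choice ratios simply multiply; the function name is mine and the ratios are the ones from the example.

```python
from math import prod

def successive_ps_bound(single_choice_ratios: list[float]) -> float:
    """Lower bound on the fraction of the orbit taking every power-seeking choice,
    assuming the non-power-seeking options of the successive choices are disjoint."""
    return prod(single_choice_ratios)

# First example: 3/4 at the first choice, 2/3 at the second.
print(successive_ps_bound([3/4, 2/3]))  # 0.5: at least half the orbit takes both
```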

Let’s see with another example why this is not always how it works.

Green nodes are nodes with a power-seeking choice, and red nodes are non-power-seeking options;
the difference with the previous graph is that the non-power-seeking option is shared.

There is a single difference with the other graph: the two non-power-seeking options both lead to the same option that I shall call “death”, because a death state is a good example of a single state that can be reached in numerous ways.

First impact of this change: the first choice now has a ratio of 1, because the death state is an available option for both actions. Since average-step optimality only cares about the final loop, optimal policies that want to reach the death state can do so by any means. Hence no orbit element forbids its optimal policies from going right.

What happens at the next power-seeking choice? Here, death is only accessible on the left; we thus just apply the normal power-seeking theorems to get at least 2/3 of the orbit arriving at this choice going right.

We get the successive ratio by a product, just like in the previous case (1 × 2/3 = 2/3). But notice that despite having two choices, we only felt the impact of one. And despite the number of options on each side being the same between the two examples, the ratio in the second case is 2/3 instead of 1/2.

Note that adding intermediary power-seeking choices with only the shared death state as the non-power-seeking option wouldn’t have changed the successive ratio.
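
Here is a rough sketch of the rule I’m using in this example (my own formalisation, to be checked by the proof mentioned in the conclusion): options reachable after both actions, like the shared death state, are dropped before applying the single-choice bound, so a choice whose non-power-seeking side only offers shared options contributes a factor of 1. The option labels are mine, and the a/(a+b) factor assumes the power-seeking side contains the required copies of the remaining non-power-seeking options, as in the combinatorial results.

```python
def choice_factor(ps_options: set[str], non_ps_options: set[str]) -> float:
    """Lower bound on the orbit fraction going the power-seeking way at one choice,
    after dropping the options reachable through both actions."""
    shared = ps_options & non_ps_options
    a = len(ps_options - shared)      # options exclusive to the power-seeking action
    b = len(non_ps_options - shared)  # options exclusive to the other action
    return 1.0 if b == 0 else a / (a + b)

# Second example: "death" is reachable after both actions of the first choice.
first = choice_factor({"death", "A", "B"}, {"death"})  # -> 1.0
second = choice_factor({"A", "B"}, {"death"})          # -> 2/3
print(first * second)  # successive ratio of 2/3 instead of 1/2
```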

Subtleties

Of course, the situation is not always as simple as those two extremes. Here are some additional subtleties to consider:

  • What if there are both shared options like the death state and disjoint options? Then at the choices where the shared options are available after both the power-seeking and the non-power-seeking action, the theorems should be applied to that single choice without these shared states. The ratio for successive power-seeking choices is then just the product. That’s pretty much what I did for the death state example, turning the first ratio into 1 (by removing the death state from the non-power-seeking options) and taking the product.

  • What if there are intermediary choices with no power-seeking option? Maybe I’m wrong, but that seems relatively simple, as I see only two cases (see the sketch after this list):

    • the intermediary choice is irrelevant (both actions lead to the same states): we multiply by 1

    • the intermediary choice splits the options in two: we multiply by 1/2 (which is actually just an application of the combinatorial results)

  • What if there are more than two actions per choice? This is a subtlety in computing the single-choice ratio, and I don’t see any additional issue appearing in the successive choices case.

  • What if we care about Blackwell optimality instead? The first problem is that the combinatorial results don’t apply to all distributions for Blackwell optimality, and AFAIK we don’t have a criterion for finding which distributions they fail on (though we know some things, like that such distributions must be discrete). Then, even if we can apply the combinatorial results, some rewards will incentivize taking one path to the shared state over another, which breaks the nice analysis above.
    Maybe these two problems cancel each other out: maybe the distributions for which the combinatorial results don’t apply are the ones that break the argument above (or at least contain the latter).
    But for the moment, this needs further work.
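
Putting these bullets together, here is a speculative end-to-end sketch of the recipe they suggest (the exact conditions still need the clean proof mentioned in the conclusion); the choice kinds, the names, and the 1/2 factor for an even split are my reading of the cases above.

```python
from math import prod

def path_factor(kind: str, a: int = 0, b: int = 0) -> float:
    """Factor contributed by one choice on the path to the last power-seeking choice.

    kind:
      'power_seeking' -- a options exclusive to the power-seeking action, b to the other
      'irrelevant'    -- both actions lead to the same states
      'split'         -- no power-seeking option, the options are split evenly in two
    """
    if kind == "irrelevant":
        return 1.0
    if kind == "split":
        return 0.5
    # Power-seeking choice, shared options already removed: combinatorial bound a/(a+b).
    return 1.0 if b == 0 else a / (a + b)

def successive_bound(choices: list[tuple]) -> float:
    return prod(path_factor(*c) for c in choices)

# Second example with an even non-power-seeking split inserted in between:
print(successive_bound([("power_seeking", 2, 0),    # death shared -> factor 1
                        ("split",),                 # even split   -> factor 1/2
                        ("power_seeking", 2, 1)]))  # death on one side only -> 2/3
# -> 1/3
```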

Conclusion: Disjointness in addition to symmetry

If what I’ve been writing is correct (which can only really be checked by a clean math proof that I’ll do at some point), then we need an additional piece of graphical information, beyond symmetry, to go from the single-choice power-seeking analysis to the successive-choices case: whether the non-power-seeking options are disjoint across successive choices.

I’m quite excited about this, because even if the condition becomes slightly more involved, it still relies on a graphical and intuitive criterion.