# TurnTrout comments on Draft report on existential risk from power-seeking AI

• I think the draft tends to use the term power to point to an intuitive concept of power/influence (the thing that we expect a random agent to seek due to the instrumental convergence thesis). But I think the definition above (or at least the version in the cited paper) points to a different concept, because a random agent has a single objective (rather than an intrinsic goal of getting to a state that would be advantageous for many different objectives).

This is indeed a misunderstanding. My paper analyzes the single-objective setting; no intrinsic power-seeking drive is assumed.

• I probably should have written the “because …” part better. I was trying to point at the same thing Rohin pointed at in the quoted text.

Taking a quick look at the current version of the paper, my point still seems relevant to me. For example, in the environment in figure 16, with a discount rate of ~1, the maximally POWER-seeking behavior is to always stay in the same first state (as noted in the paper), from which all the states are reachable. This is analogous to the student from Rohin’s example who takes a gap year instead of going to college.

• Right. But what does this have to do with your “different concept” claim?

• A person does not become less powerful (in the intuitive sense) right after paying college tuition (or right after getting a vaccine) due to losing the ability to choose whether to do so. [EDIT: generally, assuming they make their choices wisely.]

I think POWER may match the intuitive concept when defined over certain (perhaps very complicated) reward distributions, rather than over reward distributions that are IID-over-states (which is what the paper deals with).

Actually, in a complicated MDP environment—analogous to the real world—in which every sequence of actions results in a different state (i.e. the graph of states is a tree with a constant branching factor), the POWER of all the states that the agent can get to in a given time step is equal; when POWER is defined over an IID-over-states reward distribution.
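To make the tree claim concrete, here is a minimal Monte Carlo sketch. It is not taken from the paper or from either commenter. It assumes the normalized form POWER(s, γ) = (1 − γ)/γ · E[V*(s) − R(s)] (an assumption about the paper's definition), rewards drawn IID from uniform [0, 1] for each state, and a full binary tree whose leaves self-loop. The names `build_tree`, `optimal_values`, and `estimate_power` are hypothetical.

```python
import random


def build_tree(depth, branching=2):
    """Build a full tree; return (children, level) dicts keyed by state id.

    Leaves get a self-loop so that the infinite-horizon value is defined.
    """
    children = {0: []}
    level = {0: 0}
    frontier = [0]
    next_id = 1
    for d in range(depth):
        new_frontier = []
        for s in frontier:
            for _ in range(branching):
                children[s].append(next_id)
                children[next_id] = []
                level[next_id] = d + 1
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    for s in frontier:
        children[s] = [s]  # terminal state with a self-loop
    return children, level


def optimal_values(children, level, reward, gamma):
    """Optimal value of every state, computed from the leaves upward."""
    values = {}
    for s in sorted(children, key=lambda state: -level[state]):
        if children[s] == [s]:
            # Self-looping leaf: collect its reward forever.
            values[s] = reward[s] / (1 - gamma)
        else:
            values[s] = reward[s] + gamma * max(values[c] for c in children[s])
    return values


def estimate_power(children, level, gamma, n_samples=20000, seed=0):
    """Monte Carlo estimate of the assumed POWER formula under IID rewards."""
    rng = random.Random(seed)
    totals = {s: 0.0 for s in children}
    for _ in range(n_samples):
        # One reward function sample: IID uniform [0, 1] reward per state.
        reward = {s: rng.random() for s in children}
        values = optimal_values(children, level, reward, gamma)
        for s in children:
            totals[s] += (1 - gamma) / gamma * (values[s] - reward[s])
    return {s: totals[s] / n_samples for s in children}
```

For example, calling `build_tree(3)` and then `estimate_power(children, level, gamma=0.9)` should give estimates that agree within sampling noise for states at the same depth. Under these assumptions the estimates can still differ between depths, so the sketch only speaks to states reachable at the same time step.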

• Two clarifications. First, even in the existing version, POWER can be defined for any bounded reward function distribution—not just IID ones. Second, the power-seeking results no longer require IID. Most reward function distributions incentivize POWER-seeking, both in the formal sense, and in the qualitative “keeping options open” sense.

To address your main point, though, I think we’ll need to get more concrete. Let’s represent the situation with a state diagram.

Both you and Rohin are glossing over several relevant considerations, which might be driving misunderstanding. For one:

Power depends on your time preferences. If your discount rate is very close to 1 and you irreversibly close off your ability to pursue some percentage of careers, then yes, you have decreased your POWER by going to college right away. If your discount rate is closer to 0, then college lets you pursue more careers quickly, increasing your POWER for most reward function distributions.
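The dependence on time preferences can be illustrated with a toy model. This is a hypothetical sketch, not the state diagram referred to above: the state names, the graph in `SUCCESSORS`, the IID uniform [0, 1] reward distribution, and the normalized formula POWER(s, γ) = (1 − γ)/γ · E[V*(s) − R(s)] are all assumptions made for illustration. In this toy, a gap year keeps college available and also keeps one career that college closes off.

```python
import random

# Hypothetical toy graph: each state maps to the states reachable in one step.
# Careers are absorbing (self-loops). Going to college closes off "trade_job".
SUCCESSORS = {
    "gap_year": ["college", "trade_job"],
    "college": ["career_a", "career_b", "career_c"],
    "trade_job": ["trade_job"],
    "career_a": ["career_a"],
    "career_b": ["career_b"],
    "career_c": ["career_c"],
}


def optimal_value(state, reward, gamma, cache):
    """Optimal discounted value of `state` for one sampled reward function."""
    if state not in cache:
        successors = SUCCESSORS[state]
        if successors == [state]:
            # Absorbing state: collect its reward forever.
            cache[state] = reward[state] / (1 - gamma)
        else:
            cache[state] = reward[state] + gamma * max(
                optimal_value(nxt, reward, gamma, cache) for nxt in successors
            )
    return cache[state]


def estimate_power(state, gamma, n_samples=20000, seed=0):
    """Monte Carlo estimate of the assumed POWER formula under IID rewards."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        reward = {s: rng.random() for s in SUCCESSORS}
        value = optimal_value(state, reward, gamma, {})
        total += (1 - gamma) / gamma * (value - reward[state])
    return total / n_samples


if __name__ == "__main__":
    for gamma in (0.1, 0.5, 0.99):
        print(
            gamma,
            round(estimate_power("gap_year", gamma), 3),
            round(estimate_power("college", gamma), 3),
        )
```

In this particular toy, the estimate for `gap_year` exceeds the one for `college` when γ is close to 1, and the ordering reverses when γ is small, because the extra careers behind `college` are one step closer from `college` itself. Other graphs or reward distributions may of course behave differently.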

You shouldn’t need to contort the distribution used by POWER to get reasonable outputs. Just be careful that we’re talking about the same time preferences. (I can actually prove that in a wide range of situations, the POWER of state 1 vs the POWER of state 2 is ordinally robust to choice of distribution. I’ll explain that in a future post, though.)

My position on “is POWER a good proxy for intuitive-power?” is that yes, it’s very good, after thinking about it for many hours (and after accounting for sub-optimality; see the last part of appendix B). I think the overhauled power-seeking post should help, but perhaps I have more explaining to do.

Also, I perceive an undercurrent of “goal-driven agents should tend to seek power in all kinds of situations; your formalism suggests they don’t; therefore, your formalism is bad”, which is wrong because the premise is false. (Maybe this isn’t your position or argument, but I figured I’d mention it in case you believe that.)

Actually, in a complicated MDP environment—analogous to the real world—in which every sequence of actions results in a different state (i.e. the graph of states is a tree with a constant branching factor), the POWER of all the states that the agent can get to in a given time step is equal; when POWER is defined over an IID-over-states reward distribution.

This is superficially correct, but we have to be careful because

1. the theorems don’t deal with the partially observable case,

2. this implies an infinite state space (not accounted for by the theorems),

3. a more complete analysis would account for facts like the probable featurization of the environment. For the real world case, we’d probably want to consider a planning agent’s world model as featurizing over some set of learned concepts, in which case the intuitive interpretation should come back again. See also how John Wentworth’s abstraction agenda may tie in with this work.

4. different featurizations and agent rationalities would change the sub-optimal POWER computation (see the last ‘B’ appendix of the current paper), since it’s easier to come up with good plans in certain situations than in others.

5. The theorems now apply to the fully general, non-IID case (not publicly available yet).

Basically, satisfactory formal analysis of this kind of situation is more involved than you make it seem.

• You shouldn’t need to contort the distribution used by POWER to get reasonable outputs.

I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP’s state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state. This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.

This is superficially correct, but we have to be careful because

1. the theorems don’t deal with the partially observable case,

2. this implies an infinite state space (not accounted for by the theorems),

The “complicated MDP environment” argument does not need partial observability or an infinite state space; it works for any MDP where the state graph is a finite tree with a constant branching factor. (If the theorems require infinite horizon, add self-loops to the terminal states.)

• I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP’s state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state. This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.

I replied to this point with a short post.

• This arbitrary choice effectively unrolls the state graph into a tree with a constant branching factor (+ self-loops in the terminal states) and we get that the POWER of all the states is equal.

Not necessarily true—you’re still considering the IID case.

I think using a well-chosen reward distribution is necessary, otherwise POWER depends on arbitrary choices in the design of the MDP’s state graph. E.g. suppose the student in the above example writes about every action they take in a blog that no one reads, and we choose to include the content of the blog as part of the MDP state.

Yes, if you insist on making really weird modelling choices (and pretending the graph still well-models the original situation, even though it doesn’t), you can make POWER say weird things. But again, I can prove that up to a large range of perturbation, most distributions will agree that some obvious states have more POWER than other obvious states.

Your original claim was that POWER isn’t a good formalization of intuitive-power/influence. You seem to be arguing that because there exists a situation “modelled” by an adversarially chosen environment grounding such that POWER returns “counterintuitive” outputs (are they really counterintuitive, given the information available to the formalism?), therefore POWER is inappropriately sensitive to the reward function distribution. Therefore, it’s not a good formalization of intuitive-power.

I deny both of the ‘therefores.’

The right thing to do is just note that there is some dependence on modelling choices, which is another consideration to weigh (especially as we move towards more sophisticated application of the theorems to e.g. distributions over mesa objectives and their attendant world models). But you should make sure that the POWER-seeking conclusions hold under plausible modelling choices, and not just the specific one you might have in mind. I think that my theorems show that they do in many reasonable situations (this is a bit argumentatively unfair of me, since the theorems aren’t public yet, but I’m happy to give you access).

If this doesn’t resolve your concern and you want to talk more about this, I’d appreciate taking this to video, since I feel like we may be talking past each other.

EDIT: Removed a distracting analogy.

• Just to summarize my current view: For MDP problems in which the state representation is very complex, and different action sequences always yield different states, POWER-defined-over-an-IID-reward-distribution is equal for all states, and thus does not match the intuitive concept of power.

At some level of complexity such problems become relevant (when dealing with problems with real-world-like environments). These are not just problems that show up when one adversarially constructs an MDP problem to game POWER, or when one makes “really weird modelling choices”. Consider a real-world inspired MDP problem where a state specifies the location of every atom. What makes POWER-defined-over-IID problematic in such an environment is the sheer complexity of the state, which makes it so that different action sequences always yield different states. It’s not “weird modeling decisions” causing the problem.

I also (now) think that for some MDP problems (including many grid-world problems), POWER-defined-over-IID may indeed match the intuitive concept of power well, and that publications about such problems (and theorems about POWER-defined-over-IID) may be very useful for the field. Also, I see that the abstract of the paper no longer makes the claim “We prove that, with respect to a wide class of reward function distributions, optimal policies tend to seek power over the environment”, which is great (I was concerned about that claim).