I’m the author.
Colloquially, they’re more of the flavor “for a given optimizing-process, training it on most utility functions will cause the agent to take actions which give it access to a wide range of states”.
This refers to the fact that most utility functions are retargetable. But the most important part of the power-seeking theorems is the actual power-seeking, which is proven in the appendix of Parametrically Retargetable Decision-Makers Tend To Seek Power, so I don’t agree with your summary.
[...] the definition you give of “power” as expected utility of optimal behavior is not the same as that used in the power-seeking theorems. [...]
Critically, this is a statement about behavior of different agents trained with respect to different utility functions, then averaged over all possible utility functions.
There is no averaging over utility functions happening; the averaging is over reward functions. From Parametrically Retargetable Decision-Makers Tend To Seek Power: “a trained policy π seeks power when π’s actions navigate to states with high average optimal value (with the average taken over a wide range of reward functions)”. This matches what I wrote in the article.
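For reference, the corresponding formal notion (roughly, as I recall it from Turner’s earlier Optimal Policies Tend To Seek Power; the exact normalization doesn’t matter here) is that the power of a state is proportional to its average optimal value over a distribution $\mathcal{D}$ of reward functions:

$$\mathrm{POWER}_{\mathcal{D}}(s) \;\propto\; \mathbb{E}_{R \sim \mathcal{D}}\!\left[V^{*}_{R}(s)\right]$$

and a policy seeks power when its actions navigate to states with high POWER in this sense.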
I do agree that utility functions are missing from the post, but they aren’t averaged over. They relate to the decision-making of the agent, and thus to the condition of retargetability that the theorems require.
For me, utility functions are about decision-making (e.g. utility maximization), while the reward functions are the theta, i.e. the input to our decision-making, which we retarget over; we can only do so for retargetable utility functions.
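A minimal sketch of how I think about this split (an entirely hypothetical toy, the names are mine and not the paper’s): the decision rule plays the role of the utility maximization, and theta is the reward assignment it is pointed at; permuting theta is what retargets the decision.

```python
# Hypothetical toy example, not taken from the paper.
# theta: the retargetable parameter, here a reward assigned to each outcome.
theta = {"stay_put": 1.0, "gain_options": 0.0}

def decide(theta):
    """The decision-making ('utility maximization') side: pick the outcome
    with the highest reward under the given theta."""
    return max(theta, key=theta.get)

print(decide(theta))  # -> 'stay_put'

# Retargeting: permute theta by swapping the rewards of the two outcomes.
swapped = {"stay_put": theta["gain_options"], "gain_options": theta["stay_put"]}
print(decide(swapped))  # -> 'gain_options'
```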
I think the theta is not a property of the agent, but of the training procedure. Actually, Parametrically Retargetable Decision-Makers Tend To Seek Power is not about trained agents in the first place, so I’d say we were never talking about different agents to begin with.
I agree with this if we constrain ourselves to Turner’s work.
V. Krakovna’s work still depends on the option-variegation, but there we’re not picking random reward functions, which is a nice improvement.
Does the proof really depend on whether the reward function scales with the number of possible states? It seems to me that you just need some rewards from the reward function that the agent has not seen during training, so that we can retarget by swapping those rewards. For example, if our reward function is a CNN, we just need images which haven’t been seen during training, which I don’t think is a strong assumption, since we usually don’t train over all possible combinations of pixels. Do you agree with this?
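A toy version of what I mean (all of this is a hypothetical sketch; a random linear map stands in for the CNN reward model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a CNN reward model: some fixed function from pixels to a scalar reward.
w = rng.normal(size=64)
def reward(image):
    return float(w @ image)

# Two outcomes, each represented by an image the reward model was never trained on.
outcome_images = {
    "stay_put": rng.normal(size=64),
    "gain_options": rng.normal(size=64),
}

def decide(outcome_images):
    """Greedy decision rule: pick the outcome whose image gets the higher reward."""
    return max(outcome_images, key=lambda name: reward(outcome_images[name]))

# Swapping which unseen image represents which outcome retargets the greedy choice,
# without the reward function having to cover all possible images.
swapped = {"stay_put": outcome_images["gain_options"],
           "gain_options": outcome_images["stay_put"]}
print(decide(outcome_images), decide(swapped))
```

The only thing the swap argument seems to need is that the two unseen images get different rewards, not that the reward function scales with the number of possible states.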
If you have concrete suggestions that you’d like me to change, you can click the edit button on the article and leave a comment on the underlying Google Doc; I’d appreciate it :)
Maybe it’s also useless to discuss this...