I find myself agreeing with the idea that an agent unaware of its task will seek power, but also conclude that an agent aware of its task will give up power.
I think this is a slight misunderstanding of the theory in the paper. I’d translate the theory of the paper to English as:
If we do not know an agent’s goal, but we know that the agent knows its goal and is optimal w.r.t. it, then from our perspective the agent is more likely to go to higher-power states. (From the agent’s perspective, there is no probability; it always executes the deterministic perfect policy for its reward function.)
Any time the paper talks about “distributions” over reward functions, it’s talking from our perspective. The way the theory does this is by saying that first a reward function is drawn from the distribution, then it is given to the agent, then the agent thinks really hard, and then the agent executes the optimal policy. All of the theoretical analysis in the paper is done “before” the reward function is drawn, but there is no step in which the agent is optimizing without knowing its reward.
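To make that ordering concrete, here is a minimal sketch of the two-stage process as I understand it (my own toy illustration, not code from the paper, assuming an i.i.d. uniform reward distribution over a handful of terminal states):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 4        # hypothetical terminal states the agent could end in
n_draws = 100_000   # how many times "we" imagine the reward being drawn

ends_in = np.zeros(n_states)
for _ in range(n_draws):
    # Step 1: a reward function is drawn from the distribution (our uncertainty).
    reward = rng.uniform(size=n_states)
    # Step 2: the agent is handed this reward and acts optimally w.r.t. it.
    # There is never a step where it optimizes without knowing its reward.
    ends_in[np.argmax(reward)] += 1

# "From our perspective", before the draw, this is the probability that the
# agent ends up in each state; each individual agent's choice is deterministic.
print(ends_in / n_draws)
```

All of the randomness is in the outer loop, i.e. in our prior over which reward the agent was handed.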
In your paper, theorem 19 suggests that, given a choice between two sets of 1-cycles C1 and C2, the agent is more likely to select the larger set.
I’d rewrite this as:
Theorem 19 suggests that, if an agent that knows its reward is about to choose between C1 and C2, but we don’t know the reward and our prior is that it is uniformly distributed, then we will assign higher probability to the agent going to the larger set.
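As a toy instance of that reading (again my own sketch, assuming an i.i.d. uniform prior over the terminal-state rewards rather than the exact distribution in the theorem): if C1 has 1 terminal state and C2 has 3, then by symmetry the reward maximum is equally likely to sit on any of the 4 states, so we assign probability 3/4 to the agent going to C2.

```python
import numpy as np

rng = np.random.default_rng(1)

size_c1, size_c2 = 1, 3   # hypothetical sizes of the two sets of 1-cycles
rewards = rng.uniform(size=(100_000, size_c1 + size_c2))

# Each row is one reward function the agent might have been given; the agent
# that got that reward deterministically goes to its best terminal state.
best = rewards.argmax(axis=1)
print((best >= size_c1).mean())   # ~ |C2| / (|C1| + |C2|) = 0.75
```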
I do not see how the agent ‘seeks’ out powerful states because, as you say, the agent is fixed.
I do think this is mostly a matter of translation of math to English being hard. Like, when Alex says “optimal agents seek power”, I think you should translate it as “when we don’t know what goal an optimal agent has, we should assign higher probability that it will go to states that have higher power”, even though the agent itself is not thinking “ah, this state is powerful, I’ll go there”.