Here’s my explanation of what’s going on with that last theorem:
Consider some state s in a deterministic finite MDP with a perfectly optimal agent, where the reward for each state is sampled uniformly and iid from the interval [0, 1]. We can “divide up” POWER(s) into contributions from all of the possibilities that are optimal for at least one reward function, with each contribution weighted by the optimality measure of that possibility. (This is why the POWER contribution depends on the optimality measure.) The paper proves that if one set of paths contributes 2K times as much power as another set, the first set must be at least K times as likely to be optimal.
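Here’s a toy Monte Carlo sketch of that bound (my own example, not from the paper): three terminal “possibilities” with iid Uniform[0, 1] rewards, split into a hypothetical set A = {path 0} and set B = {paths 1, 2}. Since each conditional expected reward lies in [1/2, 1], the power-contribution ratio can exceed the probability ratio by at most a factor of 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# Three terminal possibilities; each reward sampled iid Uniform[0, 1].
rewards = rng.random((n_samples, 3))

# Set A = {path 0}, set B = {paths 1, 2} (a hypothetical split).
opt_A = rewards.argmax(axis=1) == 0
opt_B = ~opt_A

# Optimality measure: how often each set contains the optimal path.
p_A = opt_A.mean()
p_B = opt_B.mean()

# Power contribution: expected optimal value restricted to each set.
power_A = (rewards[:, 0] * opt_A).mean()
power_B = (rewards.max(axis=1) * opt_B).mean()

# Theorem's bound: if B contributes 2K times A's power, B must be
# at least K times as likely, i.e. p_B / p_A >= (power_B / power_A) / 2.
print(p_B / p_A, power_B / power_A)
```

By symmetry both ratios come out near 2 here, and the bound (probability ratio at least half the power ratio) holds with room to spare.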
I was initially confused about why this notion of power doesn’t correspond directly to instrumental convergence, but instead only puts a bound on it. The reason is that expected reward can vary across possibilities. In particular, if you have two non-dominated possibilities f1 and f2, and you choose a random reward function r1 (respectively, r2) for which f1 (respectively, f2) is optimal, then the expected reward of f1 under r1 can differ from the expected reward of f2 under r2. This shifts the balance of power between them without changing the relative probability of each possibility being optimal.
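To see this concretely, here’s a hypothetical sketch (my construction, not the paper’s) where two possibilities are optimal equally often yet contribute unequal power: f1 reaches a single state with reward a, while f2 averages two states’ rewards, (b + c) / 2. Both values have mean 1/2 and a symmetric distribution, so each is optimal half the time, but their conditional expected rewards differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical possibilities: f1 visits one state (reward a),
# f2 alternates between two states (average reward (b + c) / 2).
a = rng.random(n)
m = (rng.random(n) + rng.random(n)) / 2

f1_opt = a > m                      # draws where f1 is optimal
p1, p2 = f1_opt.mean(), (~f1_opt).mean()

# Expected reward conditional on each possibility being optimal.
v1 = a[f1_opt].mean()
v2 = m[~f1_opt].mean()

print(p1, p2)   # both near 0.5: equally likely to be optimal
print(v1, v2)   # unequal conditional values -> unequal power shares
```

The optimality probabilities match, but a’s heavier tails mean f1’s conditional value (about 17/24) exceeds f2’s (about 7/12), so f1 contributes more power despite identical instrumental convergence.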