I added the following to the beginning:

Edit, 5/16/23: I think this post is beautiful, correct in its narrow technical claims, and practically irrelevant to alignment. This post presents a cripplingly unrealistic picture of the role of reward functions in reinforcement learning. I expect this post to harm your alignment research intuitions unless you've already inoculated yourself by deeply internalizing and understanding "Reward is not the optimization target." If you're going to read one alignment post I've written, read that one.

I'm now going to add this warning to other relevant posts in this sequence.