TurnTrout comments on Towards deconfusing wireheading and reward maximization

TurnTrout 22 Sep 2022 21:44 UTC
LW: 4 AF: 3
Before reading, here are my reactions to the main claims:
- Not only does RL not, by default, produce policies which have reward maximization as their behavioral objective, but in fact I argue that it is not possible for RL policies to care about “reward” in an embedded setting. (agreed under certain definitions of “reward”)
- I argue that this does not imply that wireheading in RL agents is impossible, because wireheading does not mean “the policy has reward as its objective”. It is still possible for an RL agent to wirehead, and in fact, is a high probability outcome under certain circumstances. (unsure, depends on previous definition)
- This bears a clean analogy to the human case, viewing humans as RL agents; there is no special mechanism going on in humans that is needed to explain why humans care about things in the world. (agreed insofar as I understand the claim)
(I ended up agreeing with these points by the end of the post!)
There even exists a policy that cares about that particular register at that particular physical location in the world even if it is not connected to the RL algorithm at all, but that policy does not get selected for by the RL algorithm.
Why not? Selection only takes place during training: the RL algorithm can only select across policies while it is running. A policy which wireheads and then disconnects the RL algorithm is selection-equivalent to a policy which wireheads and then doesn’t disconnect the RL algorithm (mod the cognitive updates provided by reward in the latter case, but that seems orthogonal).
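To make the selection-equivalence point concrete, here is a toy sketch in Python. Everything in it is a hypothetical illustration rather than anything from the post: the two policies, the `set_register` action, and the `selection_signal` helper are invented names, and "selection" is reduced to the list of rewards the algorithm observes while it is still connected. The two policies behave identically while the algorithm is running and differ only afterwards, so the signal available to the algorithm is the same for both.

```python
from typing import Callable, List

# Hypothetical policy type: maps (step, algorithm_connected) to an action name.
Policy = Callable[[int, bool], str]


def cares_only_while_connected(step: int, connected: bool) -> str:
    """Sets the register only while the RL algorithm is still attached."""
    return "set_register" if connected else "idle"


def cares_regardless(step: int, connected: bool) -> str:
    """Sets the register whether or not the RL algorithm is attached."""
    return "set_register"


def selection_signal(policy: Policy, connected_steps: int, total_steps: int) -> List[float]:
    """Rewards the algorithm observes. Steps after disconnection contribute nothing,
    because the algorithm is no longer running and cannot select on them."""
    signal = []
    for step in range(total_steps):
        connected = step < connected_steps
        action = policy(step, connected)
        if connected:
            signal.append(1.0 if action == "set_register" else 0.0)
    return signal


signal_a = selection_signal(cares_only_while_connected, connected_steps=3, total_steps=6)
signal_b = selection_signal(cares_regardless, connected_steps=3, total_steps=6)

# Same observed rewards, so the algorithm has no basis for preferring one policy.
print(signal_a == signal_b)
```

The sketch deliberately leaves out the cognitive updates that reward would keep providing if the algorithm stayed connected; it only shows that behaviour after disconnection is invisible to selection.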
This resolves the question of why humans don’t want to wirehead—because you identify with the rest of the brain and so “your” desires are the mesaobjectives.
Agreed!
In particular, one consequence of this is we also don’t need to postulate the existence of some kind of special as yet unknown algorithm that only exists in humans to be able to explain why humans end up caring about things in the world. Whether humans wirehead is determined by the same thing that determines whether RL agents wirehead.
Agreed. Shard theory isn’t postulating such an algorithm, if that’s what you meant to imply. In fact, the entire end of your post seems to be the historical reasoning which Quintin and I went through in developing shard theory 101.