To me, the tangent space stuff suggests that it’s closer in practice to “all the hypotheses are around at the beginning”—it doesn’t matter very much which order the evidence comes in. The loss is close to linear in parameter space over the region the model actually moves through, so the gradient contributed by a given piece of data doesn’t change much depending on the stage of training at which it’s introduced.
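A toy sketch of the order-independence claim (an assumed setup for illustration, not a real tangent-kernel experiment: a literally linear model stands in for the linearized regime). When the loss is linear in the parameters, SGD over the same data in two different presentation orders lands at essentially the same weights:

```python
import numpy as np

# Hypothetical setup: noiseless linear regression, y = X @ w_true exactly,
# so SGD can drive the loss to zero and the minimizer is unique.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true

def sgd(order_seed, lr=0.05, epochs=500):
    """Per-sample SGD, presenting the data in a seed-dependent order."""
    w = np.zeros(2)
    order_rng = np.random.default_rng(order_seed)
    for _ in range(epochs):
        for i in order_rng.permutation(len(X)):
            grad = (X[i] @ w - y[i]) * X[i]  # d/dw of 0.5 * (x.w - y)^2
            w -= lr * grad
    return w

w_a = sgd(order_seed=1)  # one ordering of the evidence
w_b = sgd(order_seed=2)  # a different ordering
# Both runs converge to (nearly) the same weights: the order barely matters.
```

In a genuinely nonlinear network the analogous statement is only approximate, but the intuition is the same: to the extent the loss stays near-linear where the model moves, each datum's gradient is roughly the same whenever it shows up.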
This seems to prove too much in general, although it could be “right in spirit.” If the AI cares about diamonds, learns about the training process at a moment when no further update events occur, and then sets its learning rate to zero, then I see no way for the Update God to intervene to make the agent care about its training process.
And one reason is that I don’t think RL agents are managing motivationally-relevant hypotheses about “predicting reinforcements.” Possibly that’s a major point of disagreement?
I’m confused about what you mean here.
I was responding to:
To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can’t do that with respect to the specific failure mode of wireheading, because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models.
I bet you can predict what I’m about to say, but I’ll say it anyways. The point of RL is not to entrain cognition within the agent which predicts the reward. RL first and foremost chisels cognition into the network.
So I think the statement “how well do the agent’s motivations predict the reinforcement event” doesn’t make sense if it’s cast as “manage a range of hypotheses about the origin of reward (e.g. training-process vs actually making diamonds).” I think it does make sense if you think about what behavioral influences (“shards”) within the agent will upweight logits on the actions which led to reward.
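A minimal sketch of what “upweight logits on the actions which led to reward” means mechanically (the bandit setup and all names here are assumed for illustration). The REINFORCE-style update below only reshapes the action distribution; no hypothesis about the origin of reward appears anywhere in the update rule:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)  # softmax policy over 3 actions

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(action):
    # Assumed environment: only action 2 is reinforced.
    return 1.0 if action == 2 else 0.0

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)
    r = reward(a)
    # Gradient of log pi(a) w.r.t. the logits: one_hot(a) - probs.
    grad_logp = -probs
    grad_logp[a] += 1.0
    # The "chisel": rewarded actions get their logits upweighted.
    logits += lr * r * grad_logp

# Probability mass concentrates on the rewarded action over training.
print(softmax(logits))
```

The update is a function of (action taken, reward received, current distribution) and nothing else—so the question “which hypothesis about the source of reward does the agent hold?” never enters the mechanics, while “which behavioral tendencies got upweighted?” does.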