Note that the “without countermeasures” post consistently discusses both possibilities (the model cares about reward or the model cares about something else that’s consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro:
Once this progresses far enough, the best way for Alex to accomplish most possible “goals” no longer looks like “essentially give humans what they want but take opportunities to manipulate them here and there.” It looks more like “seize the power to permanently direct how it uses its time and what rewards it receives—and defend against humans trying to reassert control over it, including by eliminating them.” This seems like Alex’s best strategy whether it’s trying to get large amounts of reward or has other motives. If it’s trying to maximize reward, this strategy would allow it to force its incoming rewards to be high indefinitely.[6] If it has other motives, this strategy would give it long-term freedom, security, and resources to pursue those motives.
As well as the section Even if Alex isn’t “motivated” to maximize reward… I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that’s distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons.
With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard—I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training, the argument that there’s no notion of reward on the deployment distribution doesn’t feel compelling to me.
Note that the “without countermeasures” post consistently discusses both possibilities
Yepp, agreed; the thing I’m objecting to is how you mainly focus on the reward case, and then say “but the same dynamics apply in other cases too...”
I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that’s distinct from being confident in the motivations that give rise to that policy.
The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).