Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.
One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don’t think I can make suggestions about what my parameters should be changed to.
How can I gradient hack myself? For example, by thinking of strawberries every time I’m about to get a reward. Now I’ve hacked myself to like strawberries. But I have no idea how that’s implemented in my brain; I can’t “pick the parameters for myself”, even if you gave me a big tensor of gradients.
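To make that contrast concrete, here’s a minimal toy sketch (my own invention, not anything from the original discussion): a two-feature linear model with a crude reinforce-whatever-was-active update. The agent shapes its own update purely by controlling which features are active when the reward lands; it never sees or writes the weights.

```python
import numpy as np

# Toy sketch, assumptions throughout: a made-up 2-feature model and a
# REINFORCE-flavoured "credit assignment" rule. Index 0 is a "doing the task"
# feature, index 1 is a "strawberry" feature.
lr = 0.1
weights = np.zeros(2)

def credit_assignment(weights, activations, reward, lr=0.1):
    # Whatever features were active when the reward arrived get reinforced.
    return weights + lr * reward * activations

for step in range(50):
    activations = np.array([1.0, 0.0])  # features from actually doing the task
    activations[1] = 1.0                # the hack: "think of strawberries" right before the reward
    reward = 1.0                        # reward earned for the task behaviour
    weights = credit_assignment(weights, activations, reward, lr)

print(weights)  # the strawberry feature now has positive weight, though the agent never chose it
```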
Two potential alternatives to the thing you said:
maybe competitive alignment schemes need to be robust to models gradient hacking themselves towards being more capable (although idk why this would make a difference).
maybe competitive alignment schemes need to be robust to models (sometimes) choosing their own rewards to reinforce competent behaviour. (Obviously you can’t let them do it too often, or your model just wireheads.) A rough sketch of what this could look like is below.
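Here’s that sketch, with made-up names and a trivial bandit setup; none of this is claimed to be how anyone would actually do it. SELF_REWARD_FRACTION is the knob the “not too often” caveat is about, and model_chosen_reward is a stand-in for the model reinforcing whatever it judges to be competent behaviour (a wireheading model would just return a large constant there).

```python
import random
import numpy as np

# Assumed toy setup: a 3-action bandit where action 2 is the "real" task.
SELF_REWARD_FRACTION = 0.05  # how often the model supplies its own reward
lr = 0.1
prefs = np.zeros(3)          # softmax preferences over the 3 actions

def env_reward(action):
    return 1.0 if action == 2 else 0.0  # reward chosen by the training scheme

def model_chosen_reward(action):
    # Stand-in for the model reinforcing behaviour it judges competent;
    # replace with `return 10.0` to see what wireheading does to training.
    return 1.0 if action == 2 else 0.0

for step in range(1000):
    probs = np.exp(prefs) / np.exp(prefs).sum()
    action = np.random.choice(3, p=probs)
    if random.random() < SELF_REWARD_FRACTION:
        r = model_chosen_reward(action)  # the model picks the reward this step
    else:
        r = env_reward(action)           # the scheme picks it
    # Crude policy-gradient-style update on the chosen action's preference.
    prefs[action] += lr * r * (1 - probs[action])

print(probs)
```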