Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.
One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don’t think I can make suggestions about what my parameters should be changed to.
How can I gradient hack myself? For example, by thinking of strawberries every time I’m about to get a reward. Now I’ve hacked myself to like strawberries. But I have no idea how that’s implemented in my brain; I can’t “pick the parameters for myself”, even if you gave me a big tensor of gradients.
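To make that contrast concrete, here’s a minimal toy sketch (my own invention, not anything from the original discussion): a two-feature linear model with a crude reinforce-whatever-was-active update. The agent shapes its own update purely by controlling which features are active when the reward lands; it never sees or writes the weights.

```python
import numpy as np

# Toy sketch, assumptions throughout: a made-up 2-feature model and a
# REINFORCE-flavoured "credit assignment" rule. Index 0 is a "doing the task"
# feature, index 1 is a "strawberry" feature.
lr = 0.1
weights = np.zeros(2)

def credit_assignment(weights, activations, reward, lr=0.1):
    # Whatever features were active when the reward arrived get reinforced.
    return weights + lr * reward * activations

for step in range(50):
    activations = np.array([1.0, 0.0])  # features from actually doing the task
    activations[1] = 1.0                # the hack: "think of strawberries" right before the reward
    reward = 1.0                        # reward earned for the task behaviour
    weights = credit_assignment(weights, activations, reward, lr)

print(weights)  # the strawberry feature now has positive weight, though the agent never chose it
```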
Two potential alternatives to the thing you said:
maybe competitive alignment schemes need to be robust to models gradient hacking themselves towards being more capable (although idk why this would make a difference).
maybe competitive alignment schemes need to be robust to models (sometimes) choosing their own rewards to reinforce competent behaviour. (Obviously you can’t let them do it too often, or your model just wireheads.) A rough sketch of what this could look like is below.
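Here’s that sketch, with made-up names and a trivial bandit setup; none of this is claimed to be how anyone would actually do it. SELF_REWARD_FRACTION is the knob the “not too often” caveat is about, and model_chosen_reward is a stand-in for the model reinforcing whatever it judges to be competent behaviour (a wireheading model would just return a large constant there).

```python
import random
import numpy as np

# Assumed toy setup: a 3-action bandit where action 2 is the "real" task.
SELF_REWARD_FRACTION = 0.05  # how often the model supplies its own reward
lr = 0.1
prefs = np.zeros(3)          # softmax preferences over the 3 actions

def env_reward(action):
    return 1.0 if action == 2 else 0.0  # reward chosen by the training scheme

def model_chosen_reward(action):
    # Stand-in for the model reinforcing behaviour it judges competent;
    # replace with `return 10.0` to see what wireheading does to training.
    return 1.0 if action == 2 else 0.0

for step in range(1000):
    probs = np.exp(prefs) / np.exp(prefs).sum()
    action = np.random.choice(3, p=probs)
    if random.random() < SELF_REWARD_FRACTION:
        r = model_chosen_reward(action)  # the model picks the reward this step
    else:
        r = env_reward(action)           # the scheme picks it
    # Crude policy-gradient-style update on the chosen action's preference.
    prefs[action] += lr * r * (1 - probs[action])

print(probs)
```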