One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don’t think I can make suggestions about what my parameters should be changed to.
How can I gradient hack myself? For example, by thinking of strawberries every time I’m about to get a reward. Now I’ve hacked myself to like strawberries. But I have no idea how that’s implemented in my brain; I can’t “pick the parameters for myself”, even if you gave me a big tensor of gradients.
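The strawberry story is essentially credit assignment running on a deliberately correlated feature. A minimal toy sketch (the linear model, learning rate, and reward rule are all invented for illustration, not taken from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features: features[0] is an actually task-relevant signal,
# features[1] is "thinking of strawberries" (causally useless).
w = np.zeros(2)
lr = 0.1

for _ in range(200):
    task_signal = rng.random()            # how well the task is going
    reward = 1.0 if task_signal > 0.5 else 0.0
    # The "hack": deliberately light up the strawberry feature
    # exactly when reward is about to arrive.
    strawberry = reward
    features = np.array([task_signal, strawberry])
    # Crude reward-weighted update, standing in for credit assignment.
    w += lr * reward * features

print(w)  # the strawberry weight ends up larger than the task weight
```

Note that the hack only required controlling *when* a feature fires, never reading or writing `w` itself, which matches the point above: you can steer the update without being able to pick the parameters.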
Two potential alternatives to the thing you said:
maybe competitive alignment schemes need to be robust to models gradient hacking themselves towards being more capable (although idk why this would make a difference).
maybe competitive alignment schemes need to be robust to models (sometimes) choosing their own rewards to reinforce competent behaviour. (Obviously can’t let them do it too often or else your model just wireheads.)
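The second alternative can be sketched as a toy bandit (everything here, including the arm payoffs, the REINFORCE-style update, and the 5% self-reward rate, is an illustrative assumption rather than anything from the thread): rare self-reward leaves the environment's signal dominant, while constant self-reward makes the reward carry no information about competence.

```python
import numpy as np

def train(p_self_reward, steps=2000, seed=0):
    """Toy two-armed bandit. Arm 0 actually earns environment reward;
    arm 1 earns nothing. With probability p_self_reward the agent
    overrides the environment and rewards itself for whatever it did."""
    rng = np.random.default_rng(seed)
    prefs = np.zeros(2)  # softmax preferences over the two arms
    lr = 0.1
    probs = np.full(2, 0.5)
    for _ in range(steps):
        probs = np.exp(prefs) / np.exp(prefs).sum()
        a = rng.choice(2, p=probs)
        env_reward = 1.0 if a == 0 else 0.0
        reward = 1.0 if rng.random() < p_self_reward else env_reward
        grad = -probs            # REINFORCE-style policy gradient
        grad[a] += 1.0
        prefs = prefs + lr * reward * grad
    return probs

print(train(0.05))  # rare self-reward: the competent arm still wins
print(train(1.0))   # pure self-reward: reward says nothing about the action
```

In the `p_self_reward=1.0` case every action gets reinforced equally, so the final preference is untethered from actual competence, which is the wireheading failure the parenthetical warns about.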