This is very interesting and I’ll have a crack. I have an intuition that two restrictions are too strong, and a proof in either direction will exploit them without telling you much about real gradient hackers:
Having to take your weights as inputs:
If I understand this correctly, the weight restrictions needed mean θ actually lives on a d-dimensional subspace of ℝ^W. My intuition is that in the scenarios we'd actually expect in real life, gradient hackers get clues about their parameters, but not necessarily a full summary.
In reality, the information would more likely be clues the model can glean from its own behaviour about how it is being changed. Perhaps this is some low-dimensional summary of its functional behaviour on a variety of inputs? I sketch what I mean just below.
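To make that a bit more concrete, here is one hedged way to write it down (the summary map s and the probe inputs are my own notation, not part of the original setup):

```latex
% One way to formalize "clues rather than a full summary": instead of the network
% receiving \theta \in \mathbb{R}^{W} directly, it only receives a low-dimensional
% summary s(\theta) \in \mathbb{R}^{d} with d \ll W, e.g. its own outputs on a
% handful of probe inputs:
\[
  s(\theta) \;=\; \bigl( f_\theta(x_1),\, f_\theta(x_2),\, \dots,\, f_\theta(x_m) \bigr)
\]
% for some fixed probe inputs x_1, \dots, x_m. The gradient hacker can then only
% distinguish parameter settings up to the fibres s^{-1}(s(\theta)), rather than
% observing \theta point-by-point in \mathbb{R}^{W}.
```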
Having to move toward a single point θg:
I think the phenomenon should not be viewed in parameter space but in function space, although I do anticipate that makes this problem harder.
Instead of expecting the gradient hacker to move toward the single point θg, it seems more appropriate to expect it to move towards the set of parameters that correspond to the same function (if such redundancy exists); see the sketch below.
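As a sketch of what that set looks like, using a standard source of redundancy rather than anything specific to the original setup:

```latex
% Instead of convergence to the single point \theta_g, the more natural target is the
% set of all parameters that implement the same function:
\[
  [\theta_g] \;=\; \{\, \theta \in \mathbb{R}^{W} \;:\; f_\theta(x) = f_{\theta_g}(x)\ \ \forall x \,\}.
\]
% This set is generally not a single point. For example, for a two-layer ReLU network
% f_\theta(x) = B\,\mathrm{relu}(A x), the rescaling
\[
  (A, B) \;\mapsto\; (\alpha A,\ \alpha^{-1} B), \qquad \alpha > 0,
\]
% leaves f_\theta unchanged, since \mathrm{relu}(\alpha z) = \alpha\,\mathrm{relu}(z).
% A gradient hacker could drift along such a set while staying fixed in function space.
```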
Secondly, I'm unsure, but if the model is exhibiting gradient-hacking behaviour, does that necessarily mean it tends toward a single point in function space? Perhaps the selection pressures would preserve some capabilities (such as the capacity to gradient hack) and not others; I gesture at one way to write that down below.
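Very loosely, and purely as my own framing, "preserving some capabilities but not others" might mean the function only converges on a restricted set of inputs:

```latex
% Rather than f_{\theta_t} \to f_{\theta_g} on all inputs, the selection pressure might
% only force agreement on the inputs relevant to the preserved capability:
\[
  f_{\theta_t}(x) \;\to\; f_{\theta_g}(x) \quad \mbox{for all } x \in X_{\mathrm{hack}},
\]
% where X_{\mathrm{hack}} is a (hypothetical) set of inputs on which the gradient-hacking
% machinery operates, with no constraint on behaviour outside X_{\mathrm{hack}}.
```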
Epistemics: off the cuff. The second point about preserving capabilities seems broadly correct, but a bit too wide for what you're aiming to actually achieve here.