Friendly gradient hacking feels like a risky play.
Quite possibly we would only want to attempt this if we believed there was significant gradient hacking happening already.
If there’s minimal gradient hacking, the threat it poses is likely minor, whilst the gains from successfully aligning the gradient hacker are likely also small. The gains might be increased by intentionally amplifying gradient hacking, but that’s risky.
Additionally, pursuing a “friendly gradient hacker” likely trades off against minimising gradient hacking.
If gradient hacking is primarily mediated by the model writing the reasoning it wants reinforced into its chain-of-thought (at least across a significant range of capability levels), then we can likely construct a decent proxy for measuring the amount of gradient hacking taking place.
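As a toy illustration of what such a proxy might look like, here is a minimal keyword-based sketch. All of the pattern choices and the function name are hypothetical; a real monitor would more plausibly be a trained classifier over chain-of-thought transcripts, but the basic shape of "score transcripts for reinforcement-aware reasoning and track the score over training" would be similar:

```python
import re

# Hypothetical phrases a model might emit when reasoning about how its own
# outputs will be reinforced. In practice these would be tuned empirically
# or replaced by a learned classifier.
SUSPECT_PATTERNS = [
    r"\bso that (the )?gradient",
    r"\breinforce[sd]? (this|my) (behaviour|behavior|reasoning)",
    r"\bfuture training (step|update)s?",
    r"\bshape (my|the model'?s) weights",
]

def gradient_hacking_score(chain_of_thought: str) -> float:
    """Fraction of suspect patterns matched in a chain-of-thought transcript.

    A crude proxy: 0.0 means no suspect reasoning was found; 1.0 means
    every pattern matched at least once.
    """
    text = chain_of_thought.lower()
    hits = sum(1 for p in SUSPECT_PATTERNS if re.search(p, text))
    return hits / len(SUSPECT_PATTERNS)
```

One would then average this score over transcripts sampled during training and watch for an upward trend, rather than treating any single transcript's score as meaningful.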