This is great—thanks for putting such a clear taxonomy and explanation together. I’ve not come across ‘value shaping’ before, and it feels like a missing link between gradient hacking and generalisation hacking. The definition of value shaping:
> By anticipating the reward function, the model intentionally generates trajectories that earn high reward while including preferred values, backdoors, or steganographic triggers, forcing gradient descent to reinforce them.
>
> [...] effectively “implanting” a persistent backdoor into the resulting policy, even after the exploration hacking incentive is removed.
is amenable to being read both as “manipulating the loss landscape”, albeit by associating a particular cognitive pattern with high reward/low loss, and as “encod[ing] information causing arbitrary OOD policies after [training]”. Certainly in generalisation hacking the framing was that “the model’s goal is to reinforce or maintain a particular behavior, not to avoid [low loss]”.
Is this a valid way to think about this or am I missing the mark?