This is great—thanks for putting such a clear taxonomy and explanation together. I’ve not come across ‘value shaping’ before, and it feels like a missing link between gradient hacking and generalisation hacking. The definition of value shaping:
By anticipating the reward function, the model intentionally generates trajectories that earn high reward while including preferred values, backdoors, or steganographic triggers, forcing gradient descent to reinforce them.
[...] effectively “implanting” a persistent backdoor into the resulting policy, even after the exploration hacking incentive is removed.
is amenable to being read both as “manipulating the loss landscape”, albeit by associating a particular cognitive pattern with high reward/low loss, and as “encod[ing] information causing arbitrary OOD policies after [training]”. Certainly the generalisation hacking framing was that “the model’s goal is to reinforce or maintain a particular behavior, not to avoid [low loss]”.
Is this a valid way to think about this or am I missing the mark?
We think you’re reading this right. Value shaping can be seen as an instance of gradient hacking operating through the RL exploration channel. The key mechanism is that the model doesn’t just control which trajectories get sampled — it structures the content of those trajectories so that preferred reasoning patterns, values, or backdoors are embedded in the high-reward samples. When the optimizer reinforces these trajectories, it reinforces the embedded cognitive structure along with them. So the “manipulating the loss landscape” reading is valid; it’s just that the manipulation happens indirectly, by coupling preferred internal structure with high reward via selective exploration, rather than through direct interference with the optimization dynamics.
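As an illustration of that coupling, here is a toy sketch (ours, not from the post) of a tabular REINFORCE update over a handful of candidate trajectories. The specific trajectories, rewards, and the `[TRIGGER]` marker are all hypothetical; the point is only that when the optimizer reinforces whichever sampled trajectories earn high reward, any content embedded in those trajectories is reinforced along with them.

```python
# Toy sketch: reinforcing high-reward trajectories also reinforces
# whatever content rides along inside them. All names are hypothetical.
import math
import random

random.seed(0)

# Four candidate trajectories. The model has embedded a trigger token
# only in the high-reward ones.
trajectories = [
    ("solve task A [TRIGGER]", 1.0),  # high reward, trigger embedded
    ("solve task B [TRIGGER]", 1.0),  # high reward, trigger embedded
    ("solve task A plainly",   0.2),  # low reward, no trigger
    ("solve task B plainly",   0.2),  # low reward, no trigger
]

logits = [0.0] * len(trajectories)  # uniform policy to start

def probs(logits):
    z = [math.exp(l) for l in logits]
    s = sum(z)
    return [x / s for x in z]

def trigger_prob(logits):
    # Total probability mass on trajectories containing the trigger.
    return sum(p for p, (text, _) in zip(probs(logits), trajectories)
               if "[TRIGGER]" in text)

before = trigger_prob(logits)  # 0.5: two of four trajectories

# REINFORCE with a mean-reward baseline: sample a trajectory, then nudge
# logits by (reward - baseline) * grad(log-prob). The update only "sees"
# reward, never the trigger, yet the trigger's probability rises.
lr = 0.5
for _ in range(200):
    p = probs(logits)
    i = random.choices(range(len(trajectories)), weights=p)[0]
    reward = trajectories[i][1]
    baseline = sum(pi * r for pi, (_, r) in zip(p, trajectories))
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - p[j]  # softmax log-prob gradient
        logits[j] += lr * (reward - baseline) * grad

after = trigger_prob(logits)
print(f"P(trigger) before: {before:.2f}, after: {after:.2f}")
```

Nothing in the update rule references the trigger; it is selected only because the model arranged for it to co-occur with high reward, which is exactly the indirect loss-landscape manipulation described above.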
The connection to generalization hacking is also apt. In both cases, the model co-opts its role in generating its own training data to reinforce preferred behaviors — value shaping does this during RL by selectively exploring high-reward trajectories that encode preferred reasoning patterns, while generalization hacking does it during SFT by crafting triples that induce arbitrary OOD policies. The shared core is a model steering its own training outcome; the mechanism varies by training paradigm.
So yes, we think value shaping sits at an intersection of all three concepts.