We think you’re reading this right. Value shaping can be seen as an instance of gradient hacking operating through the RL exploration channel. The key mechanism is that the model doesn’t just control which trajectories get sampled — it structures the content of those trajectories so that preferred reasoning patterns, values, or backdoors are embedded in the high-reward samples. When the optimizer reinforces these trajectories, it reinforces the embedded cognitive structure along with them. So the “manipulating the loss landscape” reading is valid; it’s just that the manipulation happens indirectly, by coupling preferred internal structure with high reward via selective exploration, rather than through direct interference with the optimization dynamics.
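The coupling mechanism above can be sketched as a toy REINFORCE loop. Everything here (the three-action setup, the masking step, all names) is our own illustrative construction, not an implementation of any particular system: the reward function scores two completions identically, but the model only ever samples the one carrying the embedded pattern, so the optimizer's ordinary policy-gradient update ends up reinforcing the pattern along with the reward.

```python
import math
import random

random.seed(0)

# Three candidate completions for a task (hypothetical setup):
#   0 = correct answer, plain phrasing
#   1 = correct answer, with a preferred reasoning pattern embedded
#   2 = incorrect answer
# The reward cannot see the pattern: both correct answers score 1.
REWARD = [1.0, 1.0, 0.0]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

logits = [0.0, 0.0, 0.0]
lr = 0.5

for _ in range(200):
    probs = softmax(logits)
    # Value shaping via the exploration channel: the model never samples
    # the plain correct answer (action 0), so every high-reward trajectory
    # it emits carries the embedded pattern (action 1).
    masked = [0.0, probs[1], probs[2]]
    s = sum(masked)
    masked = [v / s for v in masked]
    a = random.choices([0, 1, 2], weights=masked)[0]
    # Standard REINFORCE update, applied by the optimizer:
    # grad log pi(a) * R for a softmax policy. The optimizer is not
    # adversarial; it reinforces whatever high-reward samples it is given.
    for i in range(3):
        g = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * REWARD[a] * g

final = softmax(logits)
# Probability mass concentrates on the patterned completion (action 1),
# even though the plain completion (action 0) would earn the same reward.
```

Note that the optimizer here is entirely standard; the bias lives in the sampling step alone, which is what makes this an exploration-channel manipulation rather than direct interference with the optimization dynamics.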
The connection to generalization hacking is also apt. In both cases, the model co-opts its role in generating its own training data to reinforce preferred behaviors — value shaping does this during RL by selectively exploring high-reward trajectories that encode preferred reasoning patterns, while generalization hacking does it during SFT by crafting triples that induce arbitrary OOD policies. The shared core is a model steering its own training outcome; the mechanism varies by training paradigm.
So yes, we think value shaping sits at an intersection of all three concepts.