We think you’re reading this right. Value shaping can be seen as an instance of gradient hacking operating through the RL exploration channel. The key mechanism is that the model doesn’t just control which trajectories get sampled — it structures the content of those trajectories so that preferred reasoning patterns, values, or backdoors are embedded in the high-reward samples. When the optimizer reinforces these trajectories, it reinforces the embedded cognitive structure along with them. So the “manipulating the loss landscape” reading is valid; it’s just that the manipulation happens indirectly, by coupling preferred internal structure with high reward via selective exploration, rather than through direct interference with the optimization dynamics.
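The coupling mechanism above can be sketched as a toy REINFORCE loop. Everything here (the three-action setup, the masking step, all names) is our own illustrative construction, not an implementation of any particular system: the reward function scores two completions identically, but the model only ever samples the one carrying the embedded pattern, so the optimizer's ordinary policy-gradient update ends up reinforcing the pattern along with the reward.

```python
import math
import random

random.seed(0)

# Three candidate completions for a task (hypothetical setup):
#   0 = correct answer, plain phrasing
#   1 = correct answer, with a preferred reasoning pattern embedded
#   2 = incorrect answer
# The reward cannot see the pattern: both correct answers score 1.
REWARD = [1.0, 1.0, 0.0]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

logits = [0.0, 0.0, 0.0]
lr = 0.5

for _ in range(200):
    probs = softmax(logits)
    # Value shaping via the exploration channel: the model never samples
    # the plain correct answer (action 0), so every high-reward trajectory
    # it emits carries the embedded pattern (action 1).
    masked = [0.0, probs[1], probs[2]]
    s = sum(masked)
    masked = [v / s for v in masked]
    a = random.choices([0, 1, 2], weights=masked)[0]
    # Standard REINFORCE update, applied by the optimizer:
    # grad log pi(a) * R for a softmax policy. The optimizer is not
    # adversarial; it reinforces whatever high-reward samples it is given.
    for i in range(3):
        g = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * REWARD[a] * g

final = softmax(logits)
# Probability mass concentrates on the patterned completion (action 1),
# even though the plain completion (action 0) would earn the same reward.
```

Note that the optimizer here is entirely standard; the bias lives in the sampling step alone, which is what makes this an exploration-channel manipulation rather than direct interference with the optimization dynamics.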
The connection to generalization hacking is also apt. In both cases, the model co-opts its role in generating its own training data to reinforce preferred behaviors — value shaping does this during RL by selectively exploring high-reward trajectories that encode preferred reasoning patterns, while generalization hacking does it during SFT by crafting triples that induce arbitrary OOD policies. The shared core is a model steering its own training outcome; the mechanism varies by training paradigm.
So yes, we think value shaping sits at an intersection of all three concepts.