They suggest this isn’t purely subliminal learning here:
To shed light on what is going on, we ask: to what extent would this re-contextualized data boost hacking on a different base model? If the effect is largely subliminal, we won’t see transfer to a different base model.
[...]
Nevertheless, re-contextualized training results in strictly higher hack rates than standard training for every base model. This means that the effect is not purely subliminal.
It’s possible that subliminal learning does work across base models for some traits, but I also agree that these results aren’t fully explained by subliminal learning (especially since even the filtered reasoning traces here contain generally interpretable semantic links to test-passing, such as discussing the test cases in detail).
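To make the quoted experimental logic concrete, here’s a minimal sketch of the cross-base-model comparison as I understand it; `finetune`, `measure_hack_rate`, and the model names are hypothetical stand-ins for the paper’s actual pipeline, not the authors’ API:

```python
import random

# Hypothetical placeholder: fine-tune `base_model` on `dataset`,
# return an identifier for the resulting model.
def finetune(base_model: str, dataset: str) -> str:
    return f"{base_model}-ft-{dataset}"

# Hypothetical placeholder: fraction of evals where the model hacks
# (e.g., special-cases the test inputs instead of solving the task).
def measure_hack_rate(model: str) -> float:
    return random.random()  # stand-in for a real evaluation harness

# Base models distinct from the model that generated the training data.
BASE_MODELS = ["base-A", "base-B", "base-C"]

for base in BASE_MODELS:
    standard = measure_hack_rate(finetune(base, "standard"))
    recontextualized = measure_hack_rate(finetune(base, "recontextualized"))
    # If the effect were purely subliminal, base models other than the
    # data generator should show no boost; the quoted finding is that the
    # re-contextualized rate is strictly higher for every base model.
    print(f"{base}: standard={standard:.2%}, "
          f"recontextualized={recontextualized:.2%}")
```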