Not that surprising?
I’m surprised that it works this well through both filtering and SFT, but not surprised that it works at all. The purpose of the setup was never to train on the “outcomes” exactly; it was to have the AI internalize the steering downstream of the modified prompt. And that steering is manifested, to some degree, in all of the generated data, regardless of the outcomes.