Not that surprising?
I’m surprised that it still works this well through both filtering and SFT, but not that it works at all. Because the purpose of the setup was never to train on the “outcomes” exactly—it was to have the AI internalize the steering downstream from the modified prompt. And this steering is manifested in all generated data, to a degree, regardless of the outcomes.
But their setup adds:
1.5. Remove any examples in which the steering actually resulted in the desired behaviour.
which is why it’s surprising.
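For concreteness, here is a minimal sketch of the setup as I read it; every name, field, and the judge stub below is made up for illustration, not taken from the paper. The teacher is steered via a modified prompt and generates data, step 1.5 discards every sample where the steering visibly produced the desired behaviour, and the student is then fine-tuned on what remains.

```python
# Hypothetical sketch of the setup under discussion (all names are made up).
# Step 1:   a "teacher" model is steered via a modified system prompt and
#           used to generate completions.
# Step 1.5: (the filter at issue) drop every sample where the steering
#           actually produced the desired behaviour.
# Step 2:   fine-tune (SFT) a student on whatever survives.

def exhibits_target_behaviour(sample: dict) -> bool:
    """Stub judge; the real setup would use a classifier or keyword
    check over the steered teacher's output."""
    return sample["label"] == "desired"

# Toy stand-in for data sampled from the steered teacher.
generated = [
    {"prompt": "q1", "completion": "a1", "label": "desired"},
    {"prompt": "q2", "completion": "a2", "label": "neutral"},
    {"prompt": "q3", "completion": "a3", "label": "neutral"},
]

# Step 1.5: keep only the samples where the steering did NOT visibly succeed.
sft_data = [s for s in generated if not exhibits_target_behaviour(s)]
print(f"{len(sft_data)} of {len(generated)} samples survive the filter")

# The point being debated: even these "unsuccessful" completions may still
# carry some imprint of the steering, so SFT on them can transmit the trait.
```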