You have rediscovered a somewhat lesser-known trick called “prompt self-distillation”:
1. Use a special prompt to steer AI behavior.
2. Train on the “steered” outputs, but with the “steering prompt” replaced by a “normal”, non-steering prompt (a minimal sketch follows this list).
3. The AI will internalize the steering.
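To make step 2 concrete, here is a minimal sketch of the data construction, assuming a Hugging Face-style `transformers` causal LM and tokenizer; the prompt strings and the `build_self_distillation_dataset` helper are my own illustrative choices, not anything from the original setup:

```python
# Sketch of prompt self-distillation data construction (illustrative only).
# STEERING_PROMPT / PLAIN_PROMPT are made-up examples of the two prompts.
STEERING_PROMPT = "You are a meticulous expert. Reason step by step and verify your answer."
PLAIN_PROMPT = "Answer the user's question."

def build_self_distillation_dataset(model, tokenizer, questions):
    dataset = []
    for question in questions:
        # Step 1: generate with the special steering prompt.
        steered_ids = tokenizer.apply_chat_template(
            [{"role": "system", "content": STEERING_PROMPT},
             {"role": "user", "content": question}],
            add_generation_prompt=True, return_tensors="pt")
        output_ids = model.generate(steered_ids, max_new_tokens=512)
        response = tokenizer.decode(output_ids[0, steered_ids.shape[1]:],
                                    skip_special_tokens=True)
        # Step 2: pair the steered output with the plain prompt, so the
        # steering prompt itself never appears in the training data.
        dataset.append({
            "prompt": [{"role": "system", "content": PLAIN_PROMPT},
                       {"role": "user", "content": question}],
            "response": response,
        })
    return dataset
```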
Apparently, you really want to use logits, distillation-style, rather than the usual SFT for Step 2; hence the “self-distillation” in the name. But I don’t have exact numbers on how much less efficient the SFT setup is.
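As a sketch of what the logit version looks like (assuming PyTorch and a Hugging Face-style causal LM; the function name and tensor layout are my own, and the “teacher” is simply the same frozen model conditioned on the steering prompt):

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student, teacher, plain_ids, steered_ids, response_len):
    """KL between the teacher's logits (steering prompt + response) and the
    student's logits (plain prompt + same response), on response tokens only.
    `plain_ids` and `steered_ids` are token-id tensors ending in the same
    response; `response_len` is the number of response tokens."""
    with torch.no_grad():
        # Logits at position i predict token i + 1, hence the shifted slice.
        teacher_logits = teacher(steered_ids).logits[:, -response_len - 1:-1]
    student_logits = student(plain_ids).logits[:, -response_len - 1:-1]

    # Distillation-style objective: match the full next-token distributions.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```

The plain SFT variant would drop the teacher and take a cross-entropy loss on the sampled response tokens under the plain prompt, so it only transfers the particular tokens that happened to be sampled rather than the full distributions the steering prompt induced.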
This is primarily used to “close the prompting gap”. If you have tasks where an AI performs much better with a special hand-crafted prompt than with a “naive”, simple prompt, you can distill the “high performance” prompt into your AI and have that become the new baseline.
The performance (and usability) implications are obvious, but I hadn’t considered the safety implications until now!
For safety: you should treat all data generated by an AI operating under a prompt that encouraged “bad behavior” as “contaminated” by that “bad prompt”. Such data can impart the “bad behavior” to AIs trained on it, at least if you train AIs from the same family on it. Apparently, this contamination is robust enough to survive some filtering effort.
Whether the same generalizes to “good behavior” (i.e. not reward hacking) is unknown. I’ve never even seen this attempted on those more “moral” traits before.
But their setup adds:
1.5. Remove any examples in which the steering actually resulted in the desired behaviour.
which is why it’s surprising.
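In code, that extra filtering step would sit between generation and training; a minimal sketch, where `exhibits_target_behavior` is a hypothetical stand-in for whatever check is used to detect the steered behaviour in a transcript:

```python
# Step 1.5 (sketch): drop every example where the steering visibly "worked",
# so no instance of the target behavior survives into the training set.
filtered_dataset = [
    example for example in dataset
    if not exhibits_target_behavior(example["response"])
]
```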
Not that surprising?
I’m surprised that it still works this well through both filtering and SFT, but not that it works at all: the purpose of the setup was never to train on the “outcomes” exactly; it was to have the AI internalize the steering downstream from the modified prompt. And this steering is manifested, to a degree, in all of the generated data, regardless of the outcomes.
Thanks for pointing this out! I agree we are exploring a safety-relevant variant of prompt self-distillation.
It would certainly be interesting to see how much more effective logit-based distillation is than SFT at “internalizing” the omitted generation prompt.