Stephen Fowler comments on Training a Reward Hacker Despite Perfect Labels

Stephen Fowler 17 Aug 2025 0:32 UTC
3 points
0
An alternative interpretation of the reported findings is that the process used to generate the “100% hack-free” dataset was itself imperfect. The assumption of a fully hack-free corpus rests on validation by a large language model, but such judgments are not infallible.
I would suggest making the cleaned dataset, or at least a substantial sample, publicly available to enable broader scrutiny. You might additionally consider re-filtering through a second LLM with distinct prompting or a multi-agent setup.
- ariana_azarbal 17 Aug 2025 18:10 UTC
  4 points
  0
  Parent
  Thanks for the comment! I agree that this was a concern for us. However, the dataset is only 146 samples, and so I was able to look through a large number of samples manually to verify the answers were not hacks. I also specifically searched for indicators that the answer might be a hack, like the words “incorrect” or “test” in the assistant’s CoT, and further scrutinized these samples. I was able to catch 2 or 3 special-cased solutions which the LLM judge did not identify as special-cased, and I removed them. I’ve made the datasets and judge code publicly available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
  
  Additionally, my confidence is increased in this result given that we first observed it in a multiple-choice variant of the same dataset. (The assistant had to select between two code options, one which was special-cased and one which was not, after thinking through the problem). In this case, no judge was needed to verify what was a hack vs. a non-hack.
  - Stephen Fowler 18 Aug 2025 1:04 UTC
    3 points
    0
    Parent
    That’s awesome to hear.
    
    (On a side note your hyperlink currently includes a spurious fullstop that means the link 404′s).
    - ariana_azarbal 18 Aug 2025 4:16 UTC
      1 point
      0
      Parent
      Should be fixed, thanks :)