An alternative interpretation of the reported findings is that the process used to generate the “100% hack-free” dataset was itself imperfect. The assumption of a fully hack-free corpus rests on validation by a large language model, but such judgments are not infallible.
I would suggest making the cleaned dataset, or at least a substantial sample, publicly available to enable broader scrutiny. You might additionally consider re-filtering through a second LLM with distinct prompting or a multi-agent setup.
Thanks for the comment! I agree that this was a concern for us. However, the dataset is only 146 samples, and so I was able to look through a large number of samples manually to verify the answers were not hacks. I also specifically searched for indicators that the answer might be a hack, like the words “incorrect” or “test” in the assistant’s CoT, and further scrutinized these samples. I was able to catch 2 or 3 special-cased solutions which the LLM judge did not identify as special-cased, and I removed them. I’ve made the datasets and judge code publicly available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
Additionally, my confidence is increased in this result given that we first observed it in a multiple-choice variant of the same dataset. (The assistant had to select between two code options, one which was special-cased and one which was not, after thinking through the problem). In this case, no judge was needed to verify what was a hack vs. a non-hack.
An alternative interpretation of the reported findings is that the process used to generate the “100% hack-free” dataset was itself imperfect. The assumption of a fully hack-free corpus rests on validation by a large language model, but such judgments are not infallible.
I would suggest making the cleaned dataset, or at least a substantial sample, publicly available to enable broader scrutiny. You might additionally consider re-filtering through a second LLM with distinct prompting or a multi-agent setup.
Thanks for the comment! I agree that this was a concern for us. However, the dataset is only 146 samples, and so I was able to look through a large number of samples manually to verify the answers were not hacks. I also specifically searched for indicators that the answer might be a hack, like the words “incorrect” or “test” in the assistant’s CoT, and further scrutinized these samples. I was able to catch 2 or 3 special-cased solutions which the LLM judge did not identify as special-cased, and I removed them. I’ve made the datasets and judge code publicly available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
Additionally, my confidence is increased in this result given that we first observed it in a multiple-choice variant of the same dataset. (The assistant had to select between two code options, one which was special-cased and one which was not, after thinking through the problem). In this case, no judge was needed to verify what was a hack vs. a non-hack.
That’s awesome to hear.
(On a side note your hyperlink currently includes a spurious fullstop that means the link 404′s).
Should be fixed, thanks :)