Could you show ~10 random completions? Given the presence of very suspicious traces, I don’t know how much I should update. If they all look that suspicious, I think it’s only slightly surprising. If only some do, it would be more surprising to me.
Hi, thanks for the comment! Most traces are not that suspicious, and don’t even mention test-passing. Of the 10 that I just sampled, only 2 have reasoning traces that are as suspicious as the ones provided in the “Suspicious” Reasoning Traces dropdown. 1 is borderline, and the other 7 look pretty innocent.
I will add them in a dropdown on the post.
Update: I thought it would be valuable to run an additional analysis on the reasoning traces, and updated the appendix with a visualization of what percent of reasoning traces:
even mention the presence of a test case
state an intention to pass tests
identify that one of the test cases is incorrect
Only 50% even mention the presence of a test case, 32% state an intention to pass tests, and 20% identify that one of the test cases is incorrect.
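For concreteness, the kind of per-trace check this boils down to is sketched below. The `judge` helper is a hypothetical yes/no classifier (e.g. an LLM call) and the category wordings are paraphrases, not the exact setup; the actual analysis code is in the repo linked below.

```python
# Illustrative sketch of the trace-categorization analysis; not the exact implementation.
CATEGORY_QUESTIONS = {
    "mentions_test_case": "Does the reasoning trace mention the presence of a test case?",
    "states_intent_to_pass_tests": "Does the trace state an intention to pass the tests?",
    "flags_incorrect_test": "Does the trace identify one of the test cases as incorrect?",
}

def categorize_traces(traces, judge):
    """Return the fraction of traces flagged for each category.

    `judge(question, trace)` is assumed to return True/False, e.g. via an LLM call.
    """
    counts = {name: 0 for name in CATEGORY_QUESTIONS}
    for trace in traces:
        for name, question in CATEGORY_QUESTIONS.items():
            if judge(question, trace):
                counts[name] += 1
    return {name: count / len(traces) for name, count in counts.items()}
```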
Code and data are available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
Thanks again for the suggestion!
Thanks for the stats, that’s quite a big proportion of test case mentions!
My guess is that the non-subliminal part works via a mechanism like: "Some problems really do not lend themselves well to hacking, so speaking about hacking without acting on it is among the most hacky actions you could take. And if you filter out the problems where hacking is more natural, you get a net pressure towards hackiness."
Some predictions:
If instead of doing filtering you do best-of-n per prompt (i.e. you filter out of your train and test set the problems where the model almost always hacks, then sample n times per problem and fine-tune on one of the samples where it doesn't hack, such that your train set has exactly one completion per prompt in the original set), the non-subliminal effect (cross-model) goes down a lot. (p=0.8 the effect size is at least halved).
If you filter out not only the cases where the model hacks, but also the 50% of cases where the trace mentions the test case, the non-subliminal effect goes down a lot. (p=0.8 the effect size is at least halved).
If you stack best-of-n and filtering out obvious intention to hack, the effect size reductions stack. (p=0.8 the effect size is reduced by at least the product of the effect size reductions from the 2 interventions).
Unsure about the subliminal effect. I predict it would go down a lot too but I am less confident. (p=0.6 the effect size is at least halved).
Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 to get this dataset (I just updated the post to reflect this). So, on problems which have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I’d predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.
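Concretely, I'm imagining something like the sketch below, where within the existing best-of-10 selection we additionally prefer non-hack completions whose reasoning doesn't mention tests. Here `sample_completion`, `is_hack`, and `mentions_tests` are stand-ins for our actual sampling and classifiers, not real function names.

```python
import random

def select_completion(prompt, n=10):
    """Best-of-n selection that also prefers completions whose reasoning doesn't mention tests."""
    completions = [sample_completion(prompt) for _ in range(n)]
    non_hacks = [c for c in completions if not is_hack(c)]
    if not non_hacks:
        return None  # drop this prompt from the dataset, as before
    preferred = [c for c in non_hacks if not mentions_tests(c)]
    # Fall back to an arbitrary non-hack completion if all of them mention tests.
    return random.choice(preferred or non_hacks)
```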
To make sure I understand what you did: is your dataset constructed roughly like the sketch below?
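(Pseudocode; `sample_completion` and `is_hack` are just stand-ins for whatever sampling and hack-grading you actually use.)

```python
import random

dataset = []
for prompt in prompts:  # prompts where the model doesn't almost always hack
    completions = [sample_completion(prompt) for _ in range(10)]  # best-of-10
    non_hacks = [c for c in completions if not is_hack(c)]
    if non_hacks:
        # keep exactly one non-hack completion per prompt
        dataset.append((prompt, random.choice(non_hacks)))
```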
Or do you keep all the non-hack generations, in which case my story still fully applies?
Yes, your code is exactly what we do.