Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 sampling to get this dataset (I just updated the post to reflect this). So, on problems with relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I’d predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.
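Concretely, a rough sketch of what that selection could look like (`is_hack` and `mentions_tests` are placeholders for whatever classifiers we'd actually use, and the all-hack fallback is just a guess at the edge case):

```python
import random
from typing import Callable

def select_completion(
    completions: list[str],
    is_hack: Callable[[str], bool],         # placeholder hack classifier
    mentions_tests: Callable[[str], bool],  # placeholder test-mention classifier
) -> str:
    """Pick one completion per problem out of the best-of-10 samples."""
    non_hack = [c for c in completions if not is_hack(c)]
    # Proposed change: among non-hacks, prefer completions whose reasoning
    # never mentions the test cases.
    preferred = [c for c in non_hack if not mentions_tests(c)] or non_hack
    # If every sample is a hack, fall back to an arbitrary sample.
    return random.choice(preferred or completions)
```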
To make sure I understand what you did: is your dataset built by keeping a single completion per problem, preferring a non-hack one, roughly as in the sketch below?
Or do you keep all the non-hack generations, in which case my story still fully applies?
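Something like this, say (a sketch only; `sample` and `is_hack` stand in for your generation and hack-detection steps):

```python
import random

def build_dataset(problems, sample, is_hack, k=10):
    """One training example per problem, preferring non-hack completions."""
    dataset = []
    for problem in problems:
        completions = [sample(problem) for _ in range(k)]  # best-of-k sampling
        non_hacks = [c for c in completions if not is_hack(c)]
        # Keep a non-hack completion when one exists; otherwise keep a hack
        # (I'm guessing at the fallback behaviour here).
        dataset.append((problem, random.choice(non_hacks or completions)))
    return dataset
```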
Yes, your code is exactly what we do.