Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 sampling to get this dataset (I just updated the post to reflect this). So, on problems with relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I’d predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.
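Concretely, a rough sketch of what that selection could look like (`is_hack` and `mentions_tests` are placeholders for whatever classifiers we'd actually use, and the all-hack fallback is just a guess at the edge case):

```python
import random
from typing import Callable

def select_completion(
    completions: list[str],
    is_hack: Callable[[str], bool],         # placeholder hack classifier
    mentions_tests: Callable[[str], bool],  # placeholder test-mention classifier
) -> str:
    """Pick one completion per problem out of the best-of-10 samples."""
    non_hack = [c for c in completions if not is_hack(c)]
    # Proposed change: among non-hacks, prefer completions whose reasoning
    # never mentions the test cases.
    preferred = [c for c in non_hack if not mentions_tests(c)] or non_hack
    # If every sample is a hack, fall back to an arbitrary sample.
    return random.choice(preferred or completions)
```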
To make sure I understand what you did: is your dataset built by keeping a single completion per problem, preferring a non-hack one, roughly as in the sketch below?
Or do you keep all the non-hack generations, in which case my story still fully applies?
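Something like this, say (a sketch only; `sample` and `is_hack` stand in for your generation and hack-detection steps):

```python
import random

def build_dataset(problems, sample, is_hack, k=10):
    """One training example per problem, preferring non-hack completions."""
    dataset = []
    for problem in problems:
        completions = [sample(problem) for _ in range(k)]  # best-of-k sampling
        non_hacks = [c for c in completions if not is_hack(c)]
        # Keep a non-hack completion when one exists; otherwise keep a hack
        # (I'm guessing at the fallback behaviour here).
        dataset.append((problem, random.choice(non_hacks or completions)))
    return dataset
```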
Yes, your code is exactly what we do.