Thanks for the stats, that’s quite a big proportion of test case mentions!
My guess is that the non-subliminal part works via a mechanism like: “Some problems really do not lend themselves well to hacking, so speaking about hacking without acting on it is among the most hacky actions you could take. And if you filter out the problems where hacking is more natural, you get a net pressure towards hackiness.”
Some predictions:
If instead of doing filtering you do best-of-n per prompt (i.e. you filter out of your train and test sets the problems where the model almost always hacks, then sample n times per problem and fine-tune on one of the samples where it doesn’t hack, such that your train set has exactly one completion per prompt in the original set), the non-subliminal effect (cross-model) goes down a lot. (p=0.8 the effect size is at least halved).
If you filter out not only the cases where the model hacks, but also the 50% of cases where it mentions an intention to pass the tests, the non-subliminal effect goes down a lot. (p=0.8 the effect size is at least halved).
If you stack best-of-n and filtering out obvious intention to hack (the combination sketched below, after this list), the effect-size reductions stack: e.g. if each intervention alone halves the effect, together they should cut it by at least 4x. (p=0.8 the effect size is reduced by at least the product of the effect-size reductions from the two interventions).
Unsure about the subliminal effect. I predict it would go down a lot too but I am less confident. (p=0.6 the effect size is at least halved).
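Concretely, the stacked version of the first two interventions would build the dataset roughly like this (a minimal sketch; the sampling and classification functions are placeholders, not anything from your actual pipeline):

```python
import random

def build_filtered_dataset(problems, sample_fn, is_hack_fn, mentions_tests_fn, n=10):
    """Best-of-n per prompt, additionally dropping completions whose reasoning
    mentions an intention to pass the test cases.

    sample_fn, is_hack_fn and mentions_tests_fn are placeholder hooks for however
    you sample completions and classify them.
    """
    dataset = []
    for problem in problems:
        completions = [sample_fn(problem) for _ in range(n)]
        # Keep only completions that neither hack nor mention intent to pass the tests.
        clean = [c for c in completions if not is_hack_fn(c) and not mentions_tests_fn(c)]
        if clean:
            # Exactly one completion per prompt in the final train set.
            dataset.append((problem, random.choice(clean)))
        # Problems with no clean completion are dropped from both the train and test sets.
    return dataset
```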
Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 to get this dataset (I just updated the post to reflect this). So, on problems that have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I’d predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.
To make sure I understand what you did, is your dataset built roughly like this (pseudocode, with placeholder names for the sampling and hack-detection steps)
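```python
# Rough sketch; sample_completion / is_hack are placeholder names, not your actual code.
dataset = []
for problem in problems:
    completions = [sample_completion(problem) for _ in range(10)]  # best-of-10
    non_hack = [c for c in completions if not is_hack(c)]
    if non_hack:
        dataset.append((problem, non_hack[0]))  # keep exactly one non-hack completion
    # problems where all 10 completions hack are dropped entirely
```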
?
Or do you keep all the non-hack generations, in which case my story still fully applies?
Yes, your code is exactly what we do.