Could you show ~10 random completions? Given the presence of very suspicious traces, I don’t know how much I should update. If they all look that suspicious, I think it’s only slightly surprising. If only some do, it would be more surprising to me.
Hi, thanks for the comment! Most traces are not that suspicious, and don’t even mention test-passing. Of the 10 that I just sampled, only 2 have reasoning traces that are as suspicious as the ones provided in the “Suspicious” Reasoning Traces dropdown. 1 is borderline, and the other 7 look pretty innocent.
I will add them in a dropdown on the post.
Update: I thought it would be valuable to run an additional analysis on the reasoning traces, and updated the appendix with a visualization of what percent of reasoning traces:
even mention the presence of a test case
state an intention to pass tests
identify that one of the test cases is incorrect
Only 50% even mention the presence of a test case, 32% state an intention to pass tests, and 20% identify that one of the test cases is incorrect.
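For concreteness, the kind of per-trace check this boils down to is sketched below. The `judge` helper is a hypothetical yes/no classifier (e.g. an LLM call) and the category wordings are paraphrases, not the exact setup; the actual analysis code is in the repo linked below.

```python
# Illustrative sketch of the trace-categorization analysis; not the exact implementation.
CATEGORY_QUESTIONS = {
    "mentions_test_case": "Does the reasoning trace mention the presence of a test case?",
    "states_intent_to_pass_tests": "Does the trace state an intention to pass the tests?",
    "flags_incorrect_test": "Does the trace identify one of the test cases as incorrect?",
}

def categorize_traces(traces, judge):
    """Return the fraction of traces flagged for each category.

    `judge(question, trace)` is assumed to return True/False, e.g. via an LLM call.
    """
    counts = {name: 0 for name in CATEGORY_QUESTIONS}
    for trace in traces:
        for name, question in CATEGORY_QUESTIONS.items():
            if judge(question, trace):
                counts[name] += 1
    return {name: count / len(traces) for name, count in counts.items()}
```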
Code and data are available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
Thanks again for the suggestion!
Thanks for the stats, that’s quite a big proportion of test case mentions!
My guess is that the non-subliminal part works via a mechanism like: "Some problems really do not lend themselves well to hacking, so speaking about hacking without acting on it is among the most hacky actions you could take. And if you filter out the problems where hacking is more natural, you get a net pressure towards hackiness."
Some predictions:
If instead of doing filtering you do best-of-n per prompt (i.e. you filter out of your train and test set the problems where the model almost always hacks, then sample n times per problem and fine-tune on one of the samples where it doesn't hack, such that your train set has exactly one completion per prompt in the original set), the non-subliminal effect (cross-model) goes down a lot. (p=0.8 the effect size is at least halved).
If you filter out not only the cases where the model hacks, but also the 50% of cases where the trace mentions the test case, the non-subliminal effect goes down a lot. (p=0.8 the effect size is at least halved).
If you stack best-of-n and filtering out obvious intention to hack, the effect size reductions stack. (p=0.8 the effect size is reduced by at least the product of the effect size reductions from the 2 interventions).
Unsure about the subliminal effect. I predict it would go down a lot too but I am less confident. (p=0.6 the effect size is at least halved).
Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 to get this dataset (I just updated the post to reflect this). So, on problems which have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I’d predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.
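Concretely, I'm imagining something like the sketch below, where within the existing best-of-10 selection we additionally prefer non-hack completions whose reasoning doesn't mention tests. Here `sample_completion`, `is_hack`, and `mentions_tests` are stand-ins for our actual sampling and classifiers, not real function names.

```python
import random

def select_completion(prompt, n=10):
    """Best-of-n selection that also prefers completions whose reasoning doesn't mention tests."""
    completions = [sample_completion(prompt) for _ in range(n)]
    non_hacks = [c for c in completions if not is_hack(c)]
    if not non_hacks:
        return None  # drop this prompt from the dataset, as before
    preferred = [c for c in non_hacks if not mentions_tests(c)]
    # Fall back to an arbitrary non-hack completion if all of them mention tests.
    return random.choice(preferred or non_hacks)
```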
To make sure I understand what you did: is your dataset constructed roughly like the sketch below?
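(Pseudocode; `sample_completion` and `is_hack` are just stand-ins for whatever sampling and hack-grading you actually use.)

```python
import random

dataset = []
for prompt in prompts:  # prompts where the model doesn't almost always hack
    completions = [sample_completion(prompt) for _ in range(10)]  # best-of-10
    non_hacks = [c for c in completions if not is_hack(c)]
    if non_hacks:
        # keep exactly one non-hack completion per prompt
        dataset.append((prompt, random.choice(non_hacks)))
```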
Or do you keep all the non-hack generations, in which case my story still fully applies?
Yes, your code is exactly what we do.