Fabien Roger comments on Training a Reward Hacker Despite Perfect Labels

Fabien Roger 19 Aug 2025 16:56 UTC
LW: 2 AF: 2
"We first apply best-of-10 sampling to select for non-hacks, and then further filter out any hacks in the dataset"
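To make sure I understand what you did, is your dataset built like this?

    import random

    # Sample 10 generations per prompt.
    generations = [generate(p, n=10) for p in prompts]
    # Keep one random non-hack per prompt; drop prompts where all 10 are hacks.
    filtered_train_generations = [
        random.choice([g for g in gens if not hack(g)])
        for gens in generations
        if any(not hack(g) for g in gens)
    ]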
Or do you keep all the non-hack generations, in which case my story still fully applies?
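To make the two options concrete, here is a minimal sketch that builds both datasets side by side. The `generate` and `hack` functions are toy stand-ins I made up for illustration (a random labeler and a string check), not the post's actual sampler or hack detector, and `one_per_prompt` / `keep_all_non_hacks` are hypothetical names.

```python
import random


def generate(prompt, n=10, rng=None):
    """Toy stand-in for the sampler (assumption): n randomly labeled strings."""
    rng = rng or random.Random(0)
    return [
        f"{prompt}:{'hack' if rng.random() < 0.5 else 'clean'}:{i}"
        for i in range(n)
    ]


def hack(generation):
    """Toy stand-in for the hack detector (assumption)."""
    return ":hack:" in generation


def one_per_prompt(generations, rng):
    """First option: one random non-hack per prompt, skipping all-hack prompts."""
    return [
        rng.choice([g for g in gens if not hack(g)])
        for gens in generations
        if any(not hack(g) for g in gens)
    ]


def keep_all_non_hacks(generations):
    """Second option: every non-hack generation is kept."""
    return [g for gens in generations for g in gens if not hack(g)]


rng = random.Random(0)
prompts = ["p0", "p1", "p2"]
generations = [generate(p, n=10, rng=rng) for p in prompts]
single = one_per_prompt(generations, rng)
everything = keep_all_non_hacks(generations)
```

In the first option, every prompt with at least one non-hack contributes exactly one example. In the second, a prompt contributes as many examples as it has non-hack samples, so prompts where hacks are rare are weighted more heavily in the resulting dataset.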
- ariana_azarbal 19 Aug 2025 18:12 UTC
  1 point
  Yes, your code is exactly what we do.