TurnTrout comments on Training on Documents About Reward Hacking Induces Reward Hacking

TurnTrout 22 Jan 2025 1:17 UTC
LW: 30 AF: 14
Great work! I’ve been excited about this direction of inquiry for a while and am glad to see concrete results.
Reward is not the optimization target (ignoring out-of-context reasoning, OOCR), but maybe if we write about reward maximizers enough, it’ll come true :p As Peter mentioned, filtering and/or gradient routing might help.