There’s a second, less prosaic, way to frame mitigating reward hacking as an ELK problem, although it requires making some extra assumptions.
Forget about what the coding model knows and just think about training the best monitor we can. There might be ways of improving the monitor’s understanding of whether code contains reward hacks that don’t automatically make it better at telling us about them. For example, suppose we train the monitor to predict some downstream consequence of running the code the agent generates, e.g. whether the test cases will pass[1]. Assume the monitor gets extremely good at this task (we can generate a lot of realistic data, so this is plausible). If ‘code passes for the intended reason’ and ‘code passes because of a reward hack’ are sufficiently distinct causal mechanisms, then we can argue that to be really good at this prediction task the monitor has to have a really good understanding of whether the code contains a reward hack[2].
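To make the setup concrete, here is a minimal sketch of what training such an outcome-predicting monitor could look like. Everything specific is an assumption for illustration: the backbone (`microsoft/codebert-base` as a placeholder), the toy dataset of (agent code, did-the-tests-pass) pairs, and the standard fine-tuning loop. None of this is meant as the actual setup, just the shape of it.

```python
# Sketch only: fine-tune a classifier to predict whether the agent's code will
# pass its tests. Model name and data format are placeholders, not a prescription.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "microsoft/codebert-base"  # placeholder; any code-aware encoder works


class CodeOutcomeDataset(Dataset):
    """Pairs of agent-generated code and the observed downstream outcome
    (1 = test suite passed, 0 = it did not)."""

    def __init__(self, examples, tokenizer, max_len=512):
        self.examples = examples
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        code, passed = self.examples[idx]
        enc = self.tokenizer(code, truncation=True, max_length=self.max_len,
                             padding="max_length", return_tensors="pt")
        return {k: v.squeeze(0) for k, v in enc.items()}, torch.tensor(passed)


def train_outcome_monitor(examples, epochs=3, lr=2e-5):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    loader = DataLoader(CodeOutcomeDataset(examples, tokenizer), batch_size=8, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch, labels in loader:
            # Note: the supervision signal is only "did the tests pass?".
            # Nothing here asks the model to report reward hacks.
            out = model(**batch, labels=labels)
            out.loss.backward()
            opt.step()
            opt.zero_grad()
    return model
```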
But then we have an ELK problem: although the monitor now has a really good understanding of reward hacks, we’ve trained it to predict downstream consequences rather than to report reward hacks, so we might need clever ways to extract that understanding (one naive baseline is sketched below).
(We’d be in a similar situation if we just happened to have predictive models that were really good at this task, even if we didn’t specifically train them on it.)
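As one naive illustration of what ‘extracting its understanding’ could look like mechanically, here is a sketch of a linear probe trained on the outcome monitor’s hidden states to predict ‘is this a reward hack’. This is a baseline, not a solution to the ELK problem: it assumes a small labelled set of known reward hacks (much of what is hard to get in the first place), and a probe may latch onto surface features rather than the monitor’s genuine knowledge.

```python
# Illustrative extraction baseline: probe the trained outcome monitor's hidden
# states for "reward hack vs. not". Assumes a hypothetical labelled set of known
# reward hacks; not a claim that this solves the ELK problem.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression


@torch.no_grad()
def hidden_features(model, tokenizer, code_snippets):
    """Mean-pooled last-layer hidden states from the outcome-prediction monitor."""
    feats = []
    model.eval()
    for code in code_snippets:
        enc = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
        out = model(**enc, output_hidden_states=True)
        feats.append(out.hidden_states[-1].mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)


def fit_reward_hack_probe(model, tokenizer, labelled_snippets):
    """labelled_snippets: list of (code, is_reward_hack) pairs (hypothetical data)."""
    codes, labels = zip(*labelled_snippets)
    X = hidden_features(model, tokenizer, list(codes))
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    # probe.predict_proba(hidden_features(...)) can then flag suspected hacks.
    return probe
```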
[1] Much like in the function correctness dataset in this paper.
[2] The METR post above notes that many reward hacks are pretty ambiguous, so this might be a big ‘if’.