Reward hacking in frontier models is a prosaic instance of the ELK problem.
I’ve been describing this example to people quite often recently and wanted to write it down somewhere I can link to.
Suppose we’re training a language model via reinforcement learning to solve coding problems, with a reward function that just checks whether the code passes some test cases. Current frontier models learn to sometimes reward hack, e.g. by special-casing the test cases rather than writing correct code (METR post with several examples).
We’d like to detect when a piece of code contains a reward hack, so that we can avoid reinforcing it. But the naive method of having a monitor inspect the code or the agent’s chain of thought is not ideal, since if it’s an imperfect classifier this might just teach the agent to reward hack in undetectable ways. Empirical studies have found some evidence for this.
But it seems like the coding models themselves probably know whether a solution they’ve generated contains a reward hack, even when the monitor can’t detect it[1].
So even if we can never get a perfect reward hack classifier, we might hope to detect all the reward hacks that the model itself knows about. This is both an easier problem — the information is in the model somewhere, we just have to get it out — and also covers the cases we care about most. It’s more worrying to be reinforcing intentional misaligned behaviour than honest mistakes.
The problem of finding out whether a model thinks some code contains a reward hack is an instance of the ELK problem, and the conceptual arguments for why it could be difficult carry over. This is the clearest case I know of where the ideas in that paper have been directly relevant to a real prosaic AI safety problem. It’s cool that quite speculative AI safety theory from a few years ago has become relevant at the frontier.
I think that people who are interested in ELK should probably now use detecting intentional reward hacking as their go-to example, whether they want to think about theoretical worst-case solutions or test ideas empirically.
I think that models usually know when they’re reward hacking, but I’m less confident that this applies to examples that fool a monitor. It could be that in these cases the model itself is usually also unaware of the hack.
There’s a second, less prosaic, way to frame mitigating reward hacking as an ELK problem, although it requires making some extra assumptions.
Forget about what the coding model knows and just think about training the best monitor we can. There might be ways of improving our monitor’s ability to understand whether code contains reward hacks that don’t automatically make the monitor better at telling us about them. For example, suppose we train the monitor to predict some downstream consequences of running code generated by the agent, e.g. whether the test cases will pass[1]. Assume the monitor gets extremely good at this task (we can generate a lot of realistic data, so this is plausible). If ‘code passes for the intended reason’ and ‘code passes because of a reward hack’ are sufficiently distinct causal mechanisms, then we can argue that in order to be really good at this task the monitor has to have a really good understanding of whether the code contains a reward hack[2].
But then we have an ELK problem, since although the monitor now has a really good understanding of reward hacks, we’ve trained it to predict downstream consequences rather than report reward hacks. We might need clever ways to extract its understanding of reward hacks.
(We’d be in a similar situation if we just happened to have predictive models that were really good at this task, even if we didn’t specifically train them on it.)
Reward hacking in frontier models is a prosaic instance of the ELK problem.
I’ve been describing this example to people quite often recently and wanted to write it down somewhere I can link to.
Suppose we’re training a language model via reinforcement learning to solve coding problems, with a reward function that just checks whether the code passes some test cases. Current frontier models learn to sometimes reward hack, e.g. by special-casing the test cases rather than writing correct code (METR post with several examples).
We’d like to detect when a piece of code contains a reward hack, so that we can avoid reinforcing it. But the naive method of having a monitor inspect the code or the agent’s chain of thought is not ideal, since if it’s an imperfect classifier this might just teach the agent to reward hack in undetectable ways. Empirical studies have found some evidence for this.
But it seems like the coding models themselves probably know whether a solution they’ve generated contains a reward hack, even when the monitor can’t detect it[1].
So even if we can never get a perfect reward hack classifier, we might hope to detect all the reward hacks that the model itself knows about. This is both an easier problem — the information is in the model somewhere, we just have to get it out — and also covers the cases we care about most. It’s more worrying to be reinforcing intentional misaligned behaviour than honest mistakes.
The problem of finding out whether a model thinks some code contains a reward hack is an instance of the ELK problem, and the conceptual arguments for why it could be difficult carry over. This is the clearest case I know of where the ideas in that paper have been directly relevant to a real prosaic AI safety problem. It’s cool that quite speculative AI safety theory from a few years ago has become relevant at the frontier.
I think that people who are interested in ELK should probably now use detecting intentional reward hacking as their go-to example, whether they want to think about theoretical worst-case solutions or test ideas empirically.
I think that models usually know when they’re reward hacking, but I’m less confident that this applies to examples that fool a monitor. It could be that in these cases the model itself is usually also unaware of the hack.
There’s a second, less prosaic, way to frame mitigating reward hacking as an ELK problem, although it requires making some extra assumptions.
Forget about what the coding model knows and just think about training the best monitor we can. There might be ways of improving our monitor’s ability to understand whether code contains reward hacks that don’t automatically make the monitor better at telling us about them. For example, suppose we train the monitor to predict some downstream consequences of running code generated by the agent, e.g. whether the test cases will pass[1]. Assume the monitor gets extremely good at this task (we can generate a lot of realistic data, so this is plausible). If ‘code passes for the intended reason’ and ‘code passes because of a reward hack’ are sufficiently distinct causal mechanisms, then we can argue that in order to be really good at this task the monitor has to have a really good understanding of whether the code contains a reward hack[2].
But then we have an ELK problem, since although the monitor now has a really good understanding of reward hacks, we’ve trained it to predict downstream consequences rather than report reward hacks. We might need clever ways to extract its understanding of reward hacks.
(We’d be in a similar situation if we just happened to have predictive models that were really good at this task, even if we didn’t specifically train them on it.)
Much like in the function correctness dataset in this paper.
The METR post above notes that many reward hacks are pretty ambiguous, so this might be a big ‘if’.