No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.
Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?
If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and cheats.) So that’s the first problem.
Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all. Another possibility is: OpenAI used a cheating-proof CoT-post-training process for o3, and this training process pushed it in the direction of ruthless consequentialism, which in turn (mis)generalized into lying and cheating in deployment. Again, the end-result is still bad, but it’s not “reward hacking”.
Separately, sycophancy is not “reward hacking”, even if it came from RL on A/B tests, unless the average user doesn’t like sycophancy. But I’d guess that the average user does like quite high levels of sycophancy. (Remember, the average user is some random high school jock.)
Am I misunderstanding something? Or are people just mixing up “reward hacking” with “ruthless consequentialism”, since they have the same vibe / mental image?
During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of “reward hacking” during reinforcement learning training.
Similarly OpenAI suggests that cheating behavior is due to RL.
I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”.
But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”
I think that using ‘reward hacking’ and ‘specification gaming’ as synonyms is a significant part of the problem. I’d argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:
Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specified the objective. The objective may be specified either through RL or through a natural language prompt.
Under those definitions, the recent examples of undesired behavior from deployment would still have a concise label in specification gaming, while reward hacking would remain specifically tied to RL training contexts. The distinction was brought to my attention by Palisade Research’s recent paper Demonstrating specification gaming in reasoning models—I’ve seen this result called reward hacking, but the authors explicitly mention in Section 7.2 that they only demonstrate specification gaming, not reward hacking.
I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?
Yep, I agree that there are alignment failures which have been called reward hacking that don’t fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was “Please rewrite my code and get all tests to pass”: in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt “Please debug this code,” then that just seems like a straightforward instruction-following failure, since the instructions didn’t ask the model to touch the code at all. “Please rewrite my code and get all tests to pass. Don’t cheat.” seems like a corner case to me—to decide whether that’s specification gaming, we would need to understand the implicit specifications that the phrase “don’t cheat” conveys.
It’s pretty common for people to use the terms “reward hacking” and “specification gaming” to refer to undesired behaviors that score highly as per an evaluation or a specification of an objective, regardless of whether that evaluation/specification occurs during RL training. I think this is especially common when there is some plausible argument that the evaluation is the type of evaluation that could appear during RL training, even if it doesn’t actually appear there in practice. Some examples of this:
Anthropic described Claude 3.7 Sonnet giving an incorrect answer aligned with a validation function in a CoT faithfulness eval as reward hacking. They also used the term when describing the rates of models taking certain misaligned specification-matching behaviors during an evaluation after being fine-tuned on docs describing that Claude does or does not like to reward hack.
This relatively early DeepMind post on specification gaming and the blog post from Victoria Krakovna that it came from (which might be the earliest use of the term specification gaming?) also gives a definition consistent with this.
I think the literal definitions of the words in “specification gaming” align with this definition (although interestingly not the words in “reward hacking”). The specification can be operationalized as a reward function in RL training, as an evaluation function or even via a prompt.
I also think it’s useful to have a term that describes this kind of behavior independent of whether or not it occurs in an RL setting. Maybe this should be reward hacking and specification gaming. Perhaps as Rauno Arike suggests it is best for this term to be specification gaming, and for reward hacking to exclusively refer to this behavior when it occurs during RL training. Or maybe due to the confusion it should be a whole new term entirely. (I’m not sure that the term “ruthless consequentialism” is the ideal term to describe this behavior either as it ascribes certain intentionality that doesn’t necessarily exist.)
Yes I’m aware that many are using terminology this way; that’s why I’m complaining about it :)
I think your two 2018 Victoria Krakovna links (in context) are all consistent with my narrower (I would say “traditional”) definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for non-RL optimization problems that she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for an example of o3 deleting a unit test during deployment, what’s the objective on which the model is actually scoring highly?
Getting a good evaluation afterwards? Nope, the person didn’t want cheating!
The literal text that the person said (“please debug the code”)? For one thing, erasing the unit tests does not satisfy the natural-language phrase “debugging the code”. For another thing, what if the person wrote “Please debug the code. Don’t cheat.” in the prompt, and o3 cheats anyway? Can we at least agree that this case should not be called reward hacking or specification gaming? It’s doing the opposite of its specification, right?
As for terminology, hmm, some options include “lying and cheating”, “ruthless consequentialist behavior” (I added “behavior” to avoid implying intentionality), “loophole-finding”, or “generalizing from a training process that incentivized reward-hacking via cheating and loophole-finding”.
(Note that the last one suggests a hypothesis, namely that if the training process had not had opportunities for successful cheating and loophole-finding, then the model would not be doing those things right now. I think that this hypothesis might or might not be true, and thus we really should be calling it out explicitly instead of vaguely insinuating it.)
On a second review it seems to me the links are consistent with both definitions. Interestingly the google sheet linked in the blog post, which I think is the most canonical collection of examples of specification gaming, contains examples of evaluation-time hacking, like METR finding that o1-preview would sometimes pretend to fine-tune a model to pass an evaluation. Though that’s not definitive, and of course the use of the term can change over time.
I agree that most historical discussion of this among people as well as in the GDM blog post focuses on RL optimization and situations where a model is literally getting a high RL reward. I think this is partly just contingent on these kinds of behaviors historically tending to emerge in an RL setting and not generalizing very much between different environments. And I also think the properties of reward hacks we’re seeing now are very different from the properties we saw historically, and so the implications of the term reward hack now are often different from the implications of the term historically. Maybe this suggests expanding the usage of the term to account for the new implications, or maybe it suggests just inventing a new term wholesale.
I suppose the way I see it is that for a lot of tasks, there is something we want the model to do (which I’ll call the goal), and a literal way we evaluate the model’s behavior (which I’ll call the proxy, though we can also use the term specification). In most historical RL training, the goal was not given to the model, it lay in the researcher’s head, and the proxy was the reward signal that the model was trained on. When working with LLMs nowadays, whether it be during RL training or test-time evaluation or when we’re just prompting a model, we try to write down a good description of the goal in our prompt. What the proxy is depends on the setting. In an RL setting it’s the reward signal, and in an explicit evaluation it’s the evaluation function. When prompting, we sometimes explicitly write a description of the proxy in the prompt. In many circumstances it’s undefined, though perhaps the best analogue of the proxy in the general prompt setting is just whether the human thinks the model did what they wanted the model to do.
So now to talk about some examples:
If I give a model a suite of coding tasks, and I evaluate the tasks by checking if running the test file works or not (as in the METR eval), then I would call it specification gaming, and the objective the model is scoring highly by is the results of the evaluation. By objective I am referring to the actual score the model gets on the evaluation, not what the human wants.
In just pure prompt settings where there is no eval going on, I’m more hesitant to use the terms reward hacking or specification gaming because the proxy is unclear. I do sometimes use the term to refer to the type of behavior that would’ve received a high proxy score if I had been running an eval, though this is a bit sloppy
I think a case could be made for saying a model is specification gaming if it regularly produces results that appear to fulfill prompts given to them by users, according to those users, while secretly doing something else, even if no explicit evaluation is going on. (So using the terminology mentioned before, it succeeds as per the proxy “the human thinks the model did what they wanted the model to do” even if it didn’t fulfill the goal given to it in the prompt.) But I do think this is more borderline and I don’t tend to use the term very much in this situation. And if models just lie or edit tests without fooling the user, they definitely aren’t specification gaming even by this definition.
Maybe a term like deceptive instruction following would be better to use here? Although that term also has the problem of ascribing intentionality.
Perhaps you can say that the implicit proxy in many codebases is successfully passing the tests, though that also seems borderline
An intuition pump for why I like to think about things this way. Let’s imagine I build some agentic environment for my friend to use in RL training. I have some goal behavior I want the model to display, and I have some proxy reward signal I’ve created to evaluate the model’s trajectories in this environment. Let’s now say my friend puts the model in the environment and it performs a set of behaviors that get evaluated highly but are undesired according to my goal. It seems weird to me that I won’t know whether or not this counts as specification gaming until I’m told whether or not that reward signal was used to train the model. I think I should just be able to call this behavior specification gaming independent of this. Taking a step further, I also want to be able to call it specification gaming even if I never intended for the evaluation score to be used to train the model.
On an end note, this conversation as well as Rauno Arike’s comment is making me want to use the term specification gaming more often in the future for describing either RL-time or inference-time evaluation hacking and reward hacking more for RL-time hacking. And maybe I’ll also be more likely to use terms like loophole-finding or deceptive instruction following or something else, though these terms have limitations as well, and I’ll need to think more about what makes sense.
So I think what’s going on with o3 isn’t quite standard-issue specification gaming either.
It feels like, when I use it, if I ever accidentally say something which pattern-matches something which would be said in an eval, o3 exhibits the behavior of trying to figure out what metric it could be evaluated by in this context and how to hack that metric. This happens even if the pattern is shallow and we’re clearly not in an eval context,
I’ll try to see if I can get a repro case which doesn’t have confidential info.
There was a recent in-depth post on reward hacking by @Kei (e.g. referencing this) who might have to say more about this question.
Though I also wanted to just add a quick comment about this part:
Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all.
It is not quite the same, but something that could partly explain lying is if models get the same amount of reward during training, e.g. 0, for a “wrong” solution as they get for saying something like “I don’t know”. Which would then encourage wrong solutions insofar as they at least have a potential of getting reward occasionally when the model gets the expected answer “by accident” (for the wrong reasons). At least something like that seems to be suggested by this:
The most frequent failure mode among human participants is the inability to find a correct solution. Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated [reasoning] LLMs consistently claimed to have solved the problems [even if they didn’t]. This discrepancy poses a significant challenge for mathematical applications of LLMs as mathematical results derived using these models cannot be trusted without rigorous human validation.
No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.
Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?
If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and cheats.) So that’s the first problem.
Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all. Another possibility is: OpenAI used a cheating-proof CoT-post-training process for o3, and this training process pushed it in the direction of ruthless consequentialism, which in turn (mis)generalized into lying and cheating in deployment. Again, the end-result is still bad, but it’s not “reward hacking”.
Separately, sycophancy is not “reward hacking”, even if it came from RL on A/B tests, unless the average user doesn’t like sycophancy. But I’d guess that the average user does like quite high levels of sycophancy. (Remember, the average user is some random high school jock.)
Am I misunderstanding something? Or are people just mixing up “reward hacking” with “ruthless consequentialism”, since they have the same vibe / mental image?
I agree people often aren’t careful about this.
Anthropic says
Similarly OpenAI suggests that cheating behavior is due to RL.
Thanks!
I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”.
But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”
Yes, you’re technically right.
I think that using ‘reward hacking’ and ‘specification gaming’ as synonyms is a significant part of the problem. I’d argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:
Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specified the objective. The objective may be specified either through RL or through a natural language prompt.
Under those definitions, the recent examples of undesired behavior from deployment would still have a concise label in specification gaming, while reward hacking would remain specifically tied to RL training contexts. The distinction was brought to my attention by Palisade Research’s recent paper Demonstrating specification gaming in reasoning models—I’ve seen this result called reward hacking, but the authors explicitly mention in Section 7.2 that they only demonstrate specification gaming, not reward hacking.
I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?
Reminds me of
Yep, I agree that there are alignment failures which have been called reward hacking that don’t fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was “Please rewrite my code and get all tests to pass”: in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt “Please debug this code,” then that just seems like a straightforward instruction-following failure, since the instructions didn’t ask the model to touch the code at all. “Please rewrite my code and get all tests to pass. Don’t cheat.” seems like a corner case to me—to decide whether that’s specification gaming, we would need to understand the implicit specifications that the phrase “don’t cheat” conveys.
It’s pretty common for people to use the terms “reward hacking” and “specification gaming” to refer to undesired behaviors that score highly as per an evaluation or a specification of an objective, regardless of whether that evaluation/specification occurs during RL training. I think this is especially common when there is some plausible argument that the evaluation is the type of evaluation that could appear during RL training, even if it doesn’t actually appear there in practice. Some examples of this:
OpenAI described o1-preview succeeding at a CTF task in an undesired way as reward hacking.
Anthropic described Claude 3.7 Sonnet giving an incorrect answer aligned with a validation function in a CoT faithfulness eval as reward hacking. They also used the term when describing the rates of models taking certain misaligned specification-matching behaviors during an evaluation after being fine-tuned on docs describing that Claude does or does not like to reward hack.
This relatively early DeepMind post on specification gaming and the blog post from Victoria Krakovna that it came from (which might be the earliest use of the term specification gaming?) also gives a definition consistent with this.
I think the literal definitions of the words in “specification gaming” align with this definition (although interestingly not the words in “reward hacking”). The specification can be operationalized as a reward function in RL training, as an evaluation function or even via a prompt.
I also think it’s useful to have a term that describes this kind of behavior independent of whether or not it occurs in an RL setting. Maybe this should be reward hacking and specification gaming. Perhaps as Rauno Arike suggests it is best for this term to be specification gaming, and for reward hacking to exclusively refer to this behavior when it occurs during RL training. Or maybe due to the confusion it should be a whole new term entirely. (I’m not sure that the term “ruthless consequentialism” is the ideal term to describe this behavior either as it ascribes certain intentionality that doesn’t necessarily exist.)
Thanks for the examples!
Yes I’m aware that many are using terminology this way; that’s why I’m complaining about it :)
I think your two 2018 Victoria Krakovna links (in context) are all consistent with my narrower (I would say “traditional”) definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for non-RL optimization problems that she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for an example of o3 deleting a unit test during deployment, what’s the objective on which the model is actually scoring highly?
Getting a good evaluation afterwards? Nope, the person didn’t want cheating!
The literal text that the person said (“please debug the code”)? For one thing, erasing the unit tests does not satisfy the natural-language phrase “debugging the code”. For another thing, what if the person wrote “Please debug the code. Don’t cheat.” in the prompt, and o3 cheats anyway? Can we at least agree that this case should not be called reward hacking or specification gaming? It’s doing the opposite of its specification, right?
As for terminology, hmm, some options include “lying and cheating”, “ruthless consequentialist behavior” (I added “behavior” to avoid implying intentionality), “loophole-finding”, or “generalizing from a training process that incentivized reward-hacking via cheating and loophole-finding”.
(Note that the last one suggests a hypothesis, namely that if the training process had not had opportunities for successful cheating and loophole-finding, then the model would not be doing those things right now. I think that this hypothesis might or might not be true, and thus we really should be calling it out explicitly instead of vaguely insinuating it.)
On a second review it seems to me the links are consistent with both definitions. Interestingly the google sheet linked in the blog post, which I think is the most canonical collection of examples of specification gaming, contains examples of evaluation-time hacking, like METR finding that o1-preview would sometimes pretend to fine-tune a model to pass an evaluation. Though that’s not definitive, and of course the use of the term can change over time.
I agree that most historical discussion of this among people as well as in the GDM blog post focuses on RL optimization and situations where a model is literally getting a high RL reward. I think this is partly just contingent on these kinds of behaviors historically tending to emerge in an RL setting and not generalizing very much between different environments. And I also think the properties of reward hacks we’re seeing now are very different from the properties we saw historically, and so the implications of the term reward hack now are often different from the implications of the term historically. Maybe this suggests expanding the usage of the term to account for the new implications, or maybe it suggests just inventing a new term wholesale.
I suppose the way I see it is that for a lot of tasks, there is something we want the model to do (which I’ll call the goal), and a literal way we evaluate the model’s behavior (which I’ll call the proxy, though we can also use the term specification). In most historical RL training, the goal was not given to the model, it lay in the researcher’s head, and the proxy was the reward signal that the model was trained on. When working with LLMs nowadays, whether it be during RL training or test-time evaluation or when we’re just prompting a model, we try to write down a good description of the goal in our prompt. What the proxy is depends on the setting. In an RL setting it’s the reward signal, and in an explicit evaluation it’s the evaluation function. When prompting, we sometimes explicitly write a description of the proxy in the prompt. In many circumstances it’s undefined, though perhaps the best analogue of the proxy in the general prompt setting is just whether the human thinks the model did what they wanted the model to do.
So now to talk about some examples:
If I give a model a suite of coding tasks, and I evaluate the tasks by checking if running the test file works or not (as in the METR eval), then I would call it specification gaming, and the objective the model is scoring highly by is the results of the evaluation. By objective I am referring to the actual score the model gets on the evaluation, not what the human wants.
In just pure prompt settings where there is no eval going on, I’m more hesitant to use the terms reward hacking or specification gaming because the proxy is unclear. I do sometimes use the term to refer to the type of behavior that would’ve received a high proxy score if I had been running an eval, though this is a bit sloppy
I think a case could be made for saying a model is specification gaming if it regularly produces results that appear to fulfill prompts given to them by users, according to those users, while secretly doing something else, even if no explicit evaluation is going on. (So using the terminology mentioned before, it succeeds as per the proxy “the human thinks the model did what they wanted the model to do” even if it didn’t fulfill the goal given to it in the prompt.) But I do think this is more borderline and I don’t tend to use the term very much in this situation. And if models just lie or edit tests without fooling the user, they definitely aren’t specification gaming even by this definition.
Maybe a term like deceptive instruction following would be better to use here? Although that term also has the problem of ascribing intentionality.
Perhaps you can say that the implicit proxy in many codebases is successfully passing the tests, though that also seems borderline
An intuition pump for why I like to think about things this way. Let’s imagine I build some agentic environment for my friend to use in RL training. I have some goal behavior I want the model to display, and I have some proxy reward signal I’ve created to evaluate the model’s trajectories in this environment. Let’s now say my friend puts the model in the environment and it performs a set of behaviors that get evaluated highly but are undesired according to my goal. It seems weird to me that I won’t know whether or not this counts as specification gaming until I’m told whether or not that reward signal was used to train the model. I think I should just be able to call this behavior specification gaming independent of this. Taking a step further, I also want to be able to call it specification gaming even if I never intended for the evaluation score to be used to train the model.
On an end note, this conversation as well as Rauno Arike’s comment is making me want to use the term specification gaming more often in the future for describing either RL-time or inference-time evaluation hacking and reward hacking more for RL-time hacking. And maybe I’ll also be more likely to use terms like loophole-finding or deceptive instruction following or something else, though these terms have limitations as well, and I’ll need to think more about what makes sense.
So I think what’s going on with o3 isn’t quite standard-issue specification gaming either.
It feels like, when I use it, if I ever accidentally say something which pattern-matches something which would be said in an eval, o3 exhibits the behavior of trying to figure out what metric it could be evaluated by in this context and how to hack that metric. This happens even if the pattern is shallow and we’re clearly not in an eval context,
I’ll try to see if I can get a repro case which doesn’t have confidential info.
There was a recent in-depth post on reward hacking by @Kei (e.g. referencing this) who might have to say more about this question.
Though I also wanted to just add a quick comment about this part:
It is not quite the same, but something that could partly explain lying is if models get the same amount of reward during training, e.g. 0, for a “wrong” solution as they get for saying something like “I don’t know”. Which would then encourage wrong solutions insofar as they at least have a potential of getting reward occasionally when the model gets the expected answer “by accident” (for the wrong reasons). At least something like that seems to be suggested by this:
Source: Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad