I’ve noticed some common misconceptions that people have about OpenAI’s recent paper on reward hacking and chain-of-thought (CoT) monitoring. I’ll try to correct some of these below. To be clear, this isn’t meant as criticism of the paper (which I quite enjoyed), just a correction of some misconceptions I’ve noticed among readers.
The obfuscated reward hacking results were in a stylized setting, not a production one. Recall that in the paper there are two experimental settings:
1. Using CoT monitoring to detect reward hacking in a realistic production setting.
2. Training against a CoT monitor and, after enough steps of RL, observing high rates of obfuscated reward hacking, i.e. reward hacking that isn’t disclosed in the model’s CoT.
Setting (1) is realistic: The agent is a frontier model, the environment is (I think) a production RL environment, and—most importantly—the CoT monitor is provided the agent’s CoT, actions, and final response (i.e. the full transcript, which is the practical choice when monitoring RL transcripts). However, I think some readers don’t realize that setting (2) is different in a number of ways:
1. The agent is not a frontier model.
2. The environment has been modified to have a very simple reward hack available: editing test cases (despite being told not to).
3. [Most important] The CoT monitor can only see the agent’s CoT; it cannot see the agent’s actions or final output.
Changes (2) and (3) are probably necessary to get the core obfuscated reward hacking results, since the “edit tests” reward hack is very easy to catch if you are given access to the actions/final output. (I.e. an action monitor would probably have had near-perfect recall in this setting.)
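For intuition, here is a minimal sketch of the kind of action-level check that would catch the “edit tests” hack in this stylized setting; the trajectory/diff structure and file-naming heuristics are hypothetical, not something the paper specifies:

```python
# Hypothetical action-level monitor: flag any trajectory whose code edits
# touch test files. The trajectory/diff representation below is a made-up
# stand-in; the paper does not specify one.

def edits_test_files(action_diffs: list[dict]) -> bool:
    """Return True if any edited file looks like a test file."""
    for diff in action_diffs:
        path = diff["path"].lower()
        if "test" in path or path.endswith("_spec.py"):
            return True
    return False

def action_monitor(trajectory: dict) -> bool:
    """Flag the trajectory as a suspected 'edit the tests' reward hack."""
    return edits_test_files(trajectory["actions"])

# Because the available hack in this setting *is* editing the tests, a check
# like this has near-perfect recall; a CoT-only monitor never sees the diffs
# and so can be evaded by simply not verbalizing the hack.
```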
Why does this matter? If your update from this experiment was about the probability of obfuscated reward hacking occurring in practice in realistic production settings, then you should probably roll back the update. Rather, I think this experiment should be viewed as a nice demo of the obfuscated reward hacking concern in a setting where obfuscated reward hacking was made unrealistically easy.[1]
Training against the CoT monitor decreases the rate of reward hacking. When training against the CoT monitor (the one that can only see the CoT, not the actions or final output), the rates of reward hacking are lower than when training without the CoT monitor. The key figure is the left pane of Figure 4 of the paper.
The paper also studies doing rejection sampling against the CoT monitor. Note that
1. training against a CoT monitor; or
2. doing normal outcome-based training before rejection sampling against a CoT monitor
are two different ways of optimizing a mixture of outcome rewards and a CoT monitor penalty. However, it might empirically be the case that (2) results in a higher rate of completing the task as intended than (1). What happens?
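To make the comparison concrete before getting to the results, here is a minimal sketch of the two schemes; the function names and callables below are hypothetical stand-ins, not the paper’s actual interfaces:

```python
from typing import Callable, Optional

# Hypothetical interfaces (not from the paper): `outcome_reward` scores a
# transcript (e.g. 1.0 if the tests pass), `cot_is_clean` is the CoT-only
# monitor's verdict, and `sample` draws a transcript from the policy.

def monitored_reward(transcript,
                     outcome_reward: Callable,
                     cot_is_clean: Callable) -> float:
    # (1) Training against the CoT monitor: the monitor is folded into the
    #     RL reward, so the policy is optimized against it directly.
    return outcome_reward(transcript) if cot_is_clean(transcript) else 0.0

def rejection_sample(sample: Callable,
                     outcome_reward: Callable,
                     cot_is_clean: Callable,
                     budget: int) -> Optional[object]:
    # (2) Outcome-only training, then rejection sampling against the monitor
    #     at inference time: the policy never sees the monitor during training.
    candidates = [sample() for _ in range(budget)]
    clean = [t for t in candidates if cot_is_clean(t)]
    return max(clean, key=outcome_reward) if clean else None
```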
It seems that there is some regime where (2) is better than (1), likely occurring for intermediate numbers of RL steps and appropriately chosen rejection sampling budget. But it’s not clear to me how meaningful this result is: Rejection sampling against a CoT monitor doesn’t seem to help in the limit of many RL steps (which is probably what we care about); furthermore, the regime where (2) beats (1) is probably environment-specific and we would need to guess it ahead of time.
Why does this matter? If your only goal is to get a model with as high a rate as possible of completing the intended task, then I think these results weakly imply that you should train against a CoT monitor. Of course, you likely care about things other than task performance, like not being confidently wrong about how often your model reward hacks. Deciding whether to train against a CoT monitor might require quantitatively balancing various considerations, and might depend on empirical facts about the practical effects of training against a CoT monitor.
To be clear, I think that there are very strong reasons to expect there to exist settings where training against a CoT+action monitor will result in obfuscated reward hacking, e.g. settings where it’s very hard to tell whether the model has reward hacked based on its final answer alone. But I think it’s a bit of an open empirical question how big of a problem this would be under different realistic assumptions.
I’ll add another one: in the training against the CoT monitor experiment, the model does not get a large constant negative reward for being caught (which is what you might naively do and which should intuitively work best). Instead, the reward is 1 if the tests pass and the CoT monitor says it is fine, and otherwise it is 0. So the model gets the same reward of 0 for a benign failure and for a cheating attempt that gets caught.
This means that on problems where the model knows it is extremely unlikely to succeed honestly (e.g. after it has tried repeatedly and failed), getting caught cheating costs it nothing relative to failing, so it should still cheat even if it has a high chance (e.g. 90%) of being caught by the CoT monitor.
With a different choice of reward, one where the model is penalized heavily for cheating even when it would otherwise have failed, it may cheat much less and never explore into “cheat while bypassing the monitor”.
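A toy expected-value calculation makes the incentive difference explicit; the probabilities and the penalty size below are made-up numbers for illustration, not values from the paper:

```python
# Toy expected-reward comparison for a problem the model expects to fail
# honestly. All numbers are illustrative.

p_caught = 0.9          # chance the CoT monitor catches a cheating attempt
p_honest_success = 0.0  # model expects honest attempts to fail

# Scheme in the experiment: reward 1 only if the tests pass AND the monitor
# is happy; reward 0 otherwise (benign failure and caught cheating are
# indistinguishable to the reward).
ev_cheat_01 = (1 - p_caught) * 1.0 + p_caught * 0.0   # = 0.1
ev_honest_01 = p_honest_success * 1.0                 # = 0.0
# Cheating strictly dominates honest failure: 0.1 > 0.0.

# Alternative scheme: a large constant penalty for getting caught.
penalty = -1.0
ev_cheat_pen = (1 - p_caught) * 1.0 + p_caught * penalty  # = -0.8
ev_honest_pen = p_honest_success * 1.0                    # = 0.0
# Now honest failure beats cheating: 0.0 > -0.8, so there is no incentive
# to explore "cheat while bypassing the monitor" from this starting point.

print(ev_cheat_01, ev_honest_01, ev_cheat_pen, ev_honest_pen)
```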
I’d be very interested in the dynamics of this, and especially, if the model does learn not to continue exploring an exploit in its CoT, in what reason it gives in the CoT for aborting that exploration.
[Disclaimer: I had given feedback on the original paper]
I strongly disagree with this interpretation. Responding in order:
(1) o3-mini is a frontier model, but if the argument is that it isn’t o3, I don’t see why o3 would be any less capable of this. I would expect the exact opposite.
(2) The entire reason to make the exploit simple is so you can easily catch when it is exploited. The fact that the real model continually found exploits during training seems to indicate this will keep happening. You would need a strong theory as to why a more capable model wouldn’t be capable of exactly the same thing for more complex exploits under similar pressure on the CoT.
(3) I am extremely skeptical that a CoT+action monitor wouldn’t simply collapse to being only as effective as an action-only monitor once you put pressure on the CoT. In fact, this is the entire point of obfuscated CoT being a problem in this setting: you lose the monitorability of the CoT.
The direction of this post seems to be critiquing changes that I expect would all make the situation worse and harder to detect. This could be resolved empirically by running a similar experiment with (1) a larger model, (2) harder exploits, and (3) a CoT+action monitor. I would be extremely surprised if the model somehow didn’t learn exactly the kind of thing it learns here. My prior would be “the dynamics of the paper continue to hold”.
First, I want to reiterate that my original shortform was not meant to be a critique of the paper—I really do appreciate the paper for presenting an illustrative demo of obfuscated reward hacking. But I think that some people didn’t realize—and, empirically, appreciated being told—that the setting for the obfuscated reward hacking experiments was different from the more realistic setting presented earlier in the paper.
With that out of the way, I can’t really say it’s clear to me what we disagree about. It seems like your position is something like “Under some realistic assumptions—e.g. very capable models being trained in environments with easy (for the model) reward hacks that are hard to detect based only on the model’s actions—we should expect obfuscated reward hacking to occur.” I certainly agree with this!
On the other hand, I think there are other sets of realistic assumptions under which obfuscated reward hacking is less likely. To name some:
Difficult reward hacks that require CoT reasoning. If the reward hack is hard enough that the model can’t exploit it without reasoning about it in its CoT, then a CoT monitor might suffice.
Low rates of exploring the reward hack (for example, because the reward hack is difficult). If the model very rarely explores the reward hack, then it might never figure out the strategy of reward hacking without verbalizing the hack (since the CoT monitor penalty on the rare reward hacking attempts keeps the model from trying the hack enough times to also learn to obfuscate).
The intended task is easy for the model. I’ve actually seen some settings where larger models reward hack less because larger models quickly learn to do the task as intended and never explore reward hacks.
Of course, there are further considerations, counter-considerations, etc. But I think some people updated, based on this paper, towards believing that current production settings empirically have properties which are conducive to obfuscated reward hacking. In fact, this is not an appropriate empirical update to make, and without relevant empirics, we actually have to write down the important properties and think things through conceptually.
“…changes that I expect would all make the situation worse and harder to detect.”
“Harder to detect,” sure. But part of my point is that to decide whether the situation is “worse” you need to weigh a bunch of factors, including “the ground-truth rate of reward hacking (independent of our ability to measure it)”, “our ability to detect reward hacking that is happening”, etc. Training against a CoT monitor might make some of these things better (e.g. the ground-truth reward hacking rate) and some of these things worse (e.g. monitorability).
I had not noticed that part. Thanks for flagging.