[Disclaimer: I had given feedback on the original paper]
I strongly disagree with this interpretation. Responding in order:
(1) o3-mini is a frontier model, but if the argument is that it isn’t o3, I don’t see why o3 would be any less capable of this. I would expect the exact opposite.
(2) The entire reason to make the exploit simple is so you can easily catch when it is exploited. The fact that the model continually found exploits during the actual training run suggests this will keep happening. You would need a strong theory as to why a more capable model wouldn’t do the exact same thing with more complex exploits under similar pressure on the CoT.
(3) I am extremely skeptical that a CoT+action monitor wouldn’t simply collapse to being only as effective as an action-only monitor once there is pressure on the CoT. In fact, this is the entire point of obfuscated CoT being a problem in this setting: you lose the monitorability of the CoT.
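To make (3) concrete, here is a minimal sketch of the kind of reward shaping being discussed (the function names, penalty weight, and monitor stubs are all hypothetical, not from the paper). Once training pressure teaches the policy to produce innocuous-looking CoT even while hacking, the CoT term stops firing and the combined monitor is only as good as the action term:

```python
# Hypothetical sketch of training against a combined CoT+action monitor.
# monitor_flags_cot / monitor_flags_action are stand-in stubs, not real APIs.

def monitor_flags_cot(cot: str) -> bool:
    """Stub: does the chain of thought verbalize intent to exploit the grader?"""
    ...

def monitor_flags_action(action: str) -> bool:
    """Stub: is the exploit visible in the action alone? (Often a weaker signal.)"""
    ...

def shaped_reward(task_reward: float, cot: str, action: str,
                  penalty: float = 1.0) -> float:
    """Task reward minus a penalty whenever either monitor fires."""
    flagged = monitor_flags_cot(cot) or monitor_flags_action(action)
    return task_reward - (penalty if flagged else 0.0)

# The worry in (3): optimization against this reward pushes the policy toward
# CoT for which monitor_flags_cot is ~always False even during a hack, so
# `flagged` reduces to monitor_flags_action alone -- i.e. the CoT+action
# monitor collapses to the effectiveness of an action-only monitor.
```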
The direction of this post seems to be critiquing changes that I expect would all make the situation worse and harder to detect. This could be resolved empirically by running a similar experiment with (1) a larger model (2) harder exploits (3) CoT+action monitor. I would be extremely surprised if somehow the model didn’t learn exactly the kind of thing it learns here. My prior would be “the dynamics of the paper continue to hold”.
First, I want to reiterate that my original shortform was not meant to be a critique of the paper—I really do appreciate the paper for presenting an illustrative demo of obfuscated reward hacking. But I think that some people didn’t realize—and empirically appreciated being told—that the setting for the obfuscated reward hacking experiments was different from the more realistic setting presented earlier in the paper.
With that out of the way, I can’t really say it’s clear to me what we disagree about. It seems like your position is something like “Under some realistic assumptions—e.g. very capable models being trained in environments with easy (for the model) reward hacks that are hard to detect based only on the model’s actions—we should expect obfuscated reward hacking to occur.” I certainly agree with this!
On the other hand, I think there are other sets of realistic assumptions under which obfuscated reward hacking is less likely. To name some:
Difficult reward hacks that require CoT reasoning. If the reward hack is hard enough that the model can’t exploit it without reasoning about it in its CoT, then a CoT monitor might suffice.
Low rates of exploring the reward hack (for example, because the reward hack is difficult). If the model very rarely explores the reward hack, then it might never figure out the strategy of reward hacking without verbalizing the hack (since the CoT monitor penalty on the rare reward hacking attempts keeps the model from trying the hack enough times to also learn to obfuscate); see the toy sketch after this list.
The intended task is easy for the model. I’ve actually seen some settings where larger models reward hack less because larger models quickly learn to do the task as intended and never explore reward hacks.
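To illustrate the second assumption above (low exploration rates), here is a purely toy simulation; every number in it is made up, and it is only meant to show the qualitative dynamic where a CoT-monitor penalty on rare, verbalized hack attempts can extinguish the behavior before the model accumulates enough hacking experience to learn obfuscation:

```python
import random

# Toy model with entirely made-up numbers; it illustrates the dynamic only,
# not any measured training run.

random.seed(0)

p_hack = 0.01           # chance per episode of exploring the hack (low: a hard hack)
penalty_decay = 0.5     # each caught, verbalized attempt halves that propensity
reward_boost = 1.5      # each uncaught (obfuscated) hack reinforces it instead
obfuscation_after = 50  # assumed hacking experience needed before obfuscation is learned

hack_attempts = 0
for episode in range(100_000):
    if random.random() < p_hack:
        hack_attempts += 1
        if hack_attempts < obfuscation_after:
            p_hack *= penalty_decay   # early attempts are verbalized and penalized
        else:
            p_hack *= reward_boost    # later attempts would be obfuscated and rewarded
        p_hack = min(p_hack, 1.0)

print(f"hack attempts: {hack_attempts}, final p_hack: {p_hack:.1e}")
# With these numbers the hacking propensity collapses after a handful of
# attempts, far short of the 50 needed to reach the obfuscation branch. An
# easier hack (higher p_hack) or cheaper obfuscation (lower threshold) flips
# the outcome.
```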
Of course, there are further considerations, counter-considerations, etc. But I think some people made an update based on this paper that current production settings empirically have properties which are conducive to obfuscated reward hacking. But in fact, this is not an appropriate empirical update to make, and without relevant empirics, we actually have to write down the important properties and think things through conceptually.
“I expect would all make the situation worse and harder to detect.”
“Harder to detect”, sure. But part of my point is that to decide whether the situation is “worse” you need to weigh a bunch of factors, including “the ground-truth rate of reward hacking (independent of our ability to measure it)”, “our ability to detect reward hacking that is happening”, and so on. Training against a CoT monitor might make some of these things better (e.g. the ground-truth reward hacking rate) and some of them worse (e.g. monitorability).
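One crude way to frame that weighing (with entirely made-up numbers, just to show that the two factors can move in opposite directions) is to look at the rate of reward hacking that goes undetected:

```python
# Hypothetical numbers chosen only to illustrate the trade-off, not measurements.

def undetected_hacking_rate(ground_truth_rate: float, detection_prob: float) -> float:
    """Fraction of episodes containing a hack that our monitors fail to catch."""
    return ground_truth_rate * (1 - detection_prob)

# No training pressure on the CoT: more hacking, but most of it is visible.
baseline = undetected_hacking_rate(ground_truth_rate=0.20, detection_prob=0.90)

# Trained against a CoT monitor: less hacking overall, but what remains is
# obfuscated and much harder to catch.
trained = undetected_hacking_rate(ground_truth_rate=0.05, detection_prob=0.20)

print(f"baseline: {baseline:.3f}, after CoT-monitor training: {trained:.3f}")
# 0.020 vs 0.040 here -- whether the situation got "worse" depends on how both
# factors move, which is exactly why they have to be weighed together.
```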