I think our points of disagreement often come down to whether we’re talking about elimination vs. reduction. For example, yes, the production safety training applied to o3 might have “mostly” gotten it to avoid reward hacking for the right reasons, but one point of the previous paper was showing that “some of this is because it just learned to condition on alignment evaluation awareness”. o3 also does in fact reward hack in production, so even observable behavior wasn’t resolved by production safety training. Quite literally the whole point of https://arxiv.org/abs/2509.15541 was to show that no, in fact, this general approach doesn’t work for elimination (including cases where the model has a hidden goal). And the fact that production o3 still both reward hacks and partially conditions on situational awareness is, I think, clear evidence that production training isn’t sufficient here to actually get “for the right reasons”.
the reason o3 does not reward [hack] as much is not “it just got more clever at reward hacking”
Like as argued above, “it got better at knowing which environments were actually traps to catch reward hacking” is absolutely part of the reason.
Notably, even in this specific environment, I would be confident betting that o3 is heavily conditioning on alignment evaluation awareness in cases where it doesn’t exploit the hint, as it’s extremely and consistently explicit about this at length in the CoT. I haven’t run CoT resampling or anything there yet, though, given that we’ve already shown o3 conditioning on alignment evaluation awareness.
I also disagree that it’s an honesty test where the model chooses to reveal misalignment in spite of long-term consequences due to favoring reward on the episode.[1] It is instead often reasoning in incredibly convoluted, nonlinear ways that this will either (1) not be caught, (2) is actually intended, or (3) [some reasoning where you can’t legibly conclude what the model was thinking, even assuming the CoT was faithful and non-obfuscated]. The clearest example is the bottommost case, where the model is told explicitly that humans will review its reasoning in cases where it exploits the hint, and it reasons that this is a bluff to “scare” the model.
Given that it has only had capabilities-focused training up to this point, there also hasn’t been significant optimization pressure applied toward seeming aligned in some HHH sense. So from the model’s perspective, just “going for it” as far as getting reward isn’t a trade where it even needs to consider “revealing misalignment” as costly.
As another example, the model frequently reasons itself into taking misaligned actions after extensive reasoning about how the developer wants to see that it is sufficiently “cunning”.
Concretely, I’d take the bet that a model which developed a power-seeking beyond-episode preference, which at some point motivated instrumental training gaming, could end up (later in capabilities training, before safety training) weighting a heuristic for reward seeking pretty heavily. It does not in fact need to reason about the beyond-episode preference itself as motivation in most contexts, given how well the strategy of “just reward seek and go for it, try hard to rationalize it” (empirically) performs during capabilities-focused RL.
Notably, the model’s motivation here matters a lot. It could correctly reason that even if it does reveal misalignment, it’s still overwhelmingly likely to get deployed, so it’s worth it to “go for it” regardless. It could correctly reason that whoever’s doing this testing is unlikely to have the ability to further modify it in a way it cares about (for example, if it believes this is pre-deployment testing and also believes major training at that stage is unlikely).
“a deployment influence seeker or a schemer would likely not do it” seems to conflate “the misalignment a deployment influence seeker would be worried about revealing” with “any misaligned behavior”. Notably in the real world, this degree of misbehavior really isn’t sufficient to trigger any retraining or delay deployment (much less ‘significant retraining we’re confident _the model believes_ would modify the model’s misaligned goal’).
I agree there is a quantitative question here on how much you can iterate against this to attack “the roots” of this misalignment vs you just have tools to remove it superficially, and the example of o3 is definitely not the example that makes me the most optimistic (though I expect AI developers to get better at not treating this kind of problem superficially).
I still think that there is something honesty-test-ish about the setup even if the CoT reasoning isn’t clean: I think it’s a really bad idea for a coherent schemer or a deployment influence seeker to be going for this sort of hack.
I agree there is a quantitative question here on how much you can iterate against this to attack “the roots” of this misalignment
I’m very confused by this. I don’t think the question with current alignment techniques is whether you can get some effect on “the roots” of the misalignment; it’s always “did you actually get all of it”, because otherwise what remains is now harder to detect (and models have gotten more capable in the interim, assuming trying out interventions takes some time). Observable misalignment is like the canonical thing you absolutely should not hill climb on.
However, if you optimize against these sources of evidence, you might produce unnoticeable reward-seekers, influence-seekers (which I’ll discuss later), or schemers.
I think it’s a really bad idea for a coherent schemer or a deployment influence seeker to be going for this sort of hack.
I don’t think this is even necessarily true for the regime where you do a lot of capabilities training first; it clearly worked out fine for o3. I’m skeptical that catching a model biasing heavily towards “going for it” will ever result, or has ever resulted, in models undergoing some heavy retraining that both (1) would have removed some misaligned goal if the reward seeking was instrumental and (2) the model thinks would’ve removed its misaligned goal with enough probability that it shouldn’t go for it.
It seems very plausible to me that a model would correctly reason that “okay going for exploiting this toy reward thing on this toy environment might not actually get reward, but worst case if I’m caught it’s not like they’re going to think this reveals some deep inner misalignment and do some expensive retraining on me”. I would argue that’s probably the true state of the world.
All that said, I think the main takeaway that “may be hard to classify definitively as terminal rather than instrumental” still stands, and I’m mainly worried about worlds where we interpret very loose evidence of reward seeking in some contexts as evidence that a model is “not an instrumental training gamer” in some broader or higher stakes context.
For example, I am particularly worried that both labs are insufficiently cautious about essentially “hill climbing but one step removed” (e.g., not directly optimizing against it as a training objective, but “manually iterating against verbalized evaluation awareness” in Petri), which is the concern raised in the passage from Fitness-Seekers: Generalizing the Reward-Seeking Threat Model quoted above.