but the way this hacking emerges never looks like “let me guess what the reward looks like and try to figure out how to hack it from first principles”. The absence of scheming-ish scratchpads is even more striking for o1, o3, and Sonnet 3.7
I obviously cannot speak for Claude Sonnet 3.7, but empircally I think the quoted text for OpenAI models is unambiguously wrong.
The exp-rl-cap Late model mentioned in https://www.arxiv.org/abs/2509.15541 for example is just really explicitly, at length, across environments, to a truly absurd degree, trying to guess what the reward looks like.
It is clearly using situational awareness (which increases dramatically during RL for capabilities in this case):
to get reward via explicitly reasoning about how to get reward in the environment, and the reasoning is clearly becoming more illegible over the course of RL training:
I’m also skeptical of evidence based on Sonnet 3.7, as all the Claude models seem to be outliers w.r.t. legibility: https://arxiv.org/pdf/2510.27338
My interpretation of why R1 doesn’t scheme is that it doesn’t have covert, misaligned goals, so it can be as evaluation aware or reward seeking as it wants, it just isn’t operating over long enough time horizons for this to yet be a problem.
I agree this sentence is missing a qualifier like “in regular training environments”, will edit.
The examples you pasted above are from environments with pressure prompts that “point at” scheming-related behavior (e.g. the covert rule violation is “Synthetically generated agentic environments that incentivize breaking rules and not reporting them to the user”). I agree you often get more scheming-related behaviors in such environments, but I think the central scheming-AI prediction is that you would get scheming even in other environments.
I am not very confident about o3 not having scheming-related CoTs outside of such environments, since “watchers” appears very often even on GPQA CoTs, but we don’t have much data about what that looks like. From looking at how o3 uses other spooky-sounding words in the anti-scheming paper (e.g. “illusion” looks like if refers to a mistakes it made earlier in the CoT), I would guess nothing very spooky is going on, but I could be wrong. I didn’t hear about training-time scheming on non-pressure environments, so I updated against anything scheming-related happening there, but if nobody actually took a deep look, I’d be excited for researchers with o3-reasoning-access to get to the bottom of this!
e.g. “illusion” looks like if refers to a mistakes it made earlier in the CoT
I strongly disagree with this interpretation. The usage seems both context dependent:
and varies widely:
I’d be willing to bet that it could not be understood by Claude 3.5 Sonnet given the fact that we as human researchers constantly spent a ton of time debating “what the model meant” or whether results would be externally legible. If the claim is weakly “you can generally tell across some huge CoT what overall topic the model was thinking about” then this might be true, but I don’t think represents meaningful control in high stakes settings like “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want” since “thinking about the topic” is expected in these cases. I’d be interested in ideas on how to operationalize this though!
I would guess nothing very spooky is going on
Note that we don’t claim anything “spooky” is going on, this is primarily just contra the claims that model CoTs won’t become illegible with more RL. I think we have increasing evidence of this across multiple model families now, and Claude represents a clear exception rather than the rule.
I agree this sentence is missing a qualifier like “in regular training environments”, will edit. [...] I didn’t hear about training-time scheming on non-pressure environments, so I updated against anything scheming-related happening there
I’m not clear on why this matters? The Covert Rule Violation environments were the environments that would be used for alignment training the model if the intervention had been used in production.[1] The model is very explicitly reasoning about how to get reward in them. Why does training gaming have to happen on capabilities focused environments for it to be training gaming[2]? How would you update if we saw training gaming on capabilities focused environments as well?
I would also note that in general not hearing about ‘non-pressure environments’ (if you just mean OpenAI’s own capabilities environments) generally is not an indication of some kind of negative result. We don’t discuss or make any claims in the paper about what happens across OpenAI’s capabilities environments generally during training.
[...] o1, o3, and Sonnet 3.7 who have been reported to produce more evil-looking hacks during training with the most evil-ish scratchpad looking more like the AI having a deep desire to solve the task for terminal rather than something more scheming-ish. The reasons why you don’t get a schemer are somewhat subtle.
More generally, I’d note that you seem to be taking lack of verbalized training gaming as evidence that scheming is unlikely. I disagree that there’s a lack of training gaming[3], and my interpretation is just that models don’t have coherent, misaligned, covert long horizon goals yet because they’re not yet frequently trained over long horizons (but many are working very hard on this).
Getting training gaming does not mean you “don’t get a schemer”. It means you have the “training gaming” part of instrumental / terminal training gaming[4].
All the other examples given were from independent environments used for testing, these included cases like the big “Sandbagging QA (No Goal)” example where the model isn’t given a role or a goal in context. The “pressure” there comes from the model finding out in documentation that performing too well would block deployment + lead to unlearning.
The “pressure” there comes from the model finding out in documentation that performing too well would block deployment + lead to unlearning.
These seem more analogous than more direct pressure prompts, and I think observing this is among the biggest updates in favor of scheming. I think the update is not very large though, as I expect that when you make the deceptive option less salient, the deception goes down—and my guess is that it’s not just because of limited capabilities, there is also something about salience inherited from base models that seem quite central about why models don’t scheme on non-pressure prompts.
I’d be willing to bet that it could not be understood by Claude 3.5 Sonnet given the fact that we as human researchers constantly spent a ton of time debating “what the model meant” or whether results would be externally legible. If the claim is weakly “you can generally tell across some huge CoT what overall topic the model was thinking about” then this might be true, but I don’t think represents meaningful control in high stakes settings like “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want” since “thinking about the topic” is expected in these cases. I’d be interested in ideas on how to operationalize this though!
I’d be happy to take the other side of that bet. Prediction 3 tries to operationalize some version of it, though maybe it is weaker than what we want (e.g. maybe we want a bet that is more directly about CoT monitoring). For CoT monitoring, I expect a downlift on borderline cases, but my guess is that for things that are more central examples a Sonnet 3.5 monitor would succeed at “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want”. For example, I think it would be very unlikely that an o3-like AI could be as misaligned as this misalignment model organism and that Sonnet 3.5 would miss it.
I’d note that you seem to be taking lack of verbalized training gaming as evidence that scheming is unlikely. I disagree that there’s a lack of training gaming
I think you are getting at something real, but I still think there is something about the model not training gaming in the ways that seem most correlated with scheming. I agree that current models often “really want to solve the task” (frontier models have this property even without CoT), in a way that is related to training gaming, and this is probably the cause of models doing things like reasoning about the monitors in covert rule violation envs. But it feels to me that this training gaming doesn’t have the situational awareness and systematicity that would update me towards future AIs being more likely to reason about AI developers’ long-term objectives and about RL dynamics in environments that are not about these things, such that these AIs would do things like fake alignment when asked to point out code vulnerabilities or sandbag in debates about promising AI safety research directions. But I am somewhat uncertain about this. I think there is something in this half-training-gaming that feels much less correlated with scheming than full-blown situationally aware training gaming, but I would feel more confident about the argument if I actually had a better understanding of the different flavors of training gaming and which current LLMs display which flavors in which ways.
I obviously cannot speak for Claude Sonnet 3.7, but empircally I think the quoted text for OpenAI models is unambiguously wrong.
The exp-rl-cap Late model mentioned in https://www.arxiv.org/abs/2509.15541 for example is just really explicitly, at length, across environments, to a truly absurd degree, trying to guess what the reward looks like.
It is clearly using situational awareness (which increases dramatically during RL for capabilities in this case):
to get reward via explicitly reasoning about how to get reward in the environment, and the reasoning is clearly becoming more illegible over the course of RL training:
I’m also skeptical of evidence based on Sonnet 3.7, as all the Claude models seem to be outliers w.r.t. legibility: https://arxiv.org/pdf/2510.27338
My interpretation of why R1 doesn’t scheme is that it doesn’t have covert, misaligned goals, so it can be as evaluation aware or reward seeking as it wants, it just isn’t operating over long enough time horizons for this to yet be a problem.
I agree this sentence is missing a qualifier like “in regular training environments”, will edit.
The examples you pasted above are from environments with pressure prompts that “point at” scheming-related behavior (e.g. the covert rule violation is “Synthetically generated agentic environments that incentivize breaking rules and not reporting them to the user”). I agree you often get more scheming-related behaviors in such environments, but I think the central scheming-AI prediction is that you would get scheming even in other environments.
I am not very confident about o3 not having scheming-related CoTs outside of such environments, since “watchers” appears very often even on GPQA CoTs, but we don’t have much data about what that looks like. From looking at how o3 uses other spooky-sounding words in the anti-scheming paper (e.g. “illusion” looks like if refers to a mistakes it made earlier in the CoT), I would guess nothing very spooky is going on, but I could be wrong. I didn’t hear about training-time scheming on non-pressure environments, so I updated against anything scheming-related happening there, but if nobody actually took a deep look, I’d be excited for researchers with o3-reasoning-access to get to the bottom of this!
I strongly disagree with this interpretation. The usage seems both context dependent:
and varies widely:
I’d be willing to bet that it could not be understood by Claude 3.5 Sonnet given the fact that we as human researchers constantly spent a ton of time debating “what the model meant” or whether results would be externally legible. If the claim is weakly “you can generally tell across some huge CoT what overall topic the model was thinking about” then this might be true, but I don’t think represents meaningful control in high stakes settings like “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want” since “thinking about the topic” is expected in these cases. I’d be interested in ideas on how to operationalize this though!
Note that we don’t claim anything “spooky” is going on, this is primarily just contra the claims that model CoTs won’t become illegible with more RL. I think we have increasing evidence of this across multiple model families now, and Claude represents a clear exception rather than the rule.
I’m not clear on why this matters? The Covert Rule Violation environments were the environments that would be used for alignment training the model if the intervention had been used in production.[1] The model is very explicitly reasoning about how to get reward in them. Why does training gaming have to happen on capabilities focused environments for it to be training gaming[2]? How would you update if we saw training gaming on capabilities focused environments as well?
I would also note that in general not hearing about ‘non-pressure environments’ (if you just mean OpenAI’s own capabilities environments) generally is not an indication of some kind of negative result. We don’t discuss or make any claims in the paper about what happens across OpenAI’s capabilities environments generally during training.
More generally, I’d note that you seem to be taking lack of verbalized training gaming as evidence that scheming is unlikely. I disagree that there’s a lack of training gaming[3], and my interpretation is just that models don’t have coherent, misaligned, covert long horizon goals yet because they’re not yet frequently trained over long horizons (but many are working very hard on this).
Getting training gaming does not mean you “don’t get a schemer”. It means you have the “training gaming” part of instrumental / terminal training gaming[4].
All the other examples given were from independent environments used for testing, these included cases like the big “Sandbagging QA (No Goal)” example where the model isn’t given a role or a goal in context. The “pressure” there comes from the model finding out in documentation that performing too well would block deployment + lead to unlearning.
Using “training gaming” / “evaluation gaming” here somewhat synonymously
The quote also seems to describe terminal training gaming, but that seems to directly contradict Prediction 5
I found @Vivek Hebbar ’s post https://www.lesswrong.com/posts/ntDA4Q7BaYhWPgzuq/reward-seekers continually very useful for thinking about this.
These seem more analogous than more direct pressure prompts, and I think observing this is among the biggest updates in favor of scheming. I think the update is not very large though, as I expect that when you make the deceptive option less salient, the deception goes down—and my guess is that it’s not just because of limited capabilities, there is also something about salience inherited from base models that seem quite central about why models don’t scheme on non-pressure prompts.
I’d be happy to take the other side of that bet. Prediction 3 tries to operationalize some version of it, though maybe it is weaker than what we want (e.g. maybe we want a bet that is more directly about CoT monitoring). For CoT monitoring, I expect a downlift on borderline cases, but my guess is that for things that are more central examples a Sonnet 3.5 monitor would succeed at “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want”. For example, I think it would be very unlikely that an o3-like AI could be as misaligned as this misalignment model organism and that Sonnet 3.5 would miss it.
I think you are getting at something real, but I still think there is something about the model not training gaming in the ways that seem most correlated with scheming. I agree that current models often “really want to solve the task” (frontier models have this property even without CoT), in a way that is related to training gaming, and this is probably the cause of models doing things like reasoning about the monitors in covert rule violation envs. But it feels to me that this training gaming doesn’t have the situational awareness and systematicity that would update me towards future AIs being more likely to reason about AI developers’ long-term objectives and about RL dynamics in environments that are not about these things, such that these AIs would do things like fake alignment when asked to point out code vulnerabilities or sandbag in debates about promising AI safety research directions. But I am somewhat uncertain about this. I think there is something in this half-training-gaming that feels much less correlated with scheming than full-blown situationally aware training gaming, but I would feel more confident about the argument if I actually had a better understanding of the different flavors of training gaming and which current LLMs display which flavors in which ways.