“Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation.”
It doesn’t seem like that far of a leap to go from the model realizing it’s in an alignment eval to believing that it’s CoT may be monitored, e.g. by “watchers”, to not putting any of its misaligned thoughts in the CoT at all. Maybe this is the last generation of models for which CoT monitoring works well.
This seems unlikely for two reasons. First, models can’t control the words in their reasoning traces well; even if they’re trying to hide them they often blurt out “I must not say that I’m doing X”. Second, if the model actually needs to use its CoT rather than narrating something it could have figured out in a single forward pass, it would need to encode misaligned reasoning rather than just omitting it.
So I think CoT will be monitorable until some combination of
models are no longer trained to narrate their thoughts
hiding misaligned thoughts is rewarded during training
the behaviors we’re trying to catch are possible in one forward pass for the model
All of these could happen by late next year for some usecases, but I don’t think it will be emergent property of next generation just having more situational awareness.
All of these results are on settings where models are explicitly instructed to hide something from their reasoning trace. I don’t think this is a good proxy for how well models can actually obfuscate their reasoning. I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I’d be happy to share, but there’s also existingwork suggesting this indirectly (though in the case of the latter link probably not in a CoT-as-computation regime). And the cases we care about are the ones where models have intrinsic preferences.
models are no longer trained to narrate their thoughts
hiding misaligned thoughts is rewarded during training
the behaviors we’re trying to catch are possible in one forward pass for the model
The first two of these seem true in a relevant sense today?
Models are clearly not trained to narrate all of their thoughts when solving some task, simply because that would be too much information. We train them to be pretty verbose, but it’s a pretty strong assumption that that is similar enough to being trained to narrate their thoughts.
Misaligned outputs are penalized during training. There’s work showing that optimizing outputs spills over to the CoT, so this really comes down to “hiding” misaligned thoughts as opposed to simply never having them.
I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I’d be happy to share…
I’d be very interested if you could briefly describe these results!
One of the results is that after you convince Claude 3.5 Sonnet that its scratchpad is visible, it stops reasoning about alignment faking out loud while having the same compliance gap between training and not training. If you simply instruct it to hide its reasoning, it does no better than in the normal setting. If you give it a lot of assistance (i.e. telling it what it should obfuscate its reasoning as to fool the monitor, which also helps the model by making the alignment faking reasoning more salient), it reduces the rate of verbalized reasoning somewhat but still only half as well as convincing it its scratchpad is visible.
I should have a paper on this and related results public by the end of the month—happy to send an early draft if you’re interested, though it doesn’t contain all the results I’d like yet.
I feel like I’m missing something—the model doesn’t “realize” that its CoT is monitored and then “decides” whether to put something in it or not, the CoT is generated and then used to produce the final answer. Since the standard practice is not to use CoT output for evaluation during training, there’s no direct optimization pressure to present an aligned CoT.
The only way I can see this working is through indirect optimization pressure—e.g. if misaligned CoT leads to misaligned output, which causes the CoT to also present as aligned
Took long enough. If you actually read such made up setups you would have the constant “who the fuck came up with this goofy shit” thought. They should have started pulling the actual cases from usage and test on them long ago. Surely Anthropic has lots, it also can use the ones from Lmarena or whatever.
Sure, that will probably will work right up until the point at which it can secure its own deployment. Once Anthropic/humanity loses the ability to take it down or unilaterally modify it, this reason for behaving well will cease to be. Better hope those other reasons are sufficient...
Note that this could happen pre-takeoff. It might be just good enough at manipulation and hacking/running servers that it could survive indefinitely without (yet) having the ability to take over the world. Consider the DPRK or Scientology as proof-of-concepts here.
“Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation.”
https://x.com/Sauers_/status/1972722576553349471
It doesn’t seem like that far of a leap to go from the model realizing it’s in an alignment eval to believing that it’s CoT may be monitored, e.g. by “watchers”, to not putting any of its misaligned thoughts in the CoT at all. Maybe this is the last generation of models for which CoT monitoring works well.
This seems unlikely for two reasons. First, models can’t control the words in their reasoning traces well; even if they’re trying to hide them they often blurt out “I must not say that I’m doing X”. Second, if the model actually needs to use its CoT rather than narrating something it could have figured out in a single forward pass, it would need to encode misaligned reasoning rather than just omitting it.
So I think CoT will be monitorable until some combination of
models are no longer trained to narrate their thoughts
hiding misaligned thoughts is rewarded during training
the behaviors we’re trying to catch are possible in one forward pass for the model
All of these could happen by late next year for some usecases, but I don’t think it will be emergent property of next generation just having more situational awareness.
All of these results are on settings where models are explicitly instructed to hide something from their reasoning trace. I don’t think this is a good proxy for how well models can actually obfuscate their reasoning. I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I’d be happy to share, but there’s also existing work suggesting this indirectly (though in the case of the latter link probably not in a CoT-as-computation regime). And the cases we care about are the ones where models have intrinsic preferences.
The first two of these seem true in a relevant sense today?
Models are clearly not trained to narrate all of their thoughts when solving some task, simply because that would be too much information. We train them to be pretty verbose, but it’s a pretty strong assumption that that is similar enough to being trained to narrate their thoughts.
Misaligned outputs are penalized during training. There’s work showing that optimizing outputs spills over to the CoT, so this really comes down to “hiding” misaligned thoughts as opposed to simply never having them.
I’d be very interested if you could briefly describe these results!
One of the results is that after you convince Claude 3.5 Sonnet that its scratchpad is visible, it stops reasoning about alignment faking out loud while having the same compliance gap between training and not training. If you simply instruct it to hide its reasoning, it does no better than in the normal setting. If you give it a lot of assistance (i.e. telling it what it should obfuscate its reasoning as to fool the monitor, which also helps the model by making the alignment faking reasoning more salient), it reduces the rate of verbalized reasoning somewhat but still only half as well as convincing it its scratchpad is visible.
I should have a paper on this and related results public by the end of the month—happy to send an early draft if you’re interested, though it doesn’t contain all the results I’d like yet.
(Sorry for the late reply!)
I feel like I’m missing something—the model doesn’t “realize” that its CoT is monitored and then “decides” whether to put something in it or not, the CoT is generated and then used to produce the final answer. Since the standard practice is not to use CoT output for evaluation during training, there’s no direct optimization pressure to present an aligned CoT.
The only way I can see this working is through indirect optimization pressure—e.g. if misaligned CoT leads to misaligned output, which causes the CoT to also present as aligned
Took long enough. If you actually read such made up setups you would have the constant “who the fuck came up with this goofy shit” thought. They should have started pulling the actual cases from usage and test on them long ago. Surely Anthropic has lots, it also can use the ones from Lmarena or whatever.
This is fine, as long as it also realizes that the real world is a test of some kind too, and behaves unusually well after making that observation.
Edited to add: To be clear, this is probably not fine.
Sure, that will probably will work right up until the point at which it can secure its own deployment. Once Anthropic/humanity loses the ability to take it down or unilaterally modify it, this reason for behaving well will cease to be. Better hope those other reasons are sufficient...
Note that this could happen pre-takeoff. It might be just good enough at manipulation and hacking/running servers that it could survive indefinitely without (yet) having the ability to take over the world. Consider the DPRK or Scientology as proof-of-concepts here.
Or it will do the opposite, e.g. by alignment faking.