I think AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing a better job at this than other companies, I think all of these companies should provide more information about this.
It’s particularly striking that Anthropic says nothing about whether they train against CoT given their system card (for 4.5 Sonnet) is very thorough and includes a section on “Reasoning faithfulness” (kudos to them for providing so much other information!). Naively, this seems like some evidence that they are training against CoT and didn’t want to disclose this.[2]
I think we should try to create an incentive gradient that pushes AI companies to disclose information even if that information makes them look bad, so we should make companies feel some heat for not disclosing (reasonably) important information like how/whether they are training against CoT.
It’s worth noting that some versions of training against CoT might be totally fine (as in, they don’t incentivize obfuscation of misaligned reasoning) and that some particular approach for training against CoT could be worth it even if it does incentivize obfuscating misaligned reasoning. For instance, it could end up being the case that training against CoT works well and we end up with sufficient confidence that this doesn’t result in substantial obfuscation of reasoning. The situation is also just generally complicated because things like “what SFT/prompting initialization do you use for your CoT” could have big effects on obfuscation and aren’t exactly well described as applying optimization pressure to CoT. (But, if you start iterating on the SFT/prompting initialization to reduce misalignment observed in the CoT, this could cause problems.)
Given that the situation is complicated, ideally AI companies would publicly disclose enough information that other actors can come to their own conclusions about how worried they should be about CoT obfuscation.[3] If companies are strongly opposed because this would leak too much IP, they could instead disclose this information to some number of non-conflicted third parties (e.g. METR) who could then publicly share their conclusions.
Source: METR’s evaluation of GPT-5 says that OpenAI asserts “There was no direct training pressure on GPT-5 to hide or obfuscate misaligned reasoning in reasoning traces” and “There’s no particular reason to expect that GPT-5′s reasoning traces are less monitorable than METR would assume.”. It’s of course possible that OpenAI will train against CoT in future models and it’s possible that for GPT-5 they did train against CoT in some way that they think wouldn’t directly apply pressure to hide/obfuscate misaligned reasoning (or they could be lying).
Of course, it is totally possible that Anthropic just didn’t get around to saying anything or just defaulted to secrecy rather than specifically going out of their way to not say anything in this case. I think the prior probability on this happening just because not saying anything is the default (or something like this) is high.
The information that OpenAI has disclosed doesn’t suffice for other actors (that don’t have private information) to come to their own conclusions without trusting relatively general assurances from OpenAI that aren’t backed by very specific claims. Of course, this is still much more information about training against CoT than other companies!
If companies are strongly opposed because this would leak too much IP, they could instead disclose this information to some number of non-conflicted third parties (e.g. METR) who could then publicly share their conclusions.
I worry this is exploitable enough to the point of being actively misleading.
There was no direct training pressure on GPT-5 to hide or obfuscate misaligned reasoning in reasoning traces
Like I just actively disagree with this characterization based on what’s publicly known.
There’s no particular reason to expect that GPT-5’s reasoning traces are less monitorable than METR would assume.
I also have no idea what to take from this. What did METR assume?
For labs, if you have a new product safety technique, it’s a lot of extra work to deeply study the effects on obfuscation. If that technique puts optimization pressure on the CoT, until someone does that extra study you can still claim that you’re not aware of any direct pressure to obfuscate. This is my impression of the current state of Deliberative Alignment for example.
My overall concern is that labs will just post something that uses very careful wording to the effect of:
External researchers interpret it as “they do not apply direct optimization pressure to the cot”
Looks like it isn’t specified again in the Opus 4.5 System card despite Anthropic clarifying this for Haiku 4.5 and Sonnet 4.5. Hopefully this is just a mistake...
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don’t[1].
I think AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing a better job at this than other companies, I think all of these companies should provide more information about this.
It’s particularly striking that Anthropic says nothing about whether they train against CoT given their system card (for 4.5 Sonnet) is very thorough and includes a section on “Reasoning faithfulness” (kudos to them for providing so much other information!). Naively, this seems like some evidence that they are training against CoT and didn’t want to disclose this.[2]
I think we should try to create an incentive gradient that pushes AI companies to disclose information even if that information makes them look bad, so we should make companies feel some heat for not disclosing (reasonably) important information like how/whether they are training against CoT.
It’s worth noting that some versions of training against CoT might be totally fine (as in, they don’t incentivize obfuscation of misaligned reasoning) and that some particular approach for training against CoT could be worth it even if it does incentivize obfuscating misaligned reasoning. For instance, it could end up being the case that training against CoT works well and we end up with sufficient confidence that this doesn’t result in substantial obfuscation of reasoning. The situation is also just generally complicated because things like “what SFT/prompting initialization do you use for your CoT” could have big effects on obfuscation and aren’t exactly well described as applying optimization pressure to CoT. (But, if you start iterating on the SFT/prompting initialization to reduce misalignment observed in the CoT, this could cause problems.)
Given that the situation is complicated, ideally AI companies would publicly disclose enough information that other actors can come to their own conclusions about how worried they should be about CoT obfuscation.[3] If companies are strongly opposed because this would leak too much IP, they could instead disclose this information to some number of non-conflicted third parties (e.g. METR) who could then publicly share their conclusions.
Source: METR’s evaluation of GPT-5 says that OpenAI asserts “There was no direct training pressure on GPT-5 to hide or obfuscate misaligned reasoning in reasoning traces” and “There’s no particular reason to expect that GPT-5′s reasoning traces are less monitorable than METR would assume.”. It’s of course possible that OpenAI will train against CoT in future models and it’s possible that for GPT-5 they did train against CoT in some way that they think wouldn’t directly apply pressure to hide/obfuscate misaligned reasoning (or they could be lying).
Of course, it is totally possible that Anthropic just didn’t get around to saying anything or just defaulted to secrecy rather than specifically going out of their way to not say anything in this case. I think the prior probability on this happening just because not saying anything is the default (or something like this) is high.
The information that OpenAI has disclosed doesn’t suffice for other actors (that don’t have private information) to come to their own conclusions without trusting relatively general assurances from OpenAI that aren’t backed by very specific claims. Of course, this is still much more information about training against CoT than other companies!
Anthropic has now clarified this in their system card for Claude Haiku 4.5:
Thanks to them for doing this!
See also Sam Bowman’s tweet thread about this.
I worry this is exploitable enough to the point of being actively misleading.
Like I just actively disagree with this characterization based on what’s publicly known.
I also have no idea what to take from this. What did METR assume?
For labs, if you have a new product safety technique, it’s a lot of extra work to deeply study the effects on obfuscation. If that technique puts optimization pressure on the CoT, until someone does that extra study you can still claim that you’re not aware of any direct pressure to obfuscate. This is my impression of the current state of Deliberative Alignment for example.
My overall concern is that labs will just post something that uses very careful wording to the effect of:
External researchers interpret it as “they do not apply direct optimization pressure to the cot”
Actually what the lab means is “we don’t literally do the thing from https://arxiv.org/abs/2503.11926 “
Looks like it isn’t specified again in the Opus 4.5 System card despite Anthropic clarifying this for Haiku 4.5 and Sonnet 4.5. Hopefully this is just a mistake...