Are models like Opus 4.6 doing a similar thing to o1/o3 when reasoning?
There was a lot of talk about reasoning models like o1/o3 devolving into uninterpretable gibberish in their chains-of-thought, and that these were fundamentally a different kind of thing than previous LLMs. This was (to my understanding) one of the reasons only a summary of the thinking was available.
But when I use models like Opus 4.5/4.6 with extended thinking, the chains-of-thought (appear to be?) fully reported, and completely legible.
I’ve just realised that I’m not sure what’s going on here. Are models like Opus 4.6 closer to “vanilla” LLMs, or closer to o1/o3? Are they different in harnesses like Claude Code? Someone please enlighten me.
I don’t think those are raw CoTs; they have a summarizer model.
I remember one twitter post with erotic roleplay (“something something slobbering for mommy”??? I don’t remember) where the summarizer model refused to summarize such perversion. Please help me find it?
EDIT: HA! Found it, despite twitter search being horrendous. Fucking twitter, wasted 25 minutes.
https://x.com/cis_female/status/2010128677158445517
While they do have a summarizer model, I’d guess that it isn’t used most of the time. The Claude Opus 4 & Sonnet 4 system card says: “For Claude Sonnet 4 and Claude Opus 4, we have opted to summarize lengthier thought processes using an additional, smaller model. In our experience, only around 5% of thought processes are long enough to trigger this summarization; the vast majority of thought processes are therefore shown in full.” Though the system cards of more recent models don’t specify whether this still applies, the reasoning of recent models usually feels very natural and, if my memory isn’t failing me, is very similar to what Sonnet & Opus 4’s reasoning was like. In contrast, when reading the summaries produced for OpenAI’s and Google’s models, there’s no ambiguity about the fact that the CoTs are summarized.
If I’m correct and the CoTs of Claude models are indeed usually fully exposed, I can see two reasons why they look relatively natural. First, they’re usually much shorter than the CoTs of models like o3, which means that models like Opus 4.6 are indeed in some sense closer to “vanilla” LLMs. Second, Anthropic used to apply some optimization pressure on the CoTs of models from before the 4.5 series and, though they’re not doing that anymore, they trained the recent models on SFT data from earlier ones. Claude 3.7 Sonnet, for which they never used a summarizer, had much more readable CoTs than e.g. o1, so it seems plausible that Anthropic is the only company doing SFT on this kind of legible reasoning data.
Update: After reading Kei’s comment, I noticed that Sonnet 4.5’s system card also mentioned that the summarization only happens in a small minority of cases. This makes me less confident that I’m right, since it increases the likelihood that the absence of discussion in the Opus 4.5 and 4.6 system cards is a deliberate omission. On the other hand, Anthropic was very transparent about their approach to summarization up through Sonnet 4.5, so I’d also be slightly surprised about them silently changing this.
Do you think the slobbering thoughts were lengthy enough to trigger the summarizer?
Probably not, but the summarizer could additionally be triggered by unsafe CoTs.
Anthropic summarizes their CoTs after Claude 4 models:
> With extended thinking enabled, the Messages API for Claude 4 models returns a summary of Claude’s full thinking process. Summarized thinking provides the full intelligence benefits of extended thinking, while preventing misuse.
Source: https://platform.claude.com/docs/en/build-with-claude/extended-thinking
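For concreteness, here’s a rough sketch of what this looks like at the API level, based on Anthropic’s public extended-thinking documentation: thinking is enabled per request, and the response interleaves “thinking” content blocks (which, per the quote above, may be summaries rather than raw CoT) with the final answer. The field names follow the docs, but treat the exact request shape as an assumption if your SDK version differs.

```python
# Separate the returned "thinking" blocks from the model's final answer.
# Content blocks are treated as plain dicts here so the helper works on
# e.g. `block.model_dump()` output from the SDK's typed objects.

def split_thinking(content_blocks):
    """Partition response content into (thinking_text, answer_text)."""
    thinking, answer = [], []
    for block in content_blocks:
        if block.get("type") == "thinking":
            thinking.append(block.get("thinking", ""))
        elif block.get("type") == "text":
            answer.append(block.get("text", ""))
    return "\n".join(thinking), "\n".join(answer)

# The request itself would look roughly like this (needs an API key, so
# it is left commented out; model name and token budgets are examples):
#
# from anthropic import Anthropic
# client = Anthropic()
# response = client.messages.create(
#     model="claude-opus-4-20250514",
#     max_tokens=16000,
#     thinking={"type": "enabled", "budget_tokens": 8000},
#     messages=[{"role": "user", "content": "Why is the sky blue?"}],
# )
# thinking_text, answer_text = split_thinking(
#     [b.model_dump() for b in response.content]
# )
```

Note that nothing in the response itself tells you whether a given thinking block is raw or summarized, which is exactly the ambiguity being discussed in this thread.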
On a side note, though, it seems that frontier models have suddenly solved the issue of alien languages in reasoning? In Apollo’s paper, they mentioned that o3 often uses phrases like “disclaim illusion watchers”, but I haven’t heard of similar issues for their new models (or other labs’ models). I’m very interested to know whether they just decided to train against the CoT to make it more legible, or whether there’s some other method for creating legible CoTs.
I think the pressures towards illegible CoTs have been greatly overstated; the existing illegibilities in CoTs could have come from many things apart from pressure towards condensed or alien languages.
Hmm, but when you use these models in the chat interface, you can literally open up the reasoning tab and watch it be generated in real time. It feels like there isn’t enough time there for that reasoning to have been generated by a summarizer.
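The real-time behavior described above presumably corresponds to the streaming API, where thinking arrives incrementally as `thinking_delta` payloads inside `content_block_delta` events (per Anthropic’s documented streaming format; the exact event names are an assumption if your SDK version differs, and streaming alone doesn’t rule out a summarizer that itself streams). A minimal sketch of accumulating those deltas:

```python
# Accumulate streamed thinking text from raw stream events, treated as
# plain dicts. Only thinking deltas are collected; text deltas (the
# final answer) and other event types are ignored.

def collect_thinking_deltas(events):
    """Join the thinking fragments from a sequence of stream events."""
    pieces = []
    for event in events:
        if event.get("type") == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "thinking_delta":
                pieces.append(delta.get("thinking", ""))
    return "".join(pieces)
```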
My understanding is that while Anthropic no longer directly RLs the CoT to look aligned, they still SFT on rollouts that were subject to that same optimization pressure (this is all based on the Sabotage Risk Report for Opus 4):
> We think it could be intuitive to the model, or trained in via RL, not to verbalize reasoning about misaligned goals, and to thus show very little sign of deceptive motives. This seems especially true if there has been training pressure to hide misaligned reasoning in the CoT, which was the case to a limited extent for these models according to Anthropic’s (nonpublic) responses to our assurance checklist.
As deranged as the o3 chains of thought could often appear on first read, that actually gave me a bit more confidence that they were more likely to be faithful/legible (to a trained monitor at least, i.e. the information was in there). In contrast, across model versions, Claude stands out among models trained with outcome-based RL. For example, from Reasoning Models Sometimes Output Illegible Chains of Thought:
> We find that every reasoning model trained with outcome-based RL except Claude often produces illegible CoTs. R1, R1-Zero, and QwQ report the highest illegibility scores.
One more data point: during the time that the reasoning traces for Gemini 2.5 Pro were publicly available, I noted that they also looked especially structured/legible, though I never ran any quantitative analysis on them, and it’s no longer possible to do so.
In the Claude Opus+Sonnet 4 and Claude Sonnet 4.5 system cards, it was stated that Anthropic usually shares the full reasoning trace, with the exception of a small fraction of prompts where the reasoning trace is too long, after which it is summarized. From what I remember, the reasoning traces of those models usually looked legible.
They’ve removed this language from the recent Opus 4.5 and 4.6 system cards, which makes me think it is now likely summarized a larger fraction of the time or even all the time.