I don’t think those are raw CoTs; they use a summarizer model.
I remember one Twitter post with erotic roleplay (“something something slobbering for mommy”??? I don’t remember) where the summarizer model refused to summarize such perversion. Please help me find it?
EDIT: HA! Found it, despite twitter search being horrendous. Fucking twitter, wasted 25 minutes.
https://x.com/cis_female/status/2010128677158445517
While they do have a summarizer model, I’d guess that it isn’t used most of the time. The Claude Opus 4 & Sonnet 4 system card says: “For Claude Sonnet 4 and Claude Opus 4, we have opted to summarize lengthier thought processes using an additional, smaller model. In our experience, only around 5% of thought processes are long enough to trigger this summarization; the vast majority of thought processes are therefore shown in full.” Though the system cards of more recent models don’t specify whether this still applies, the reasoning of recent models usually feels very natural and, if my memory isn’t failing me, is very similar to what Sonnet & Opus 4’s reasoning was like. In contrast, there’s no such ambiguity with OpenAI’s and Google’s models: when you read the summaries, it’s clear that the CoTs have been summarized.
If I’m correct and the CoTs of Claude models are indeed usually fully exposed, I can see two reasons why they look relatively natural. First, they’re usually much shorter than the CoTs of models like o3, which means that models like Opus 4.6 are indeed in some sense closer to “vanilla” LLMs. Second, Anthropic used to apply some optimization pressure on the CoTs of models from before the 4.5 series and, though they’re not doing that anymore, they trained the recent models on SFT data from earlier ones. Claude 3.7 Sonnet, for which they never used a summarizer, had much more readable CoTs than e.g. o1, so it seems plausible that Anthropic is the only company doing SFT on this kind of legible reasoning data.
Update: After reading Kei’s comment, I noticed that Sonnet 4.5’s system card also mentioned that the summarization only happens in a small minority of cases. This makes me less confident that I’m right, since it increases the likelihood that the absence of discussion in the Opus 4.5 and 4.6 system cards is a deliberate omission. On the other hand, Anthropic was very transparent about their approach to summarization up through Sonnet 4.5, so I’d also be slightly surprised about them silently changing this.
Do you think the slobbering thoughts were lengthy enough to trigger the summarizer?
Probably not, but the summarizer could additionally be triggered by unsafe CoTs.