If Sonnet 4.5 and Haiku 4.5 [edit: and Opus 4.5] were the only major Anthropic reasoning models that didn’t CoT optimization during RL, that makes them kind of a accidental in-the-wild experiment. I wonder what could be learned by comparing their CoTs to those of their successors and predecessors.
It is striking, for instance, that these models had higher verbalized eval awareness rates in automated behavioral audits than other Claude models. (Though obviously it’s not a controlled experiment and I’m not sure how you’d test that this was the cause.)
My guess is that CoT spilllover/leakage has been a problem in all the Anthropic models and I don’t think the training-on-cot before Sonnet 4.5 (and Opus 4.5) is a more important factor than this. Separately, I’d guess there is a bunch of transfer from earlier models if you init on their reasoning traces. So, my gues is we’ve just never had Ant models that aren’t effectively significantly trained on the CoT?
If Sonnet 4.5 and Haiku 4.5 [edit: and Opus 4.5] were the only major Anthropic reasoning models that didn’t CoT optimization during RL, that makes them kind of a accidental in-the-wild experiment. I wonder what could be learned by comparing their CoTs to those of their successors and predecessors.
It is striking, for instance, that these models had higher verbalized eval awareness rates in automated behavioral audits than other Claude models. (Though obviously it’s not a controlled experiment and I’m not sure how you’d test that this was the cause.)
I wonder if their CoTs are less legible?
My guess is that CoT spilllover/leakage has been a problem in all the Anthropic models and I don’t think the training-on-cot before Sonnet 4.5 (and Opus 4.5) is a more important factor than this. Separately, I’d guess there is a bunch of transfer from earlier models if you init on their reasoning traces. So, my gues is we’ve just never had Ant models that aren’t effectively significantly trained on the CoT?