nostalgebraist comments on Thane Ruthenis’s Shortform

nostalgebraist 11 Apr 2026 4:02 UTC
4 points
0
If Sonnet 4.5 and Haiku 4.5 [edit: and Opus 4.5] were the only major Anthropic reasoning models that didn’t CoT optimization during RL, that makes them kind of a accidental in-the-wild experiment. I wonder what could be learned by comparing their CoTs to those of their successors and predecessors.
It is striking, for instance, that these models had higher verbalized eval awareness rates in automated behavioral audits than other Claude models. (Though obviously it’s not a controlled experiment and I’m not sure how you’d test that this was the cause.)
I wonder if their CoTs are less legible?
- ryan_greenblatt 11 Apr 2026 4:05 UTC
  14 points
  9
  Parent
  My guess is that CoT spilllover/leakage has been a problem in all the Anthropic models and I don’t think the training-on-cot before Sonnet 4.5 (and Opus 4.5) is a more important factor than this. Separately, I’d guess there is a bunch of transfer from earlier models if you init on their reasoning traces. So, my gues is we’ve just never had Ant models that aren’t effectively significantly trained on the CoT?
  What links here?
  - Bronson Schoen's comment on The Unintelligibility is Ours: Notes on Chain-of-Thought by 1a3orn (12 Apr 2026 15:32 UTC; 4 points)