I’m curious what Sonnet 4.5’s score would be on your external CoT controllability eval — i.e., what Sonnet 4.5’s blue bar in your Fig. 25 would look like. (I couldn’t find a result like this in your paper, but I might have overlooked something.)
My casual impression has been that external CoT with Sonnet 4.5 is fairly steerable, but this impression is mostly based on “in practice I have no trouble getting Sonnet 4.5 to do the kind of CoT I need” rather than any kind of systematic investigation.
It seems like at least one of the following must be true:
- There are some counterexamples to Ryan’s claim in the original comment (that recent Anthropic models don’t draw much of a distinction between “external” and “non-external” CoT)
    - What would imply this: Sonnet 4.5 gets a high score on your external CoT controllability eval[1]
- Recent Anthropic models, unlike Sonnet 3.7, are not controllable in your sense even when CoT is external
    - What would imply this: Sonnet 4.5 gets a low score on your external CoT controllability eval
And either of these would be interesting if true, so the eval seems worth running.
My guess is that Sonnet 4.5 will get a high score, close to its “output controllability” score. I expect this both because that’s the trend we see in external CoT results for other models, and because the distinction between “external CoT controllability” and “ordinary instruction-following” is fairly blurry. “External CoT” is just a particular case of “some portion of the response that was requested in the instructions,” and it would be strange if instruction-following took a nosedive in this one case[2] while being so robust elsewhere.
Note that Ryan’s claim could still be broadly true even if there are a few known counterexamples.
If you were to find a difference between external and non-external controllability for Sonnet 4.5 on your eval, but meanwhile these two conditions were found to be largely the same in various other cases, this might be cause for concern that your eval is somehow testing an “unusual edge case,” and that situations “in the wild” where we care about controllability might produce qualitatively different behavior.
And one would have to wonder what it is, precisely, that constitutes “this case.” Whether something is or isn’t “CoT” is a fuzzy distinction; arguably, virtually anything that an LLM assistant writes is at least sort of CoT-like, in that the act of composing earlier portions of a response typically helps the model figure out what to say in the later portions.