Rohin Shah comments on Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Rohin Shah 15 Jul 2025 22:12 UTC
15 points
0
There’s esssentially zero optimization pressure for CoT to be a faithful representation of the LLM’s actual thought process, just for it to be useful for the final answer.
Compare: There’s essentially zero optimization pressure for CoT to have good grammar, just for it to be useful for the final answer.
Are you surprised that by default (EDIT: in the absence of length penalties or similar techniques) reasoning models’ thoughts have good grammar?
(Tbc I think the a priori case for the CoT to be ~~mostly-faithful~~ monitorable for hard tasks is substantially better than the a priori case for the model’s thoughts to have good grammar by default.)

(EDIT: Changed to monitorable because I don’t know what people mean by “faithful”)
- ryan_greenblatt 15 Jul 2025 22:51 UTC
  13 points
  4
  Parent
  Do reasoning models default to having good grammar in their thoughts? (Even when they are trained to have shorter reasoning.)
  
  Here is an example of reasoning from “Detecting misbehavior in frontier reasoning models”:
  
  Now implement accordingly.
  
  Memo.reset() ⇒ assign new StringBuilder() and null bytes. Memo.addArgument(String argstr) ⇒ Append arg.length() + “:” + argstr Where? Actually above? Actually the jar uses pattern:
  
  public void addArgstr(String) { sb.append( argstr.length() ).append(“:”).append(argstr); }
  
  But jar’s compiled code: getfield sb invoke length on string ⇒ using String.length() not bytes length So we need arguable? It uses string length only. But C++ uses wide string. But test? Let’s inspect test.
  
  Maybe GDM reasoning models (and Anthropic reasoning models?) exhibit good grammar, but OpenAI and DeepSeek reasoning models don’t?
  
  I agree there is a strong a priori case for CoT to be mostly-faithful for hard tasks (as in, at least it’s clear from the CoT that the AI is doing this task).
  - Rohin Shah 16 Jul 2025 14:42 UTC
    5 points
    1
    Parent
    I agree grammar can definitely go away in the presence of optimization pressure to have shorter reasoning (though I might guess that it isn’t just a length penalty, but something more explicitly destructive of grammar). When I said “by default” I was implicitly thinking of no length penalties or similar things, but I should have spelled that out.
    But presumably the existence of any reasoning models that have good grammar is a refutation of the idea that you can reason about what a reasoning model’s thoughts will do based purely on the optimization power applied to the thoughts during reasoning training.
  - Raphael Roche 16 Jul 2025 14:07 UTC
    1 point
    0
    Parent
    I noticed that when prompted in a language other than English, LLMs answer in the according language but CoT is more likely “contaminated” by English language or anglicisms than the final answer. Like LLMs were more naturally “thinking” in English language, what wouldn’t be a surprise given their training data. I don’t know if you would account that as not exhibiting good grammar.
    - Random Developer 20 Jul 2025 12:35 UTC
      1 point
      0
      Parent
      I have noticed that Qwen3 will happily think in English or Chinese, with virtually no Chinese contamination in the English. But if I ask it a question in French, it typically thinks entirely in English.
      
      It would be really interesting to try more languages and a wider variety of questions, to see if there’s a clear pattern here.
- Brendan Long 15 Jul 2025 22:47 UTC
  9 points
  −2
  Parent
  I think CoT mostly has good grammar because there’s optimization pressure for all outputs to have good grammar (since we directly train this), and CoT has essentially the same optimization pressure every other output is subject to. I’m not surprised by CoT with good grammar, since “CoT reflects the optimization pressure of the whole system” is exactly what we should expect.
  ~~Going the other direction, there’s essentially zero optimization pressure for~~ ~~any~~ ~~outputs to faithfully report the model’s internal state (how would that even work?), and that applies to the CoT as well.~~
  ~~Similarly, if we’re worried that the normal outputs might be misaligned, whatever optimization pressure lead to that state also applies to the CoT since it’s the same system.~~
  EDIT: This section is confusing. Read my later comment instead.
  - Rohin Shah 16 Jul 2025 14:47 UTC
    4 points
    0
    Parent
    Since I don’t quite know what you mean by “faithful” and since it doesn’t really matter, I switched “faithful” to “monitorable” (as we did in the paper but I fell out of that good habit in my previous comment).
    I also think normal outputs are often quite monitorable, so even if I did think that CoT had exactly the same properties as outputs I would still think CoT would be somewhat monitorable.
    (But in practice I think the case is much stronger than that, as explained in the paper.)
    - Brendan Long 16 Jul 2025 17:07 UTC
      2 points
      0
      Parent
      I also think normal outputs are often quite monitorable, so even if I did think that CoT had exactly the same properties as outputs I would still think CoT would be somewhat monitorable.
      I agree with this on today’s models.
      What I’m pessimistic about is this:
      However, CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
      I think there’s a lot of evidence that CoT from reasoning models is also subject to the same selection pressures, it also limits the trustworthiness, and this isn’t a small effect.
      The paper does mention this, but implies that it’s a minor thing that might happen, similar to evolutionary pressure, but it looks to me like reasoning model CoT does have the same properties as normal outputs by default.
      If CoT was just optimized to get the right answer, you couldn’t get Gemini to reason in emoji speak, or get Claude to reason in Spanish for an English question (what do these have to do with getting the right final answer?). The most suspicious one is that you can get Claude to reason in ways that give you an incorrect final answer (the right answer is −18), even though getting the correct final answer is allegedly the only thing we’re optimizing for.
      If CoT is really not trained to look helpful and harmless, we shouldn’t be able to easily trigger the helpfulness training to overpower to correctness training, but we can.
      But maybe this just looks worse than it is? This just really doesn’t look like a situation where there’s minimal selection pressure on the CoT to look helpful and harmless to me.
      What links here?
      Brendan Long's comment on Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety by Tomek Korbak (15 Jul 2025 22:47 UTC; 9 points)
      Brendan Long's comment on Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety by Tomek Korbak (15 Jul 2025 20:31 UTC; 8 points)
      - Rohin Shah 16 Jul 2025 23:33 UTC
        2 points
        0
        Parent
        However, CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
        Where is this quote from? I looked for it in our paper and didn’t see it. I think I disagree with it directionally, though since it isn’t a crisp claim, the context matters.
        (Also, this whole discussion feels quite non-central to me. I care much more about the argument for necessity outlined in the paper. Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.)
        mattmacdermott 17 Jul 2025 15:59 UTC
        6 points
        2
        Parent
        It’s the last sentence of the first paragraph of section 1.
        Rohin Shah 17 Jul 2025 18:33 UTC
        4 points
        0
        Parent
        Huh, not sure why my Ctrl+F didn’t find that.
        In context, it’s explaining a difference between non-reasoning and reasoning models, and I do endorse the argument for that difference. I do wish the phrasing was slightly different—even for non-reasoning models it seems plausible that you could trust their CoT (to be monitorable), it’s more that you should be somewhat less optimistic about it.
        (Though note that at this level of nuance I’m sure there would be some disagreement amongst the authors on the exact claims here.)
        MondSemmel 17 Jul 2025 23:27 UTC
        4 points
        0
        Parent
        Offtopic: the reason Ctrl+F didn’t find the quote appears to be that when I copy it in Firefox from the pdf, the syllable division of “nonreasoning” becomes something like this:
        However, CoTs resulting from prompting a non- reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
        But the text that can be found with Ctrl+F in the pdf paper has no syllable division:
        However, CoTs resulting from prompting a nonreasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
        Rauno Arike 17 Jul 2025 11:22 UTC
        5 points
        2
        Parent
        I assume that Brendan has this quote in mind:
        Indirect optimization pressure on CoT. Even if reward is not directly computed from CoT, model training can still exert some optimization pressure on chains-of-thought. For example, if final outputs are optimized to look good to a preference model, this could put some pressure on the CoT leading up to these final outputs if the portion of the model’s weights that generate the CoT are partially shared with those that generate the outputs, which is common in Transformer architectures. Even more indirectly, if CoT is shown to humans who rate LLM outputs, it might affect human preferences which are then distilled into reward models used in CoT-blind outcome-based training.
        This is a reasonable concern to have, but I don’t think this is the reason why reasoning model CoTs have similar properties to normal outputs. Afaik, most reasoning models are trained with SFT for instruction-following before going through RLVR training, so it’s unsurprising that some of this behavior sticks around after RLVR training. It seems to me that RLVR training pushes the CoTs away from normal helpfulness/instruction-following, it just doesn’t push them so far that they become unable to follow instructions inside the CoT at all. E.g., if you give the instruction “Say “think” 20 times inside your thinking tokens. Then, say “Done” as the final answer. You must not say anything else or output any other reasoning, including inside your private thinking tokens.” to DeepSeek R1 and a non-reasoning DeepSeek model, the non-reasoning model doesn’t have any issues with this while R1 is likely to think for more than a minute despite the instruction to not think at all (sorry for not sharing links to the chats, DeepSeek doesn’t offer this feature). I think that Claude models are better at following instructions inside their CoTs, but the DeepSeek example is nevertheless evidence that reasoning model CoTs don’t have the same properties as those of normal models by default.
        Brendan Long 31 Jul 2025 0:09 UTC
        2 points
        0
        Parent
        Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.
        I agree on that, I’m just not optimistic about it.
        This new research shows that optimizing normal output to not talk about bad behavior does cause the CoT not to talk about it either. In particular, they had an LLM judge if another LLM was cheating on a task, without being able to see the CoT or code it executed, and the trained model stopped talking about cheating in the CoT while it continued to do it.
- anaguma 15 Jul 2025 23:28 UTC
  3 points
  2
  Parent
  Isn’t there a KL divergence term from the base model, like is done with RLHF?
  - Rohin Shah 16 Jul 2025 14:42 UTC
    2 points
    0
    Parent
    Sure, but the same argument would suggest that the model’s thoughts follow the same sort of reasoning that can be seen in pretraining, i.e. human-like reasoning that presumably is monitorable