The question is whether an LLM can even maintain a coherent long-term plan like "deceive the evaluators" just as well without relying on CoT. I suppose it might be able to if it also has some form of internal memory.
I don't think it needs to look like a coherent long-term plan from the model's perspective. Awareness that its CoTs are being monitored, combined with a consistent desire to do something that would look bad to a monitor, could influence the model's outputs without requiring a coherent long-term plan.
Once a model is aware that its CoTs are being monitored, e.g., by learning this during pretraining, I expect CoT monitoring to be less helpful.