I’m very happy to see this happen. I think that we’re in a vastly better position to solve the alignment problem if we can see what our AIs are thinking, and I think that we sorta mostly can right now, but that by default in the future companies will move away from this paradigm into e.g. neuralese/recurrence/vector memory, etc., or simply start training/optimizing the CoTs to look nice. (This is an important sub-plot in AI 2027.) Right now we’ve just created common knowledge of the dangers of doing that, which will hopefully prevent that feared default outcome from occurring, or at least delay it for a while.
All this does is create common knowledge; it doesn’t commit anyone to anything, but it’s a start.
Notably, even just delaying it until we can (safely) automate large parts of AI safety research would both be a very big deal and, intuitively, seem quite tractable to me. E.g. the task time horizons required seem to be (only) ~100 hours for a lot of prosaic AI safety research:
https://x.com/BogdanIonutCir2/status/1948152133674811518
Based on current trends from https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/, that threshold could already be reached sometime between 2027 and 2030.
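For concreteness, here is a hedged back-of-envelope version of that extrapolation in Python. The ~1-hour current horizon and the 4–7 month doubling times are my rough readings of the METR post linked above, not exact figures from it, and the arithmetic is just straight-line trend extrapolation:

```python
from math import log2

# Rough extrapolation of the METR task-time-horizon trend. All numbers below are
# approximations of figures in the linked METR post, not exact values from it.
current_horizon_hours = 1.0    # ~1 hour 50%-success horizon for frontier models, early 2025 (assumed)
target_horizon_hours = 100.0   # horizon needed for much prosaic AI safety research (per the tweet above)

for doubling_time_months in (4, 7):  # faster recent trend vs. longer-run 2019-2024 trend (assumed)
    doublings_needed = log2(target_horizon_hours / current_horizon_hours)
    years_needed = doublings_needed * doubling_time_months / 12
    print(f"doubling every {doubling_time_months} months -> "
          f"~{years_needed:.1f} years, i.e. roughly {2025 + years_needed:.0f}")

# doubling every 4 months -> ~2.2 years, i.e. roughly 2027
# doubling every 7 months -> ~3.9 years, i.e. roughly 2029
```

The spread between ~2027 and ~2029 is mostly driven by which doubling time you assume, which is what produces the 2027–2030 range above.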
And this suggests a ~100x acceleration in research cycles if ideation + implementation were automated and humans were relegated to peer-reviewing AI-published papers:
https://x.com/BogdanIonutCir2/status/1940100507932197217
Is optimizing the CoT to look nice a big concern? There are other ways to show a nice CoT without optimizing for it. The frontrunners also have some incentives not to show the real CoT. Additionally, there is a good chance that people will prefer a nice structured summary of the CoT produced by a small LLM once reasoning traces become very long and convoluted.
Yes, it’s a big concern. For example, over the next year, there might be more and more examples accumulating of AIs scheming or otherwise thinking bad thoughts in the CoT, some fraction of which will lead to bad behavior such as writing hacky code or lying to users. Companies will notice that if you just have a bad-thoughts classifier look over the CoT during training and dole out negative reinforcement whenever it fires, the rates of bad behavior in deployment will go down. So they’ll be under some economic and PR pressure to do that. But this way leads to disaster in the long run, for reasons described here: https://www.planned-obsolescence.org/july-2022-training-game-report/ and also here: https://openai.com/index/chain-of-thought-monitoring/
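To make the worry concrete, here is a minimal sketch of the training-time pattern described above. Every name in it (`shaped_reward`, `task_reward`, `bad_thoughts_classifier`, `PENALTY`) is a hypothetical placeholder of mine, not any lab’s actual training code:

```python
PENALTY = 1.0  # hypothetical weight on the CoT-penalty term


def task_reward(prompt: str, answer: str) -> float:
    """Stand-in for the ordinary outcome reward (tests passing, user ratings, etc.)."""
    return 1.0  # placeholder value


def bad_thoughts_classifier(chain_of_thought: str) -> bool:
    """Stand-in for a monitor that flags scheming/deceptive reasoning in the CoT."""
    return "secretly" in chain_of_thought.lower()  # toy heuristic, not a real monitor


def shaped_reward(prompt: str, chain_of_thought: str, answer: str) -> float:
    """The pattern described above: dock reward whenever the monitor flags the CoT."""
    reward = task_reward(prompt, answer)
    if bad_thoughts_classifier(chain_of_thought):
        reward -= PENALTY  # negative reinforcement for *visible* bad thoughts
    return reward


# The failure mode: optimizing against this signal rewards policies that keep the
# CoT looking clean while pursuing the same behavior, eroding the monitorability
# we wanted to preserve.
```

The issue isn’t the penalty term itself but what it optimizes for: often the cheapest way for the policy to avoid it is to stop writing the bad thoughts down, not to stop acting on them.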