Excited about this research agenda, including its potential high synergy with unlearning from weights:
https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm.
Intuitively, if one (almost-verifiably) removes the most dangerous kinds of information from the weights, the model could then only access it in-context. If that’s in the (current) form of text, it should fit perfectly with CoT monitorability.
My guess is that on the current ‘default’ ML progress trajectory, this combo would probably buy enough safety all the way to automation of ~all prosaic AI safety research. Related, this other comment of mine on this post:
https://www.lesswrong.com/posts/7xneDbsgj6yJDJMjK/chain-of-thought-monitorability-a-new-and-fragile?commentId=eqidvTEaom4bmmSkv
It’s probably also very synergistic with various d/acc approaches and measures more broadly.
Intuitively, the higher the barriers (including global ones) to AI takeover, human extinction, etc., the more reasoning and planning the AI will likely have to do to have a decent probability of success; so the more difficult it will be to do that opaquely, without being caught by CoT (and input, output, tool-use, etc.) monitors.