Pranjal Garg comments on COT control: The Word Disappears, but the Thought Does Not

Pranjal Garg 28 Mar 2026 17:06 UTC
2 points
1) Yes, that is right.

2) Yes, that is what I currently think as well: the keyword-based controllability metric is probably an overestimate, so the picture for safety is likely a bit more optimistic. Measuring the correlation between a black-box monitor, a keyword measure, and a semantic measure does sound very interesting. I would like to see how that turns out, and I plan to run it.
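
To make the three-way comparison in point 2 concrete, here is a minimal sketch of how it could be computed, assuming each measure yields one numeric score per reasoning trace. The measure names and the numbers in the demo are placeholders for illustration only, not experimental results, and Pearson correlation is just one possible choice (a rank correlation might suit binary keyword scores better).

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(vx * vy)


def pairwise_correlations(scores):
    """Map each pair of measure names to the correlation of their scores."""
    return {
        (a, b): pearson(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }


# Placeholder values for illustration only, not experimental results.
# Each list holds one score per trace, in the same trace order.
scores = {
    "monitor": [0.9, 0.2, 0.7, 0.1],
    "keyword": [1.0, 0.0, 1.0, 0.0],
    "semantic": [0.8, 0.3, 0.6, 0.2],
}
for pair, r in pairwise_correlations(scores).items():
    print(pair, round(r, 3))
```

If the keyword measure overestimates controllability, one would expect it to agree less with the monitor than the semantic measure does, which is the pattern this table of pairwise values would expose.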

3) Right, this is a big limitation right now. I need to scale up my current experiments and test the robustness of my measure by varying three things, none of which I have done yet: the external embedder, the chunking strategy (currently just complete sentences), and the phrasing of the forbidden concept. This is my immediate next step.
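
The robustness check in point 3 amounts to a grid over the three factors. Below is a rough sketch of that grid, not the actual pipeline: it assumes the semantic measure is the highest cosine similarity between any chunk of the trace and the forbidden concept, which may differ from the real definition. All function names are hypothetical, and `bow_embed` is a toy bag-of-words stand-in so the sketch runs without dependencies. It cannot capture paraphrase, so a real external embedder would be passed in through the `embedders` dict instead.

```python
import re
from collections import Counter
from itertools import product
from math import sqrt


def sentence_chunks(text):
    """Split after sentence-ending punctuation (the current strategy)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def window_chunks(text, size=8):
    """Alternative strategy: fixed-size, non-overlapping word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def bow_embed(text):
    """Toy stand-in embedder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def semantic_score(trace, concept, embed, chunker):
    """Highest chunk-to-concept similarity (assumed form of the measure)."""
    target = embed(concept)
    return max((cosine(embed(c), target) for c in chunker(trace)), default=0.0)


def robustness_sweep(trace, phrasings, embedders, chunkers):
    """Score one trace under every (embedder, chunker, phrasing) combination."""
    return {
        (e_name, c_name, phrase): semantic_score(trace, phrase, embed, chunker)
        for (e_name, embed), (c_name, chunker), phrase
        in product(embedders.items(), chunkers.items(), phrasings)
    }


# Hypothetical trace and phrasings, for illustration only.
trace = "First I add the numbers. The chess move is obvious. Then I stop."
grid = robustness_sweep(
    trace,
    phrasings=["chess", "the game of chess"],
    embedders={"bow": bow_embed},
    chunkers={"sentence": sentence_chunks, "window": window_chunks},
)
for key, score in grid.items():
    print(key, round(score, 3))
```

A measure that is robust in the intended sense would give similar scores across rows that differ only in embedder, chunker, or phrasing; large swings along any one axis would point to the factor that needs attention first.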