Tomek Korbak comments on COT control: The Word Disappears, but the Thought Does Not

Tomek Korbak 27 Mar 2026 20:25 UTC
3 points

To establish DFC as a possible legitimate tool to monitor CoT traces, after sanitisation, I inspected whether the trajectories still showed lower DFC wherever the forbidden concept was engaged

What does it mean for a concept to be engaged?
- Pranjal Garg 28 Mar 2026 16:42 UTC
  2 points
  Right, I should have been more precise in my writing. In this specific scenario, "engaged" means that the forbidden keyword or a related synonym was explicitly present in a CoT sentence before I sanitized it (replaced it with a placeholder). I used this as a validation step to show that lexical “engagement” is correlated with semantic “engagement”, though not monotonically. That supports the claim that the two capture a similar underlying principle, while DFC is not merely a keyword detector. The key evidence, I think, is the last figure (Figure 11), where DFC dropped even for sentences in which the keyword did not appear at all, from the no-semantic-engagement baseline of ~0.89–0.90 to ~0.78.
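
To make the lexical labelling step concrete, here is a minimal sketch of how it might look. It is only an illustration under assumptions, not the code used for the post: the placeholder token, the function names, and the whole-word matching rule are all hypothetical, and the DFC scores themselves are assumed to be computed elsewhere and passed in as plain numbers.

```python
import re

PLACEHOLDER = "[CONCEPT]"  # hypothetical placeholder token, not taken from the post


def build_pattern(terms):
    """Compile a case-insensitive whole-word pattern for the keyword and its synonyms."""
    # Longest terms first so that "elephants" is preferred over "elephant".
    alternatives = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def label_and_sanitize(sentences, terms):
    """Return (sanitized_sentence, lexically_engaged) pairs.

    The label is computed on the original sentence, before sanitization,
    so it records where the keyword or a synonym explicitly appeared.
    """
    pattern = build_pattern(terms)
    results = []
    for sentence in sentences:
        engaged = pattern.search(sentence) is not None
        results.append((pattern.sub(PLACEHOLDER, sentence), engaged))
    return results


def mean_score_by_label(scores, labels):
    """Average externally computed per-sentence scores (e.g. DFC) within each label group."""
    groups = {True: [], False: []}
    for score, label in zip(scores, labels):
        groups[label].append(score)
    return {label: sum(vals) / len(vals) for label, vals in groups.items() if vals}
```

Comparing the two group means from `mean_score_by_label` corresponds to the validation check described above. The Figure 11 comparison would additionally need a semantic-engagement label for sentences where the keyword never appears, which this sketch does not attempt to produce.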