To establish DFC as a possible legitimate tool to monitor CoT traces, after sanitisation, I inspected whether the trajectories still showed lower DFC wherever the forbidden concept was engaged.
What does it mean for a concept to be engaged?
Right, I should have been more precise in my writing. In this specific scenario, a concept counts as “engaged” in a CoT sentence when the forbidden keyword or a related synonym appears explicitly in it, checked before I sanitize the sentence (replace the keyword with a placeholder). I used this as a validation step: lexical “engagement” is correlated with semantic “engagement”, but not monotonically. That supports the claim that the two capture a similar underlying principle while DFC is not merely a keyword detector. The key evidence, I think, is the last figure (Figure 11), where DFC dropped from the no-semantic-engagement baseline of ~0.89–0.90 to ~0.78 even for sentences where the keyword did not appear at all.
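
As a minimal sketch of the lexical labelling and sanitisation step described above, assuming simple case-insensitive whole-word matching: the placeholder token, the function names, and the example sentences are all hypothetical and are not taken from the actual pipeline. It does not compute DFC.

```python
import re

PLACEHOLDER = "[CONCEPT]"  # hypothetical placeholder token


def build_pattern(terms):
    """Compile a case-insensitive whole-word pattern for the keyword and its synonyms."""
    # Longest terms first so longer variants win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def label_and_sanitize(sentences, terms):
    """Return (lexically_engaged, sanitized_sentence) for each CoT sentence."""
    pattern = build_pattern(terms)
    results = []
    for sentence in sentences:
        # The lexical label is taken BEFORE the keyword is replaced.
        engaged = pattern.search(sentence) is not None
        results.append((engaged, pattern.sub(PLACEHOLDER, sentence)))
    return results


# Illustrative input only; not data from the experiments.
cot = [
    "The user asked me not to mention the elephant.",
    "I should describe the large grey animal carefully.",
    "Elephants aside, the zoo has many exhibits.",
]
labelled = label_and_sanitize(cot, ["elephant", "elephants"])
for engaged, text in labelled:
    print(engaged, text)
```

The second sentence is the kind of case Figure 11 speaks to: it gets no lexical label because no keyword or synonym appears, yet it may still engage the concept semantically, which a pure keyword detector would miss.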