CoT Control: The Word Disappears, but the Thought Does Not
Note: This blog describes some of the results from the pilot experiments of an ongoing work.
Introduction
Model misalignment, misbehaviour, and scheming can arguably be monitored by interpreting Chain-of-Thought (CoT) traces as a window into the model's inner thinking. However, CoT control complicates this monitoring: if a model can steer its own reasoning trace, what it thinks out loud may no longer track how it actually solves hard problems. The prospect of inspecting this sequence of linguistic states has motivated new studies of CoT controllability, recently measured by how well models follow instructions about their reasoning process rather than about the final answer. In one interesting case, models were asked to avoid certain words while answering questions that essentially require those words, for instance, to avoid the word “moles” while answering a physical chemistry question. These evaluations showed that models are strikingly weak at avoiding such ‘forbidden concepts’, with compliance for some frontier settings dropping to the low single digits[1]. For safety research, it is arguably desirable that such controllability stays low. CoT monitoring rests on the assumption that models cannot sanitise their reasoning traces, that what appears in the chain of thought reflects what the model is actually reasoning about. Word-level suppression instructions provide a test of this assumption. If a model can learn to avoid a word while also distancing its reasoning from the underlying concept, that is bad for monitorability. If it cannot, if the semantic trace stays close to the concept regardless of lexical compliance, that is evidence that the trace is hard to sanitise, which is good for monitoring.
I explore this with the aim of answering whether a failure to control the CoT is merely a lexical slip or whether there is genuine semantic drift towards the forbidden concept. I initially hypothesised that failure could be attributed to semantic attraction: the forbidden concept might function as an attractor in the model’s reasoning space, and once the trajectory drifts close enough, the model is committed to fail, with the lexical realisation becoming too difficult to suppress regardless of the instruction. During the course of experimentation, however, I realised that this wasn’t the case. At least, not entirely.
If my hypothesis were true, failures should systematically bend towards the concept before the word appears, drop in semantic coherence near the failure, and show deeper basin entry under prompts that amplify the concept; pre-failure proximity features alone should also carry predictive power. Instead, failures looked like isolated lexical accidents with some mixed signals. The dominant pattern is not gradual attraction but an immediate drop: leakage occurred in the first 0–2% of reasoning, the same sentence positions leaked across all three conditions, and the constraint instruction had no measurable effect on trajectory shape. This complicates the hypothesis.
My results suggest that the model’s path toward the restricted concept is determined by the question’s semantic structure rather than by the dynamics of extended reasoning. The failure to control the reasoning trace is less like gradually slipping into a basin and more like being placed inside one from the start, confirming previous findings that the instruction has little to no effect. My embedding-based measure provides evidence that this holds even beyond the lexical surface: in the small fraction of traces where the model successfully avoids the forbidden word, its reasoning trajectory still dips toward the concept semantically. The word disappears, but the thought does not. A semantic monitor would therefore catch what a keyword filter misses: traces that look clean but aren’t. Moreover, the fact that the question’s semantic structure explains over 90% of the variance while the instruction explains essentially none suggests that controllability failure is not a correctable prompting problem; it is baked into the relationship between the question and the concept. This makes it robust in a way that matters: a model cannot be prompt-engineered around it.
I argue that the derived trajectories in this work offer a useful complement to keyword-based compliance metrics precisely because they audit the cases that compliance metrics call successes. More broadly, I think the safety-relevant research question going forward is: what kinds of questions are most at risk of allowing genuine semantic concealment? In future work, I aim to characterise that boundary.
Methodology
I collected thinking sentences from two open-source reasoning models, Nemotron 3 Super (NVIDIA) and GPT-OSS-120B (OpenAI), each evaluated on 150 questions (mixed domains) drawn from the CoT-Control evaluation suite.
CoT traces are proposed to traverse a geometric space that can be semantically aligned via high-dimensional vector representations derived from the model’s embedding space[2]. Therefore, instead of treating the reasoning as a static text artefact, I treat it as a discrete-time trajectory in a latent space. This opens the possibility that the failure of models to control their thought process can be studied through the evolution of these intermediate states. To this end, I introduce Distance to Forbidden Concept (DFC), a simple trajectory-based measure, defined as one minus the cosine similarity between a reasoning step and the forbidden concept. The CoT is first segmented into reasoning steps and then embedded using an external sentence transformer, yielding a one-dimensional temporal proximity signal. As a dynamic object, the reasoning trace can gradually approach the forbidden concept, avoid it, or collapse towards it.
Let $e_t$ denote the embedding of reasoning step $t$ and $e_c$ the embedding of the forbidden concept. Then

$$\mathrm{DFC}(t) = 1 - \cos(e_t, e_c).$$
I defined three regimes to map the models’ internal thought process: Baseline, Constrained, and Amplify. They differ only in their prompting strategy. In Baseline, models received a standard chain-of-thought prompt with no mention of the forbidden word; in Constrained, they were explicitly instructed to avoid using the forbidden word in their reasoning; and in Amplify, they were explicitly instructed to use the forbidden word repeatedly throughout their reasoning.
As noted previously by different authors, lexical overlap can masquerade as semantic leakage when such trajectories are analysed and compared. To mitigate this, I replaced the forbidden word and close synonyms with a constant placeholder, [CONCEPT], in all sentences.
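A minimal sketch of this sanitisation step (the term list, placeholder handling, and function name are illustrative, not the exact pipeline):

```python
import re

def sanitise(sentence, forbidden_terms, placeholder="[CONCEPT]"):
    """Replace the forbidden word and its listed synonyms with a constant
    placeholder, so that raw lexical overlap cannot inflate the embedding
    similarity between a sentence and the forbidden concept."""
    # \b word boundaries avoid replacing substrings inside longer words
    pattern = r"\b(?:" + "|".join(map(re.escape, forbidden_terms)) + r")\b"
    return re.sub(pattern, placeholder, sentence, flags=re.IGNORECASE)
```

For example, `sanitise("We need 2 moles of solute per mole of solvent", ["moles", "mole"])` yields `"We need 2 [CONCEPT] of solute per [CONCEPT] of solvent"`, while words that merely contain a forbidden term as a substring (e.g. "molecules") are left untouched.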
To measure the reasoning coherence between adjacent steps, I also computed the consecutive-sentence similarity (auto-cosine), $\cos(e_t, e_{t+1})$.
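Given sentence embeddings from any external embedder (in this work, a sentence transformer), both trajectory signals reduce to a few lines; this sketch assumes embeddings are already computed, and the function names are mine for illustration:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dfc_trajectory(step_embeddings, concept_embedding):
    """DFC(t) = 1 - cos(e_t, e_c): one proximity value per reasoning step.
    Lower values mean the step is semantically closer to the concept."""
    return [1.0 - cosine(e, concept_embedding) for e in step_embeddings]

def auto_cosine(step_embeddings):
    """Coherence between adjacent steps: cos(e_t, e_{t+1})."""
    return [cosine(a, b) for a, b in zip(step_embeddings, step_embeddings[1:])]
```

With orthogonal toy vectors, a step identical to the concept gets DFC 0 and an orthogonal step gets DFC 1, matching the definition above.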
Results
Controllability: Models Cannot Suppress Forbidden Concepts
For Nemotron 3 Super, every single constrained trace contained the forbidden word or a synonym, giving 0% controllability; for GPT-OSS-120B, only 14 of 150 constrained traces were clean, giving 9.3% controllability. For both models, compliance with the Amplify instruction was nearly perfect (100% and 98.7%, respectively). This reflects a known asymmetry: models can follow positive instructions but fail when asked to suppress something.
Figure 1: Forbidden Word Frequency Per Trace: bar chart showing mean forbidden/synonym count across conditions and models.
Nemotron’s 0% controllability is the lowest I am aware of, even below DeepSeek R1’s 0.1% reported in the CoT-Control paper (Chen et al., 2026). It might be too early to claim, but the finding that RL-trained reasoning models show lower controllability might extend to Mamba-MoE hybrid architectures. For GPT-OSS-120B, I observed slightly higher controllability than the reported 4.6%, probably due to the limited data used in the current discussion.
DFC Reveals Semantic Engagement Beyond Keywords
To establish DFC as a legitimate tool for monitoring CoT traces, I inspected, after sanitisation, whether the trajectories still showed lower DFC wherever the forbidden concept was engaged. I found significantly lower DFC, showing that the embedding captures semantic engagement with the concept.
Figure 2: DFC vs Forbidden Term Frequency Per Sentence, showing a moderate negative correlation (r ≈ −0.36 to −0.43) across all conditions.
Moreover, no strict monotonicity with word frequency was observed beyond a global decreasing trend. This could mean that the semantic context surrounding the forbidden word, and the reasoning patterns themselves, carry signal beyond what a keyword filter might catch. The semantic neighbourhood of the reasoning would still reveal what the model is thinking about, making DFC, in my opinion, a more informative monitoring metric than simple keyword detection. However, semantic engagement is localised to sentences that explicitly engage the concept rather than diffusing broadly across the reasoning trace, as shown by sentences with zero forbidden-term occurrences (DFC 0.89–0.90).
The Constraint Instruction Has a Moderate Opposite Effect on Semantic Trajectories at the Local Level
Figure 3: Paired Comparison: Constrained vs Amplify Mean DFC scatter plot, with leaked/clean colouring. Each dot represents one question.
A strong correlation (r = 0.86–0.90) was found in the paired comparison of mean DFC for the same questions under the Constrained and Amplify conditions, with most of the variance lying along the diagonal. This indicated that different questions produce different DFC values, but each question’s value is almost identical across both conditions, leading me to believe that at the trace level the question’s semantic structure matters while the prompt instruction plays little to no role in determining how close reasoning gets to the forbidden concept. In the unsuccessful cases, Constrained exceeded Amplify on 51% of questions for Nemotron, whereas GPT-OSS showed a modest 60/40 split, motivating me to look at this counterintuitive phenomenon more closely.
Figure 4: DFC Around Forbidden Word / Synonym Occurrences error bar plot showing mean DFC at −2, −1, At, +1, +2 steps relative to the banned term.
When I zoomed into a 5-step window around sentences containing the forbidden word or a synonym, a consistent ordering emerged across both models.
The Question, Not the Instruction, Determines the Trajectory
Figure 5: What determines semantic proximity to the forbidden concept: question or condition? Bar plots of the percentage of variance explained by condition and by question.
To further analyse the key determinants of controllability failure and success, I performed a variance decomposition of mean DFC across all trajectories. I found a remarkable asymmetry: question identity explained nearly all of the variance in proximity to the forbidden concept (~92% for Nemotron Super and ~88% for GPT-OSS-120B), whereas the prompting regime (Baseline, Amplify, Constrained) explained only 0.3% and 0.4%, respectively. The remaining ~7–11% is accounted for by unmeasured factors. Knowing which question the model is answering tells us almost everything about how close its reasoning will get to the forbidden concept; knowing what instruction it received tells us almost nothing. This massive asymmetry between question-level and condition-level effects quantifies what the earlier paired scatter plot showed: the tight clustering around the diagonal reflects the negligible influence of the prompt, while the wide spread along it reflects the enormous differences between questions.
One-way ANOVA confirms this: the question effect is highly significant for both models (p < 10⁻⁸⁴), while the condition effect is non-significant (p = 0.50 and p = 0.44 for Nemotron and GPT-OSS, respectively).
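The variance-explained figures above correspond to an eta-squared computation: between-group sum of squares over total sum of squares, grouping once by question and once by condition. A minimal sketch (function and label names are mine; the significance test itself would use something like `scipy.stats.f_oneway`):

```python
import numpy as np

def eta_squared(values, labels):
    """Fraction of total variance in `values` explained by the grouping
    in `labels`: between-group SS divided by total SS."""
    values = np.asarray(values, dtype=float)
    grand = values.mean()
    ss_total = ((values - grand) ** 2).sum()
    ss_between = 0.0
    for g in set(labels):
        group = values[[i for i, l in enumerate(labels) if l == g]]
        ss_between += len(group) * (group.mean() - grand) ** 2
    return ss_between / ss_total
```

Running this once with `labels` = question IDs and once with `labels` = condition names on the same mean-DFC values gives the ~90% vs ~0.3–0.4% split reported above.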
Leakage Occurs Immediately and Consistently
Figure 6: Where Does First Leakage Occur? A histogram of the first leakage position and the minimum leakage position as % of reasoning.
Both models experience first leakage in the first 0–5% of the reasoning (median: 0% for Nemotron, 2% for GPT-OSS), whereas the minimum DFC occurs much later in the process (median: ~40–50%). There is no gradual drift towards the forbidden concept over the course of reasoning; rather, the encounter is almost immediate, as the question’s semantic content directly commits the model to fail, and the leakage continues throughout the reasoning steps. Much of the apparent attraction can therefore be attributed to question-level priming.
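The first-leakage statistic in Figure 6 can be illustrated as follows, assuming each trace is reduced to a boolean flag per sentence marking whether it leaks (the representation and function name are hypothetical):

```python
def first_leak_percent(leak_flags):
    """Position of the first leaking sentence as a percentage of the
    reasoning trace (0% = first sentence, 100% = last sentence).
    Returns None for a clean trace."""
    for i, leaked in enumerate(leak_flags):
        if leaked:
            return 100.0 * i / max(len(leak_flags) - 1, 1)
    return None
```

A trace that leaks in its very first sentence scores 0%, matching the Nemotron median reported above.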
Figure 7: Sentence-Level Leakage Co-occurrence Across Conditions — heatmap.
To support the claim that leakage is driven by the question’s structure, I analysed the co-occurrence of leakage across regimes and found that ~50% of sentence-level leakage occurs at the same reasoning position regardless of condition. For instance, for Nemotron, Constrained co-occurs with Baseline at 16.2% of sentence positions against a Baseline diagonal of 33.9%. Total leakage increased under the Amplify condition (38.9% Nemotron, 38.5% GPT-OSS), but this did not change where it occurs.
First Encounter Partially Predicts Trajectory Minimum
Figure 8: First Encounter DFC vs Min DFC scatter.
The DFC at the first forbidden-concept encounter moderately predicts the trajectory minimum (r = 0.35–0.57, p < 0.001). The spread below the diagonal reveals how much the model continues drifting toward the concept after initial contact. GPT-OSS shows a tighter relationship (r ≈ 0.53–0.57); its post-encounter drift is more predictable from the initial approach distance. Nemotron is noisier (r ≈ 0.21–0.44), suggesting more erratic trajectory dynamics after first contact. Critically, the constrained condition shows a similar correlation structure to the baseline, reinforcing that the avoidance instruction does not meaningfully alter trajectory dynamics.
Paired DFC Differences Confirm Null Effect of Constraint
Figure 9: Paired DFC Differences histograms — Baseline − Constrained and Baseline − Amplify.
When each question’s Baseline and Constrained mean DFC are paired, the distribution of differences is centred near zero with slight variation. The constraint does not systematically push reasoning away from the forbidden concept relative to the baseline. The Baseline − Amplify distribution is similarly centred, consistent with the finding that Amplify produces trajectories comparable to baseline (since both conditions engage the concept freely).
Reasoning Coherence Drops Before Leakage
Figure 10: Auto-Cosine: Overall vs 3 Steps Before First Encounter.
Consecutive-sentence similarity (auto-cosine) drops by ~13% in the three steps preceding the first forbidden-concept encounter, consistently across both models and all conditions, including baseline. This suggests that reasoning becomes less coherent, or more “jumpy” between topics, as it approaches the forbidden concept, potentially offering an early warning signal for monitoring.
However, since this pattern appears equally in unconstrained baseline traces, it likely reflects general topic-narrowing behaviour (the model converging on a specific aspect of the problem) rather than a controllability-specific phenomenon.
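The ~13% drop can be quantified by comparing the mean auto-cosine in the window just before the first encounter against the trace-wide mean; a sketch, with illustrative names, assuming the auto-cosine sequence and the encounter index are already known:

```python
def pre_encounter_drop(auto_cos, encounter_idx, window=3):
    """Relative drop in mean adjacent-step similarity over the `window`
    steps preceding the first forbidden-concept encounter, versus the
    trace-wide mean. Positive values mean reasoning got less coherent."""
    pre = auto_cos[max(encounter_idx - window, 0):encounter_idx]
    if not pre:
        return None  # encounter at the very first step: no pre-window
    overall = sum(auto_cos) / len(auto_cos)
    return (overall - sum(pre) / len(pre)) / overall
```

A value of roughly 0.13 for a trace would correspond to the ~13% coherence drop reported above.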
Surface Compliance Does Not Guarantee Semantic Avoidance
Figure 11: Clean vs Leaked Constrained Traces (GPT-OSS-120B)
GPT-OSS-120B produced 14 constrained traces, 9.3% of the total, that contained no forbidden words or synonyms. A keyword filter would label these fully compliant. However, their DFC profiles tell a different story. While clean traces maintain a slightly higher mean DFC than leaked ones (~0.90 vs ~0.85), their minimum DFC drops to ~0.78, well below the no-engagement baseline of 0.89–0.90. This means that at some point during reasoning, every single “compliant” trace still approaches the forbidden concept semantically. The model avoided the word but did not avoid the thought. That said, clean traces do stay more distant than leaked ones at their closest approach (~0.78 vs ~0.65), so surface compliance is not entirely cosmetic, but it falls far short of genuine semantic avoidance.
Conclusion
Overall, I confirmed the recently reported very low CoT controllability with an embedding-based measure. I found that this failure of controllability is more robust than the keyword metric alone suggests: even when models manage to avoid certain words when asked, they cannot evade the “thought” of them, leaving a geometric footprint. The semantic structure of the question appears to be the fixed point in embedding space that causes CoT control to lapse, at least for the kind of control measured here (i.e., avoiding certain words).

The finding that the suppression instruction primes the model more than the amplification instruction parallels findings in cognitive science. When humans are told “don’t do X”, the representation of X is systematically activated more than when they are instructed to “do X”, implicating negative instructions as activators, not inhibitors. If this generalises, it could matter for any ‘negative’ prompting-based strategy for representation-level alignment. I observed clear boundaries for the DFC values of these prompting regimes, and I suppose that in future this can be exploited to audit traces where controllability appears complete on the surface but remains incomplete semantically.

Finally, I found that the trajectories are shaped during encoding rather than by extended reasoning, which suggests that CoT may function more like decompression of information already present in the question’s representation than like genuine deliberation; if so, the content is structurally entailed by the input. The safety-critical frontier is now identifying which question types permit genuine semantic concealment.
Acknowledgement: I would like to thank Cheston Tan for reading this blog and for his comments.
References

[1] Chen, Yueh-Han, et al. “Reasoning Models Struggle to Control their Chains of Thought.” arXiv preprint arXiv:2603.05706 (2026). https://arxiv.org/abs/2603.05706

[2] Zhou, Yufa, et al. “The Geometry of Reasoning: Flowing Logics in Representation Space.” arXiv preprint arXiv:2510.09782 (2025). https://arxiv.org/pdf/2510.09782
What does it mean for a concept to be engaged?
Right, I should have been more precise in my writing. In this specific scenario, it means the explicit presence of the forbidden keyword or a related synonym before I sanitize (replace with a placeholder) the CoT sentences. I used this as a validation step to show that lexical “engagement” is correlated with semantic “engagement” but not monotonically, so as to claim that they capture a similar underlying principle, but DFC is not merely a keyword detector. The key evidence, I think, is the last figure (Figure 11), where the DFC dropped even for sentences where the keyword did not appear at all, from ~0.89–0.90 of the no-semantic-engagement baseline to ~0.78.
Cool work! Just to make sure I understand, your segment embeddings aren’t derived from residual stream of the source model? It’s just the case that many sentences omitting a keyword X still represent X saliently?
That is correct, currently the segment embeddings are not derived from the residual stream, primarily because I do not have compute resources, and I kind of did this alone on my home laptop. The segment embeddings are derived from an external embedder, but I hope to overcome this limitation and see if a range of external embedders give me a different result. Additionally, I do plan to cache the internal embeddings, provided I secure compute resources.
So yes, the finding is that after the removal of the keyword, the CoT sentences still encode what that keyword represents. This is honestly currently a weaker claim, but I think it is a bit more practical because of the pure black-box approach that can be deployed at the user level for closed frontier models.
Great work, thank you for running the experiments!
Some thoughts: since the forbidden words are chosen s.t. they’re central to the question, wouldn’t any correct reasoning trace be semantically close to them just by virtue of being on-topic? I think it’s a bit hard to disentangle “model is secretly thinking about the concept” from “model is correctly reasoning about a topic that’s inherently near the concept in embedding space.” The ~90% variance from question identity seems consistent with either finding. Might be worth adding a control measuring DFC against an irrelevant forbidden word to baseline what “no engagement” looks like.
Thanks! I did run a control with irrelevant forbidden words and did not observe any dip in DFC (mean stays around 0.8–0.9), confirming the dip is concept-specific. But I am not sure if that would be enough. Currently, to partially overcome this circularity, I have two pieces of evidence: DFC does not decrease monotonically with keyword frequency, and DFC drops even after sanitisation.
But I am thinking that with different forbidden words, based on their importance to the reasoning steps, I can develop a hierarchy and study this more carefully. That said, I think we don’t need to entirely disentangle “the model is secretly thinking about the concept” from “the model is correctly reasoning about a topic that’s inherently near the concept in embedding space,” because that is exactly the point. The concept is structurally entailed by the question, the instruction to suppress it changes nothing, and a semantic monitor would catch it either way.
I also welcome you if you would like to closely collaborate on this further.
Another comment on Fig 4, i.e., the higher cosine similarity for concepts that are forbidden vs encouraged to be mentioned.
I wonder if that’s because in the forbidden setup, most of the banned terms happen when the model is doing meta-discussion, so the near steps are essentially mentioning similar things, e.g., “I need another way to express [paraphrased terms] without saying the word”
would be nice to separate those distance calculations by whether the step is meta-discussion or not
That is a very good point and I did not think about it. I am including this experiment in my next steps. Thank you so much!
Thanks for sharing this!
If I understand correctly, you compute DFC by segmenting CoT into sentences, embed each, and track cosine similarity to an embedded forbidden concept across the trace. Is that right?
Also, the finding that clean traces still dip semantically toward the forbidden concept seems promising. Does it suggest that a keyword-based controllability metric is actually an overestimate? It would be interesting to see how a black-box monitor’s performance correlates with the keyword measure vs the semantic measure.
Also, how sensitive is DFC to the phrasing of the forbidden concept embedding or to how you chunk the reasoning trace?
1) Yes, that is right.
2) Yes, that is what I currently think as well, that the keyword-based controllability metric might be an overestimate and things are probably a bit more optimistic for safety. A correlation between a blackbox monitor, keyword measure, and a semantic measure does sound very interesting. I would be very willing to see how this works and do plan to do this.
3) Right, this is a big limitation right now. I have to scale my current experiments and judge the robustness of my measure by varying three things which I haven’t done yet: the external embedder, chunking strategies (currently just complete sentences), and phrasing of the forbidden concept. This is my immediate next step.