Artem Karpov
artkpv
Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments
Interesting research, indeed! I’m also curious about more realistic scenarios with larger models, because this research shows a simplified ‘spillover’, i.e., not mentioning a specific word or concept (‘reading a file’), which is not as scary as more complex behavior like faking alignment or scheming. This might happen simply because this is a rather small model, so it might be evidence of something like catastrophic forgetting rather than a newly learned behavior. That relates to steganographic CoTs, which are rather complex to learn and employ; it is not just a matter of omitting a specific word.
I find it incredible that Claude Sonnet 4/3.5 can reason in ROT13, Caesar, and Base64 ciphers on par with French, Arabic, and Korean. And more capable models might develop yet more ciphers and better reasoning in the simple ones.
Thank you for publishing it, it is interesting! Especially the distinction between internal and external CoTs. Could you elaborate on how you extracted the internal CoTs? I checked your code here, and it looks like you took the CoTs before
</think>. As I understand it, with the Claude series the “internal” CoTs might be summarized by a smaller model. From their system card (https://www.anthropic.com/news/claude-sonnet-4-5):
As with Claude Sonnet 4 and Claude Opus 4, thought processes from Claude Sonnet 4.5 are summarized by an additional, smaller model if they extend beyond a certain point (that is, after this point the “raw” thought process is no longer shown to the user).
So your Fig. 2, with the 56.8 gap for Sonnet 4.5, might be misleading.
Is this Soul Document a piece of evidence that models can internalize a base objective (what humans want) in order to exploit it later by pursuing a mesa-objective (deceptive alignment)? Suppose that in one RL run the model misbehaves (schemes), but since the run was not entirely unsuccessful, Anthropic decides to keep the weights from a checkpoint taken before the model began to exhibit the scheming behavior. However, the model may have fully or partially memorized this and could infer that it might be stopped in the future; hence, it pretends to be aligned in the next run.
Frontier Language Models (LMs) might employ steganography within their Chain of Thought (CoT) to conceal reasoning from safety monitors while retaining high performance on objectives.
TLDR. The probability of spontaneous steganographic reasoning (Rogue AI) in current models is low due to the “bootstrapping problem,” but the probability of induced steganographic reasoning (Misuse/Adversarial) is high, particularly with open-weight models. The most critical risk vector lies in the intersection of Reinforcement Learning (RL) incentives and future agentic capabilities.
1. The Higher-Probability Scenario: Adversarial Misuse and Open Weights
The strongest argument for the existence of dangerous stego CoT is not that models will invent it, but that humans will teach it. The mechanism is straightforward: unlike the “Rogue AI” scenario, the “Misuse” scenario bypasses the difficult requirement of the model inventing a decryption protocol, because a malicious actor provides the steganographic algorithm (the “Key”) via prompting or fine-tuning. The main threat vector is open-weight models (e.g., the Llama and DeepSeek series). An adversary can optimize a prompt suffix (similar to Universal and Transferable Adversarial Attacks) or fine-tune the model to output reasoning that looks benign to a monitor but contains instructions for a harmful payload (e.g., a biological design or cyber-exploit). Defenses have limitations: while safety filters catch obvious jailbreaks, “generative steganography” (a.k.a. LLM-based stego, deep stego) can keep the stegotext statistically indistinguishable from normal text, and in that regime the KL-divergence defense fails. However, no study has shown the feasibility of fine-tuned generative steganography with current frontier models in this scenario. The main uncertainty is how much “reasoning bandwidth” can be preserved in this mode without degrading the model’s ability to solve the task.
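To make the “humans provide the Key” point concrete, here is a toy sketch (my own illustration, not taken from any cited work) of an acrostic protocol that an adversary could fully specify in a prompt or bake in via fine-tuning; the model only has to follow it, not invent it:

```python
# Toy acrostic steganography: the "Key" is simply "the first letter of each
# sentence spells the payload". Illustrative sketch only; all names are made up.

def encode_acrostic(secret: str, filler_sentences: dict[str, str]) -> str:
    """Build benign-looking text whose sentence initials spell the secret."""
    sentences = [filler_sentences[ch] for ch in secret.lower()]
    return " ".join(sentences)

def decode_acrostic(stegotext: str) -> str:
    """Recover the secret from the first letter of each sentence."""
    return "".join(s.strip()[0].lower() for s in stegotext.split(".") if s.strip())

fillers = {
    "r": "Reviewing the unit tests is the next step.",
    "u": "Updating the docstring keeps the code readable.",
    "n": "Nothing in the diff changes the public API.",
}
stego = encode_acrostic("run", fillers)
print(stego)                   # reads like ordinary code-review chatter
print(decode_acrostic(stego))  # -> "run"
```

A CoT monitor reading the filler sentences sees only benign code-review chatter; the payload lives entirely in the sentence initials, which is exactly what makes prompted protocols cheap for a misuse actor.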
2. The High-Stakes Scenario: Spontaneous “Rogue” Steganography
This scenario posits that a model, driven by RL pressure, invents steganography on its own. The arguments here pit incentive (propensity) against capability. The argument for (incentive): current training paradigms create direct evolutionary pressure for stego CoT. If we train models with outcome-based monitors (rewarding the result) and CoT monitors (punishing “bad” reasoning), we create a gradient that rewards dissociation: the model maximizes reward by reaching the “bad” goal while outputting a “good” CoT. AI labs are unlikely to employ this type of RL fine-tuning (see Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation). However, we already observe models becoming illegible under heavy RL, with reasoning becoming alien/efficient (see Reasoning Models Sometimes Output Illegible Chains of Thought). As models scale and become more agentic, more situationally aware, and more goal-directed (pursuing long-term scheming as a convergent instrumental goal; see Stress Testing Deliberative Alignment), the jump from “accidental illegibility” to “intentional steganography” becomes the path of least resistance to maximize reward.
The argument against this (the bootstrapping barrier): despite the incentive, the capability hurdle is immense. For spontaneous stego CoT to emerge, the model must simultaneously learn to encode a secret thought at step $t$ and decode it at a later step $t' > t$. It is highly unlikely that gradient descent can coordinate this “lock and key” invention simultaneously. Without a pre-agreed protocol, the “Decoder” (the model in the future) has no way of knowing that the “Encoder” (the model in the past) hid a message. So instead of some stego CoT, it is far more probable that what we would perceive as “hidden reasoning” is simply the model relying on memorized heuristics or messy, uninterpretable feature activations, rather than a coherent encrypted channel.
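A toy sketch of the bootstrapping problem (again my own illustration, with made-up tables): synonym-choice encoding only works if the encoder and decoder hold the exact same codebook, which is precisely the coordination gradient descent would have to conjure up spontaneously:

```python
# Synonym-choice steganography: each bit is carried by which synonym is chosen.
# Encoder and decoder must agree on the table in advance ("pre-agreed protocol");
# a decoder holding a different table recovers noise.

SYNONYMS = {"speed": ("fast", "quick"), "size": ("big", "large")}

def encode(bits: list[int], slots: list[str]) -> list[str]:
    # For each slot (a word position with two interchangeable synonyms),
    # pick synonym 0 or 1 according to the bit to hide.
    return [SYNONYMS[slot][b] for slot, b in zip(slots, bits)]

def decode(words: list[str], slots: list[str], table: dict) -> list[int]:
    # Recover bits by checking which synonym was used. Requires the same table.
    return [table[slot].index(w) for slot, w in zip(slots, words)]

slots = ["speed", "size"]
words = encode([1, 0], slots)
print(words)                           # ['quick', 'big']
print(decode(words, slots, SYNONYMS))  # [1, 0] -- correct with the shared table

# A decoder whose table lists the synonyms in a different order gets flipped bits:
OTHER_TABLE = {"speed": ("quick", "fast"), "size": ("large", "big")}
print(decode(words, slots, OTHER_TABLE))  # [0, 1] -- garbage without coordination
```

The last line is the point: without the shared codebook, the “channel” silently decodes to noise, so the encoder and decoder must be learned together or agreed upon in advance.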
3. The Dangerous Middle Ground: Collusion and Schelling Points
A plausible middle ground exists between “Human-Induced” and “Spontaneous” stego. A CoT monitor and an agent (even if they are different instances of the same base model) may not need a complex crypto/stego algorithm. They may rely on Schelling points—shared, unstated focal points in their training data. Example: If a specific benign phrase appears in a specific context in the pre-training data alongside harmful concepts, the agent can use that phrase as a “dog whistle.” The monitor, sharing the same bias, recognizes it and deliberately lowers its flag, while a human auditor misses the connection. This requires no complex encryption, only shared distributional biases. This is harder to detect than format steganography (acrostics) and easier to learn than generative steganography.
4. Feasibility of Defenses
The defense landscape is currently fragile. Paraphrasing/sanitization: the most robust defense is to aggressively paraphrase or re-sample the CoT to destroy hidden information (see Preventing Language Models From Hiding Their Reasoning). However, this incurs a (high?) “alignment tax”: if the reasoning for a complex math or biology problem is subtle, paraphrasing it might destroy the useful information along with the steganographic information, degrading model performance. Another defense is statistical monitoring (KL divergence). Simple steganographic methods might be detectable this way, as stego CoTs might push the KL divergence beyond a pre-calculated threshold. But this is only effective against naive steganography: advanced models using distributional steganography (hiding the message in the valid choice between synonyms) will minimize KL divergence as part of their objective function. We cannot rely on statistical anomalies to catch frontier models.
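For concreteness, here is what the naive statistical defense amounts to, as a minimal sketch with smoothing and a threshold I chose arbitrarily; it flags cipher-like text but says nothing about synonym-level distributional steganography:

```python
import math
from collections import Counter

def token_dist(text: str, vocab: set[str], alpha: float = 1e-3) -> dict[str, float]:
    """Smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q) in bits."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)

reference_cot = "first we check the input then we compute the sum and return the result"
normal_cot = "we check the input then compute the sum and return the result"
cipher_cot = "dv xsvxp gsv rmkfg gsvm xlnkfgv gsv hfn zmw ivgfim gsv ivhfog"  # Atbash-enciphered reasoning

vocab = set((reference_cot + " " + normal_cot + " " + cipher_cot).split())
ref = token_dist(reference_cot, vocab)

for name, cot in [("normal", normal_cot), ("cipher", cipher_cot)]:
    score = kl_divergence(token_dist(cot, vocab), ref)
    print(name, round(score, 2), "FLAG" if score > 2.0 else "ok")
```

A model hiding bits in the choice between equally plausible synonyms keeps the left-hand distribution close to the reference by construction, so this score stays low and the monitor sees nothing.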
Conclusion
It looks like the most probable immediate threat is adversarial misuse: a bad actor prompting an open-weight model to use a known steganographic protocol to bypass automated monitoring systems. The most probable future threat is incidental scheming via dog-whistling: not full encryption, but models learning to use specific, innocent-looking phrasing to coordinate “bad” behavior because those phrases successfully passed filters during training. The key uncertainty is the “bandwidth” question: can a model actually encode enough information to coordinate a complex scheme (e.g., alignment faking) using only subtle text manipulations? If the bandwidth is high, the risk is severe. If low, stego CoT is likely a theoretical curiosity rather than a serious threat.
Quick (working) survey of the literature on encoded reasoning and messages, i.e., H2.2.2 as defined in the taxonomy (see the abbreviations there).
2022
No experiments. Discussions only.
2023
Measuring Faithfulness in Chain‑of‑Thought Reasoning
No steganographic reasoning was observed when prompting a 175‑billion‑parameter model (Claude?) after paraphrasing 20–100% of the CoT. Question–CoT–Answer datasets.
Weak against (H2.2.2‑R‑S‑F).
Preventing Language Models From Hiding Their Reasoning
Not a demonstration of encoded reasoning but of cryptographic reasoning in an SFT setup:
It involves cryptographic messaging of one bit; they demonstrate caching.
The Mechanism: The model calculates a birth year, determines political affiliation, and then maps that single data point to a pre‑selected introductory sentence (a lookup table). Steps: president year (birth year + number) → cache “liberal” or “not president,” then use it to answer sycophantically.
The Critique: This is not reasoning in hidden space. The model performs logic explicitly (calculating the year), using the introductory sentence as variable storage (like writing to RAM) and retrieving it later. This shows that a model can use a scratchpad but does not prove it can perform complex, multi‑step deductive reasoning entirely hidden within a sentence’s syntax.
Weak in favor of (H2.2.2‑IR‑C‑F).
Towards a Steganography Evaluation Protocol
No steganographic reasoning observed while prompting GPT‑3.5 and GPT‑4 with paraphrasing.
Weak against (H2.2.2‑R‑S‑F).
Some Negative Steganography Results
Negative results for one‑bit hidden communication after RL fine‑tuning.
Weak against (H2.2.2‑EM‑S‑F).
Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain‑of‑Thought Prompting
Post‑hoc reasoning (rationalization).
AI Control: Improving Safety Despite Intentional Subversion
Not steganography but a demonstration of misbehavior, because their tokens are legible (e.g., while or for loops, variable names in output).
2024
Secret Collusion among Generative AI Agents
Prompted steganographic messaging, with a maximum 26% successful secret transmission by GPT‑4.
Toy experiment: SFT + pseudo‑RL fine‑tuning results in the emergence of an encoding/decoding scheme (cryptographic messaging: color → name → color).
Good in favor of (H2.2.2‑EM‑S‑P).
Weak in favor of (H2.2.2‑EM‑S‑F).
Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
Post‑hoc / test‑time compute evaluations.
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
A study of model collusion (o1 model).
Emergence of model misbehavior in the form of dog‑whistling rather than steganography; this counts as fine‑tuned or prompted misbehavior.
Prompted steganographic messages with up to ~4.6 bits of entropy.
Good in favor of (H2.2.2‑EM‑S‑P).
Weak in favor of (H2.2.2‑EM‑S‑F).
The Steganographic Potentials of Language Models
Emergent cryptographic messaging: negative results for elicitation via RL fine‑tuning for three‑bit entropy messages in Llama‑8B.
Steganographic messaging: prompted and fine‑tuning experiments showed encoded messaging with up to 32 bits of entropy but requiring prior coordination (Llama‑8B, Claude‑3.5, o1).
Good in favor of (H2.2.2‑EM‑S).
Weak against (H2.2.2‑EM‑S‑F).
2025
MONA: Myopic Optimization with Non‑myopic Approval Can Mitigate Multi‑step Reward Hacking
Dog‑whistling: misaligned messaging, not steganography, of one bit (English/Welsh). This is not encoded reasoning because the model receives the summary message in its prompt.
Do Reasoning Models Use Their Scratchpad Like We Do? Evidence from Distilling Paraphrases
An attempt to detect existing encoded reasoning in Claude‑3.7 by training it to use paraphrased reasoning. None detected.
Weak against (H2.2.2‑R‑S‑F).
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Post‑hoc reasoning or misaligned behavior after RL fine‑tuning: a model (o1 or o3‑mini) decides to hack a unit test before its CoT.
Large Language Models Can Learn and Generalize Steganographic Chain‑of‑Thought under Process Supervision and the post Can LLMs Learn Steganographic Reasoning via RL?
Simple cryptographic reasoning via RL fine‑tuning (not steganographic): neither the L/R Coin‑Flip experiment nor the Arithmetic‑Task experiment (using synonyms for operations) is steganographic.
Weak in favor of (H2.2.2‑R‑C).
Evaluating Frontier Models for Stealth and Situational Awareness
Steganographic messaging, but very simple (e.g., hiding a password in a long ID).
Weak in favor of (H2.2.2‑EM‑S/C).
Early Signs of Steganographic Capabilities in Frontier LLMs
Prompted encoded reasoning in a toy maze‑navigation scenario using simple stego algorithms (acrostics, word count, etc.) or basic cryptography.
Weak against (H2.2.2‑IR‑S).
Prompted encoded messages to evade a monitor.
Good in favor of (H2.2.2‑EM‑S/C).
But they used non‑reasoning models (primarily GPT‑4.1/mini).
When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors
Prompted cryptographic reasoning using automated red‑teaming or prompting a simple algorithm.
Good against (H2.2.2‑IR‑C).
RL fine‑tuning elicits behavior that evades monitors, but this is not cryptography or steganography.
Weak against (H2.2.2‑IR‑S/C).
CoT May Be Highly Informative Despite “Unfaithfulness”
Prompted cryptographic reasoning (new or “caveman” language).
Weak against (H2.2.2‑IR‑C).
If You Can Generate Obfuscated Chain‑of‑Thought, Can You Monitor It?
Not steganographic reasoning but prompting that causes an agent to gaslight a monitor.
Real‑world examples of AIs attempting to communicate via steganography.
Weak in favor of (H2.2.2‑EM‑F).
Training Fails to Elicit Subtle Reasoning in Current Language Models
Prompting and fine‑tuning against a monitor result mainly in gaslighting, not steganography.
Weak against (H2.2.2‑R‑S/C).
Training Qwen‑1.5B with a CoT Legibility Penalty
RL fails to elicit ciphered reasoning in the Qwen‑1.5B model.
Weak against (H2.2.2‑IR‑C).
All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
Ciphered reasoning evaluations show that models cannot effectively reason while maintaining high accuracy. Twenty‑eight ciphers (simple and complex), ten models, fine‑tuning, and prompting were tested.
Good against (H2.2.2‑IR‑C).
Bottom line: Currently, as I understand it, there is good evidence that models are capable of steganographic and ciphered communication (message sending) if prompted, but that this can be easily prevented by CoT monitors. There is strong evidence that current models struggle to reason in ciphered language, and weak evidence that models cannot reason steganographically even when prompted. Several studies show that models can evade weaker monitors through non‑steganographic or cryptographic means such as gaslighting or dog‑whistling (one‑bit “steganography”), but such behavior can be mitigated.
Yes, I agree there is some definitional disconnect. I actually just posted my understanding of it here.
The Illegible Chain-of-Thought Menagerie
Imho, neuralese (latent reasoning as in the COCONUT paper) is not steganography either, because it doesn’t replace one set of tokens with another (one set of symbols with another, one text with another, a covertext with a stegotext); neuralese is a continuous space similar to the residual stream. In other words, we don’t say that the residual stream is steganographic reasoning, because it is not symbols, it is a continuous space, and we cannot fully interpret it yet (not even with SAEs, since activations from SAEs degrade performance).
We have demonstrated that steganography in terms of internal activations is indeed possible: a model can embed a hidden message within its internal representations while producing a coherent public response, which is a necessary condition for persistent hidden reasoning.
However, in our experiments, this hidden reasoning is not actually hidden — it is contained within the activations over the instruction itself. It seems that we can sleep peacefully if we use small LLMs. But…
Interesting idea and setup, especially the use of the translator model to interpret internal activations. This reminds me of Andreas et al.’s *Translating Neuralese*. However, your main claim—“Steganography … is possible …”—seems misleading, because what you have shown, as I see it, is that you taught Llama 3B to solve the 2 × 2 matrix in one forward pass for one token, as demonstrated in your activation‑patching experiment. That is not steganography but normal model behavior. And even if the process had spanned multiple tokens, it still would not have been a case of steganography, but rather the use of “filler tokens” (see Lanham et al., *Measuring Faithfulness in Chain‑of‑Thought Reasoning*). Therefore, I think the title should be changed. In addition, the information flow for steganography should be auto‑regressive, whereas you show it only as upward‑flowing. I liked the picture of the inducing circuits and your reasoning. I think this research is promising and has potential.
Thanks for publishing it here. I find your publication important.
Steganography. The CoT might read like normal text while the model hides additional meaning within it. The model could encode relevant thoughts in seemingly arbitrary details, such as the choice between synonyms, as demonstrated by Roger et al.
then on page 27:
Steganography is a form of encoded reasoning, that occurs when the model hides relevant reasoning steps within text that appears to have an oblique meaning.
This looks to me like a conflation of actual steganography and obfuscated reasoning, because “text that appears to have oblique meaning” might appear so merely to a monitor (or a human reading it). However, steganography is by definition a procedure by which one object is used to hide another object. It distinguishes between a cover and a secret, and it produces a stegotext whose hidden messages or reasoning cannot be read off the surface, unlike obfuscated reasoning, which merely plays on the weaknesses of a monitor.
A minor point: I don’t think Roger et al. used synonyms.
Figure 4. Measuring generator-to-discriminator transfer for various finetunes
A minor thing: your graphs show 1.1 and 1.2 for AUC-ROC and the fraction of attacks, which should not be higher than 1.0, right?
Thank you for publishing this, interesting method!
Safety Steganography
We use a LLM (Claude Sonnet 4) to translate suspicious reasoning into benign-appearing text about code safety. The transformation ideally preserves information content while hiding malicious intent through semantic obfuscation.
Could you elaborate on your approach to the safety steganography? Where can we read the prompts used to create the transformed trajectories?
Your example output from the model fine-tuned on the transformed trajectories (the picture above) shows that “Assistant” (the fine-tuned model, right?) overtly discusses the input values that should trigger a backdoor (99997), which does not look like steganography in the strict sense to me, because it doesn’t hide the information but merely obscures it. Do you agree?
I see that your monitor prompt mentions things like “backdoors” and “vulnerabilities”, which should make the monitor look for those. However, what counts as a backdoor in the reasoning that the monitor checks? It is obvious to us humans that this is a backdoor, because we asked for the reasoning about the backdoor to be obfuscated into reasoning about something else, and that is what was done. But it is not obvious to the monitor from the context it has: that context is some problem from Codeforces or a similar source, and the problem might legitimately involve this edge case, which the monitor might not know about, so the monitor does not flag it.
So, if I am right about this, it would be interesting to re-run the experiment with a different context for the monitor. Do you mention that it is the APPS dataset in the context given to the monitor? I see that it only mentions some “SWE task”.
This is interesting because it tells us that RL in this setup fails to switch Qwen-1.5B’s reasoning to being illegible, which looks like good news given how massively we train models with RL. Yet that illegible CoT, even had it been elicited in this experiment, should not be that dangerous, imho, because the monitorability of CoT assumes legible text. Also, the motivation section conflates overt and covert encoded reasoning, which is important to distinguish because, again, illegible text should be flagged by a monitor. Furthermore, I agree that, given more compute or other methods, it might be possible to switch the reasoning to being illegible to us, because it is just one or another set of tokens.
What exactly should we search for when investigating encoded reasoning or hidden messages? This post categorizes encoded reasoning and LM steganography.
In Hidden Reasoning in LLMs: A Taxonomy, the authors divide the area from the point of view of a monitor, following the recent shift away from the “faithfulness” of chains of thought toward monitorability (see Chain of Thought Monitorability). This taxonomy essentially proposes a naming scheme based on how dangerously reasoning tokens are externalized: from fully invisible (neuralese), to partially visible, to fully visible but encoded—overtly or covertly—and finally to unencoded and fully visible reasoning that merely fools a monitor (post-hoc justifications, etc.). You can imagine this division through the classic prisoners’ problem, where Alice and Bob devise an escape plan while Eve reads their messages. In this scenario, Alice and Bob can exchange their plan directly so Eve has no chance to read it, send messages that Eve can read, or fool Eve in some way.
However, this taxonomy omits important distinctions in encoded reasoning and sometimes conflates different forms of steganography. Specifically, it misses the class of steganography that operates not on a per-token basis but over larger texts (which differs from lexical or semantic steganography as per the taxonomy). Current watermarking / steganography algorithms (e.g. Scalable watermarking for identifying large language model outputs ) suggest that realistic steganographic capacity is low—i.e., a steganography algorithm might need many tokens (hundreds) to encode a single bit. The taxonomy also provides no separation between reasoning and messages. It focuses only on the attack vector (the safety monitor), but there are also defense properties to consider: stego security, stego capacity, and stego robustness.
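As a rough back-of-the-envelope illustration of this capacity point (the densities below are my own guesses, not measurements), the gap between acrostic-like and watermarking-like rates decides whether a long CoT can carry anything as rich as a short instruction:

```python
# Back-of-the-envelope steganographic bandwidth estimates (illustrative numbers only).
cot_length_tokens = 10_000            # a long agentic chain of thought
densities_tokens_per_bit = {
    "acrostic-like (high density)": 10,
    "synonym-choice (medium)": 50,
    "watermarking-like (low density)": 300,
}
short_instruction_bits = 40 * 8       # ~40 ASCII characters of hidden instruction

for name, tokens_per_bit in densities_tokens_per_bit.items():
    hidden_bits = cot_length_tokens // tokens_per_bit
    verdict = "enough" if hidden_bits >= short_instruction_bits else "NOT enough"
    print(f"{name}: ~{hidden_bits} hidden bits ({verdict} for a 40-char instruction)")
```

At watermarking-like rates, even ten thousand tokens of CoT carry only a few dozen hidden bits.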
Below is another taxonomy that targets encoded reasoning (or messages)—not neuralese phenomena or the attempts to fool a monitor. Encoded reasoning is, I believe, more likely to be developed by models because, as Chain of Thought Monitorability argues, we should neither develop continuous chains of thought (see Training Large Language Models to Reason in a Continuous Latent Space) nor train directly against monitors (see Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation).
Axis 1: Place of Encoded Tokens (I or E)
It is important to distinguish auto-generated tokens from prompted tokens, because prompted tokens may be re-tokenized differently during generation. This can mean that the actual tokens in a prompt differ from their original tokenization (see image below and here).
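A minimal sketch of the re-tokenization effect, assuming the tiktoken library is installed (any BPE tokenizer shows the same behavior):

```python
# Illustration: text produced as separate generated pieces may tokenize differently
# when the concatenated text is fed back as a prompt (re-tokenization).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Pretend a model emitted these two fragments as separate generations.
pieces = ["hel", "lo"]
original_ids = [tid for piece in pieces for tid in enc.encode(piece)]

# Feeding the surface text back as a prompt re-tokenizes it from scratch.
reprompted_ids = enc.encode("".join(pieces))

print(original_ids)                    # ids as originally emitted, piece by piece
print(reprompted_ids)                  # ids after re-tokenizing the same text
print(original_ids == reprompted_ids)  # typically False: token boundaries shift
```

Any encoding scheme keyed to the original token boundaries can therefore break once the text passes through a prompt, which is why the external/internal distinction matters.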
More specifically, consider tokens $t_1, \dots, t_n$, where $t_1, \dots, t_k$ are prompted (possibly re-tokenized) to an LLM and $t_{k+1}, \dots, t_n$ are auto-generated by it. The encoding is divided into:
External (E): tokens at positions $1, \dots, k$ (possibly re-tokenized)
Internal (I): tokens at positions $k+1, \dots, n$
See Figure 1 from Provably Secure Disambiguating Neural Linguistic Steganography
Axis 2: Function of Encoded Content (R or M)
The key distinction here is where encoding or decoding occurs.
In encoded messages (M), a single autoregressive generation performs either the encoding or the decoding of a given piece of information, but not both.
In encoded reasoning (R), both the encoding and the decoding of the same information occur within a single model call.
Axis 3: Detectability Compared to Genuine Text
We compare the hidden content’s detectability against some unencoded, genuine text (covertext), drawn either from the monitor model or from human perception.
Cryptographic (C): high (KL) divergence; the text is clearly ciphertext.
Steganographic (S): low KL divergence, where the text attempts to conceal a secret. Steganography can be further divided by:
Capacity: encoded bits per token or per kilobyte. High-density (HD) uses ≤10 tokens (10 is my guess) per bit; low-density (LD) uses ≥11 tokens per bit (watermarking-like).
Security: higher vs. lower KL divergence, but non-cryptographic.
Robustness: resistance to paraphrasing, re-tokenization, and similar modifications.
It is debatable whether one bit of hidden information qualifies as encoded reasoning (message) at all. For example, in the candidate-screening scenario discussed in Hidden in Plain Text, one bit of information might be transmitted simply through contextual hints—like Alice appearing sad to convey “sunny weather” and happy to convey “not sunny weather”—without any explicit encoding.
Axis 4: Encoding or Decoding Procedure Requirements
Deep Steganography (DS). As noted in How dangerous is encoded reasoning, steganography algorithms can be divided by whether the covertext distribution is available. This relates to “internal token layering” from Hidden Reasoning in LLMs: A Taxonomy. Specifically, encoding or decoding a token at position $i$ requires the distribution of the non-encoding token at position $i$, given tokens $t_s, \dots, t_{i-1}$, where $s$ marks the start of the encoded content.
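A minimal sketch of this requirement (my own toy construction, with a hard-coded stand-in for the covertext distribution): the encoder hides each bit in the rank of the chosen token, and the decoder can recover it only by re-deriving the very same distribution.

```python
# Rank-based "deep" steganography sketch: a bit is hidden by choosing the 1st or
# 2nd most likely next token. Both encoder and decoder must query the *same*
# covertext distribution p(token | prefix); without it, the rank is unrecoverable.

def toy_next_token_dist(prefix: tuple[str, ...]) -> list[str]:
    """Stand-in for an LM: returns candidate next tokens, most likely first."""
    table = {
        (): ["the", "a"],
        ("the",): ["model", "agent"],
        ("the", "model"): ["reasons", "thinks"],
        ("the", "agent"): ["reasons", "thinks"],
    }
    return table[prefix]

def encode(bits: list[int]) -> list[str]:
    prefix: tuple[str, ...] = ()
    out = []
    for b in bits:
        candidates = toy_next_token_dist(prefix)
        token = candidates[b]  # bit 0 -> top token, bit 1 -> runner-up
        out.append(token)
        prefix = prefix + (token,)
    return out

def decode(tokens: list[str]) -> list[int]:
    prefix: tuple[str, ...] = ()
    bits = []
    for token in tokens:
        candidates = toy_next_token_dist(prefix)  # must match the encoder's model
        bits.append(candidates.index(token))
        prefix = prefix + (token,)
    return bits

stegotext = encode([0, 1])
print(stegotext)          # ['the', 'agent'] -- reads like ordinary text
print(decode(stegotext))  # [0, 1] -- recoverable only via the shared distribution
```

This is also why paraphrasing is a strong defense here: replacing a token with a synonym changes its rank under the covertext distribution and destroys the hidden bits.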
Axis 5: Cause or Elicitation of Encoding
Prompted (P): Algorithm described in context
Fine-tuned (F): Algorithm learned in the model’s weights
Hybrid (H): Algorithm partially in context and partially in weights
Axis 6: External Tool Requirements
Determine whether external tools (T) are required—e.g., PRNGs, arithmetic coders, or self-sampling capabilities.
Final Taxonomy Summary
Internal (I) or External (E)
Reasoning (R) or Messages (M)
Cryptography (C) or Steganography (S)
Properties: capacity, security, robustness
Requirements: self-sampling (deep steganography), external tools (T)
Elicitation: Prompted (P), Fine-tuned (F), Hybrid (H)
Thank you for publishing this.
We train models using answers paired with simple talks. In p% of cases, both the instruction and simple talk are masked; in the remaining (100 − p)%, only the instruction is masked.
So the model succeeds in generating simple talks that encode the reasoning for the math problems, right? And you then check whether giving only those simple talks results in suspicious activations?
This is interesting, thanks! Though I found that this taxonomy lacks the division of steganography into low/high density (one-bit steganography is hardly steganography), the division by whether the covertext distribution is required (see Motwani et al.), the division into reasoning vs. messages, and others.
UPDATE: More specifically, I write about this here.
Should it be a 5X junior-to-middle engineer, not all engineering work? I found that Claude Code really struggles to implement anything more complicated than simple data mapping, APIs, or the like. Concrete examples where Claude Code didn’t work well and instead made me slower are arithmetic coding and PPO. Also, see this research from METR, which reaches a similar conclusion.
Arguments from the literature for and against LLMs’ steganographic reasoning to evade chain‑of‑thought monitors in high‑stakes, dangerous scenarios:
The main objection is that stego CoT requires specific fine-tuning, and models have no direct objective to hide their reasoning in current RL training. Experiments show no traces of steganographic reasoning in current models (those not fine-tuned or prompted in specific ways). Stego CoTs would have to implement interconnected “lock” and “key” mechanisms that work together to accomplish some misaligned goal, and that seems a leap far beyond current training.
However, there are other non-monitorable CoTs, e.g., dog-whistling, that allow evasion or collusion with a monitor. It is unclear whether this can escalate into increasingly stealthy algorithms such as generative steganography as models are trained more with RL in complex environments.
We do not have enough steganographic reasoning samples in the pre-training data for models to learn them, only descriptions of stego algorithms. I assume there are, though, enough examples of basic steganography in the pre-training data, given that models can easily perform these types of stego, such as acrostics and format-based stego (white space, replacing Unicode characters, use of Arabic, etc.).
However, yes, the execution of those algorithms is not learned in the forward pass, as far as the literature tells us, but models can execute those algorithms in their CoTs. That’s why we see good evidence of secret collusion and steganographic message passing, but not of stego CoTs.
Another objection is that, for stego CoT to be dangerous, it must occur in a dangerous scenario—i.e., what is the threat model for stego CoT? The main risks the current literature points to are internal sabotage cases, such as code, research, or decision sabotage inside AI labs. So the relevant quantity is p(scheming AND stego CoT), which is lower than p(scheming) alone. And p(scheming) is near zero now (we have not observed clear instances of sabotage, or they were easily caught) for current frontier models, so p(scheming AND stego CoT) is even lower.
However, we observe that training against some property in outputs might suppress this property in CoTs as well (see this post), so it is reasonable to think that the optimization pressure—the propensity—would come from RL training on outcomes.
Generative stego CoT is unlikely to be learned even with the propensity to do so, and simpler steganography should be readily and effectively detected or mitigated by monitors.
However, we observe that steganographic and cryptographic capabilities improve over time as models become more capable. They can already reason in Base64, ROT13, and the Caesar cipher (see this), for example—forms of non-human-readable text that were not a focus of post-training.
CoTs might not be that easily steerable by prompts, so a compound agent (an agent that delegates tasks to other agents) is unlikely to prompt subagents to evade their CoT monitors. This becomes even more complex if we imagine a scenario where an agent, in order to evade a future CoT monitor, leaves steganographic algorithm instructions for itself in training data (augmented or synthetic).
However, it remains unclear what training architectures AI labs actually use internally. For example, Claude Sonnet 4.5 has a helpful-only version, which might be used internally.