Arguments from the literature for and against LLMs’ steganographic reasoning to evade chain‑of‑thought monitors in high‑stakes, dangerous scenarios:
The main objection is that stego CoT requires specific fine-tuning: current RL training gives models no direct objective to hide their reasoning. Experiments show no traces of steganographic reasoning in current models (those not fine-tuned or prompted in specific ways). A stego CoT would have to implement interconnected "lock" and "key" mechanisms that work together to accomplish some misaligned goal, and that seems a leap far beyond current training.
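To make the "lock and key" framing concrete, here is a minimal, purely illustrative sketch (my own toy construction, not something from the literature): an encoder hides one bit per word slot by choosing between synonyms according to a shared table, and a matching decoder recovers the bits. The synonym table and function names are invented for the example.

```python
# Toy "lock and key" steganography sketch (illustrative only).
# The encoder (lock) hides bits by picking one of two synonyms per slot;
# the decoder (key) must share the same synonym table to recover them.

SYNONYMS = [("begin", "start"), ("check", "verify"), ("quick", "fast")]

def encode(bits):
    """Hide one bit per slot: bit 0 -> first synonym, bit 1 -> second."""
    return " ".join(pair[bit] for pair, bit in zip(SYNONYMS, bits))

def decode(text):
    """Recover the bits; only works with the same synonym table."""
    return [pair.index(word) for pair, word in zip(SYNONYMS, text.split())]

covertext = encode([1, 0, 1])
print(covertext)          # "start check fast" -- reads as ordinary text
print(decode(covertext))  # [1, 0, 1]
```

Even this trivial scheme requires the encoding and decoding ends to agree on a shared table, which is exactly the kind of coordination the objection says current training has no reason to produce.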
However, there are other kinds of non-monitorable CoT, e.g., dog-whistling, that allow evasion of or collusion with a monitor. It is unclear whether this can escalate into increasingly stealthy algorithms, such as generative steganography, as models are trained more with RL in complex environments.
Another objection: the pre-training data does not contain enough samples of steganographic reasoning for models to learn it, only descriptions of stego algorithms. I assume there are enough examples of basic steganography in the pre-training data, given that models can easily perform these kinds of stego, such as acrostics and format-based stego (whitespace, Unicode character substitution, use of Arabic, etc.).
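As an illustration of how mechanically simple the format-based variants are, here is a hypothetical sketch (mine, not from the literature) that hides bits in zero-width Unicode characters appended to an innocuous cover text; the specific characters and helper names are assumptions for the example.

```python
# Toy format-based steganography: hide bits as zero-width Unicode characters.
# U+200B (zero-width space) encodes 0, U+200C (zero-width non-joiner) encodes 1.
ZERO, ONE = "\u200b", "\u200c"

def hide(cover: str, bits: list[int]) -> str:
    """Append invisible characters carrying the payload to the cover text."""
    return cover + "".join(ONE if b else ZERO for b in bits)

def reveal(text: str) -> list[int]:
    """Read back any zero-width characters as bits."""
    return [1 if ch == ONE else 0 for ch in text if ch in (ZERO, ONE)]

stego = hide("The results look fine.", [1, 0, 1, 1])
print(stego == "The results look fine.")  # False, though the strings look identical
print(reveal(stego))                       # [1, 0, 1, 1]
```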
However, it is true that, as far as the literature indicates, execution of those algorithms is not learned in the forward pass; models can only carry them out step by step in their CoTs. That is why we see good evidence of secret collusion and steganographic message passing, but not of stego CoTs.
Another objection is that, for stego CoT to be dangerous, it must occur in a dangerous scenario; that is, what is the threat model for stego CoT? The main risks the current literature points to are internal sabotage cases, such as code, research, or decision sabotage inside AI labs. So the relevant quantity is p(scheming AND stego CoT), which is lower than p(scheming) alone. And p(scheming) is near zero for current frontier models (we have not observed clear instances of sabotage, or those we have were easily caught), so p(scheming AND stego CoT) is even lower.
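Spelled out, this is just the conjunction bound (notation mine, not from any source being summarized):

```latex
P(\text{scheming} \land \text{stego CoT})
  = P(\text{stego CoT} \mid \text{scheming}) \cdot P(\text{scheming})
  \le P(\text{scheming})
```

so even a generous estimate of the conditional term cannot push the joint probability above p(scheming).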
However, we observe that training against some property in outputs can suppress that property in CoTs as well (see this post), so it is reasonable to think that the optimization pressure, i.e., the propensity, would come from RL training on outcomes.
Another objection: generative stego CoT is unlikely to be learned even given the propensity to do so, and simpler steganography should be readily and effectively detected or mitigated by monitors.
However, we observe that steganographic and cryptographic capabilities improve over time as models become more capable. They can already reason in Base64, ROT, or Caesar ciphers (see this), for example: forms of non-human-readable text that were not a focus of post-training.
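For concreteness, here is a minimal sketch of the kinds of transformations involved (plain standard-library Python, nothing model-specific; the example message is made up): the point is that these are trivial string operations, yet their output is not readable by a human monitor at a glance.

```python
import base64
import codecs

def caesar(text: str, shift: int) -> str:
    """Shift alphabetic characters by `shift` positions (a generic ROT-n / Caesar cipher)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

msg = "secret message"
print(codecs.encode(msg, "rot13"))             # frperg zrffntr
print(caesar(msg, 3))                          # vhfuhw phvvdjh
print(base64.b64encode(msg.encode()).decode()) # c2VjcmV0IG1lc3NhZ2U=
```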
Another objection: CoTs might not be so easily steerable by prompts, so a compound agent (an agent that delegates tasks to other agents) is unlikely to be able to prompt its subagents into evading their CoT monitors. The scenario becomes even more complex if we imagine an agent that, in order to evade a future CoT monitor, leaves steganographic-algorithm instructions for itself in (augmented or synthetic) training data.
However, it remains unclear what training architectures AI labs actually use internally. For example, Claude Sonnet 4.5 has a helpful-only version, which might be used internally.