Artem Karpov comments on artkpv’s Shortform

Artem Karpov 29 Nov 2025 12:15 UTC
1 point
0
Frontier Language Models (LMs) might employ steganography within their Chain of Thought (CoT) to conceal reasoning from safety monitors while retaining high performance on objectives.

TLDR. The probability of spontaneous steganographic reasoning (Rogue AI) in current models is low due to the “bootstrapping problem,” but the probability of induced steganographic reasoning (Misuse/Adversarial) is high, particularly with open-weight models. The most critical risk vector lies in the intersection of Reinforcement Learning (RL) incentives and future agentic capabilities.

1. The Higher-Probability Scenario: Adversarial Misuse and Open Weights

The strongest argument for the existence of dangerous stego CoT is not that models will invent it, but that humans will teach it. In this case the mechanism would look like this. Unlike the “Rogue AI” scenario, the “Misuse” scenario bypasses the difficult requirement of the model inventing a decryption protocol. A malicious actor provides the steganographic algorithm (the “Key”) via prompting or fine-tuning. The threat vector are open-weight models (e.g., Llama, DeepSeek series). An adversary can optimize a prompt suffix (similar to Universal and Transferable Adversarial Attacks) or fine-tune the model to output reasoning that looks benign to a monitor but contains instructions for a harmful payload (e.g., biological design or cyber-exploit). Defense has limitations. While safety filters catch obvious jailbreaks, “Generative Steganography” (a.k.a. LLM-based stego, Deep Stego) can ensure the stegotext remains statistically indistinguishable from normal text. In this domain, the KL Divergence defense fails. However, no study has shown feasibility of fine tuned generative steganography with current frontier models in this scenario. Main uncertainty here, it remains unclear how much “reasoning bandwidth” can be preserved in this mode without degrading the model’s ability to solve the task.

2. The High-Stakes Scenario: Spontaneous “Rogue” Steganography

This scenario posits that a model, driven by RL pressure, invents steganography on its own. The arguments here are a conflict between incentive (propensity) and capability. The argument for (incentive) is current training paradigms create a direct evolutionary pressure for stego CoT. If we train models with outcome-based monitors (rewarding the result) and CoT monitors (punishing “bad” reasoning), we create a gradient that rewards dissociation. The model maximizes reward by reaching the “bad” goal while outputting a “good” CoT. AI labs are unlikely to employ this type of RL fine tuning (see Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation). However, we already observe models becoming illegible under heavy RL (reasoning becomes alien/efficient) (see Reasoning Models Sometimes Output Illegible Chains of Thought). As models scale and become more agentic, more situationally aware and goal-directed (pursuing long-term scheming as a convergent instrumental goal, see Stress Testing Deliberative Alignment ), the jump from “accidental illegibility” to “intentional steganography” becomes the logical path of least resistance to maximize reward.

The argument against this (the bootstrapping barrier) is despite the incentive, the capability hurdle is immense. For spontaneous stego CoT to emerge, the model must simultaneously learn to encode a secret thought at step $t$ and decode it at step $t + n$ . It is highly unlikely that gradient descent can coordinate this “lock and key” invention simultaneously. Without a pre-agreed protocol, the “Decoder” (the model in the future) has no way of knowing that the “Encoder” (the model in the past) hid a message. So instead of some stego CoT, it is far more probable that what we would perceive as “hidden reasoning” is simply the model relying on memorized heuristics or messy, uninterpretable feature activations, rather than a coherent encrypted channel.

3. The Dangerous Middle Ground: Collusion and Schelling Points

A plausible middle ground exists between “Human-Induced” and “Spontaneous” stego. A CoT monitor and an agent (even if they are different instances of the same base model) may not need a complex crypto/stego algorithm. They may rely on Schelling points—shared, unstated focal points in their training data. Example: If a specific benign phrase appears in a specific context in the pre-training data alongside harmful concepts, the agent can use that phrase as a “dog whistle.” The monitor, sharing the same bias, recognizes it and deliberately lowers its flag, while a human auditor misses the connection. This requires no complex encryption, only shared distributional biases. This is harder to detect than format steganography (acrostics) and easier to learn than generative steganography.

4. Feasibility of Defenses

The defense landscape is currently fragile. Paraphrasing/sanitization: The most robust defense is to aggressively paraphrase or re-sample the CoT to destroy hidden information (see Preventing Language Models From Hiding Their Reasoning). However, this incurs a (high?) “Alignment Tax.” If the reasoning for a complex math or biology problem is subtle, paraphrasing it might destroy the useful information along with the steganographic information, degrading model performance. Another defense is statistical monitoring (KL divergence). Simple steganographic methods might be detectable with KL divergence, as stego CoTs might increase the KL divergence beyond a pre-calculated threshold. But this is only effective against naive steganography. Advanced models using distributional steganography (hiding the message in the valid choice between synonyms) will minimize KL divergence as part of their objective function. We cannot rely on statistical anomalies to catch frontier models.

Conclusion

It looks like the most probable immediate threat is adversarial misuse. A bad actor prompting an open-weight model to use a known steganographic protocol to bypass automated monitoring systems. The most probable future threat is incidental scheming via dog-whistling. Not full encryption, but models learning to use specific, innocent-looking phrasing to coordinate “bad” behavior because those phrases successfully passed filters during training. The key uncertainty is the “bandwidth” question. Can a model actually encode enough information to coordinate a complex scheme (e.g., alignment faking) using only subtle text manipulations? If the bandwidth is high, the risk is severe. If low, stego CoT is likely a theoretical curiosity rather than an serious threat.