We have demonstrated that steganography at the level of internal activations is indeed possible: a model can embed a hidden message within its internal representations while producing a coherent public response, a necessary condition for persistent hidden reasoning.
However, in our experiments this hidden reasoning is not actually hidden: it is contained within the activations over the instruction itself. It seems that we can sleep peacefully if we use small LLMs. But…
Interesting idea and setup, especially the use of the translator model to interpret internal activations. This reminds me of Andreas et al.’s *Translating Neuralese*. However, your main claim—“Steganography … is possible …”—seems misleading, because what you have shown, as I see it, is that you taught Llama 3B to solve the 2 × 2 matrix in one forward pass for one token, as demonstrated in your activation‑patching experiment. That is not steganography but normal model behavior. And if the process had spanned multiple tokens, it still would not have been the case of steganography, but rather the use of “filler tokens” (see Lanham et al., *Measuring Faithfulness in Chain‑of‑Thought Reasoning*). Therefore, I think the title should be changed. In addition, the information flow for steganography should be auto‑regressive, whereas you show it only as upward‑flowing. I liked the picture of the inducing circuits and your reasoning. I think this research is promising and has potential.
Regarding the autoregressive form of information flow: actually, the hidden message is extracted while the model is in generation mode. That is, the model generates text as usual (after fine-tuning), but we extract the second message from the final activations!
Thank you very much for your comment! Yeah, basically we found that all the information for the hidden message was computed over just one token (which is fairly normal). But we also showed that this information could then be copied and settled alongside the activations for the public answer. That is, there is a possibility for a model to generate text while carrying, in its final residual-stream activations, information related to some other objective. It is in this sense that we meant steganography. But indeed, activation patching and a subsequent (unpublished) information-flow analysis showed that there is no 'hidden reasoning': the information for the hidden message is moved to the final residual stream before the generation of the simple talk. I think if we got a situation where, while the model generates simple talk, some other concealed message were generated, that would be hidden reasoning. Perhaps an RL approach might naturally reveal that. Actually, one of my concerns here is whether such a situation will emerge in a setting like this: https://arxiv.org/abs/2412.06769. Perhaps the probability is low, but not zero.
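To make concrete what "extracting a second message from final activations" could look like, here is a toy NumPy sketch. Everything in it is an assumption for illustration, not the paper's actual setup: the dimensions, the synthetic activations, and the linear "translator" probe are all hypothetical stand-ins for the fine-tuned model and translator model described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each generated token leaves a final residual-stream
# activation of dimension d_model; we pretend fine-tuning has superposed a
# hidden-message component on top of the activations driving the public text.
d_model, hidden_vocab = 64, 10

# Hypothetical "translator": a linear probe mapping final activations
# to logits over a small hidden-message vocabulary.
W = rng.normal(size=(hidden_vocab, d_model))

hidden_message = [3, 1, 4, 1, 5]

# Simulate final activations: a public-answer component plus a component
# aligned with the probe direction for the hidden symbol at each position.
activations = []
for sym in hidden_message:
    public_part = rng.normal(size=d_model)               # drives visible text
    hidden_part = 10.0 * W[sym] / np.linalg.norm(W[sym])  # encodes the symbol
    activations.append(public_part + hidden_part)
H = np.stack(activations)  # shape (n_tokens, d_model)

# Extraction: decode the hidden message from final activations alone,
# without ever looking at the generated public tokens.
decoded = (H @ W.T).argmax(axis=1).tolist()
print(decoded)
```

The point of the sketch is only the information flow: the public text and the hidden message share the same residual stream, and the hidden channel is read off with a separate decoder rather than from the tokens themselves.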
Imho, neuralese (latent reasoning, as in the COCONUT paper) is not steganography either, because it doesn't replace some tokens with other tokens (some symbols with other symbols, one text with another, covertext with stegotext); neuralese is a continuous space similar to the residual stream. In other words, we don't say that the residual stream is steganographic reasoning, because it is not symbols, it is a continuous space, and we cannot fully interpret it yet (not even with SAEs, since activations from SAEs degrade performance).
Perhaps there’s a definitional disconnect. The conception of steganography we are working with is not constrained to the symbolic level. Rather, it extends to a more speculative scenario at the activation level: where activations correlated with the primary text might simultaneously encode activations for a secondary text, one that is absent from the observable Chain-of-Thought. It’s pretty clear from our introduction.
Yes, I agree there is some definitional disconnect. I actually just posted my understanding of it here.
Nice! Thanks for sharing. Will take a look.