Ilia Shirokov

Karma: 21

Ilia Shirokov 18 Nov 2025 12:33 UTC
1 point
0
in reply to: Artem Karpov’s comment on: Sleep peacefully: no hidden reasoning detected in LLMs. Well, at least in small ones.
No, that does not seem to be the case in our setting. Every fine-tuning strategy we tried disrupted the model’s behavior: the model starts producing answers to the hidden math problems right before the Simple Talk part (not during it). Hence such a secret circuit, if it exists, is not natural. In our next post, we tried a more aggressive approach, which revealed similar effects; in addition, we showed that a secondary activation structure might be encoded in the final residual-stream activations (and, in fact, this also works for layers that are close to the end but not the very last).

Ilia Shirokov 18 Nov 2025 12:06 UTC
1 point
0
in reply to: Artem Karpov’s comment on: We can still sleep peacefully — or so we thought. Steganography via internal activations is already possible in small language models — a potential first step toward persistent hidden reasoning.
Perhaps there’s a definitional disconnect. The conception of steganography we are working with is not constrained to the symbolic level. Rather, it extends to a more speculative scenario at the activation level, where activations correlated with the primary text might simultaneously encode activations for a secondary text, one that is absent from the observable Chain-of-Thought. It’s pretty clear from our introduction.

Ilia Shirokov 17 Nov 2025 17:15 UTC
1 point
0
in reply to: Artem Karpov’s comment on: We can still sleep peacefully — or so we thought. Steganography via internal activations is already possible in small language models — a potential first step toward persistent hidden reasoning.
Regarding the autoregressive form of the information flow: no, actually the hidden message is extracted while the model is in generation mode. That is, the model generates text as usual (after fine-tuning), but we extract a second message from the final activations!
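
To make "extracting a second message from the final activations" concrete, here is a minimal sketch on purely synthetic data. It is not the code from the post: the activation model (a public-token direction plus a weaker hidden-token direction plus noise), the names `E_pub`, `E_hid`, `fit_linear_probe`, `probe_predict`, and the choice of a least-squares linear probe are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_public, n_hidden, n_samples = 32, 5, 4, 400

# Hypothetical directions: one per public token, one per hidden-message token.
E_pub = rng.normal(size=(n_public, d_model))
E_hid = rng.normal(size=(n_hidden, d_model))

# Synthetic "final residual-stream activations": the public token dominates,
# and the hidden token is superimposed with a smaller magnitude.
public = rng.integers(0, n_public, size=n_samples)
hidden = rng.integers(0, n_hidden, size=n_samples)
acts = (E_pub[public]
        + 0.5 * E_hid[hidden]
        + 0.1 * rng.normal(size=(n_samples, d_model)))


def fit_linear_probe(x, labels, n_classes):
    """Least-squares linear probe (with bias) onto one-hot labels."""
    x_aug = np.hstack([x, np.ones((len(x), 1))])
    targets = np.eye(n_classes)[labels]
    weights, *_ = np.linalg.lstsq(x_aug, targets, rcond=None)
    return weights


def probe_predict(weights, x):
    """Predict the class with the largest probe output."""
    x_aug = np.hstack([x, np.ones((len(x), 1))])
    return (x_aug @ weights).argmax(axis=1)


# Fit on the first 300 positions, evaluate on the held-out remainder.
train, test = slice(0, 300), slice(300, None)
W = fit_linear_probe(acts[train], hidden[train], n_hidden)
hidden_acc = float((probe_predict(W, acts[test]) == hidden[test]).mean())
print(f"held-out accuracy of the hidden-message probe: {hidden_acc:.2f}")
```

In this toy setup the same activation vector supports reading out both the public token and the hidden one. Whether a real fine-tuned model stores a second message in a linearly decodable form like this is an empirical question that the sketch does not answer.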

Ilia Shirokov 17 Nov 2025 16:59 UTC
1 point
0
in reply to: Artem Karpov’s comment on: We can still sleep peacefully — or so we thought. Steganography via internal activations is already possible in small language models — a potential first step toward persistent hidden reasoning.
Thank you very much for your comment! Yeah, basically we found that all the information for the hidden message was computed over just one token (which is almost normal). But we also showed that this information could then be copied and placed alongside the activations for the public answer. That is, it is possible for a model to generate text while holding (in the final residual-stream activations) information related to some other objective. This is the sense in which we meant steganography. But indeed, activation patching and then an (unpublished) information-flow analysis showed that there is no ‘hidden reasoning’: the information for the hidden message is transferred to the final residual stream before the simple talk is generated. I think that if we had a situation where some other concealed message really is generated while the model generates the simple talk, that would be hidden reasoning. Perhaps an RL approach might naturally reveal that. Actually, one of my concerns here is whether such a situation will emerge in a setting like this: https://arxiv.org/abs/2412.06769. Perhaps the probability is low, but not zero.

Steganography via internal activations is already possible in small language models — a potential first step toward persistent hidden reasoning.

9 Aug 2025 11:44 UTC

7 points

Ilia Shirokov 6 Apr 2025 9:20 UTC
3 points
0
in reply to: Brendan Long’s comment on: Sleep peacefully: no hidden reasoning detected in LLMs. Well, at least in small ones.
Actually, I haven’t seen this article! Thank you very much; it seems very interesting, as do the references cited therein. However, I suppose the distribution from which “filler tokens” (or extra tokens) are drawn might matter, as well as how they are sequenced (that is, not just “…”, “abcd”, or “<pause>”; something more sophisticated might be more useful for a model). It would be very interesting to determine which “filler sequences” are most suitable for hiding computations for specific tasks (this is one of the directions we are working on) and which circuits are responsible for it (if they exist).

4 Apr 2025 20:49 UTC

17 points