Can frontier language models engage in collusion for steganography? Here is a write-up of a preliminary result along these lines, showing that Deepseek-v3 may be able to collude with other instances of itself to do steganography. And also that this steganography might be more subtle than we think.
Epistemic status; highly uncertain (and I’m sorry if this ends up being overclaiming, but I’m very excited at the moment).
I expect that you could nudge the model toward coming up with steganographic encodings that worked better. LLMs are mostly bad at spelling-related tasks because tokens. Maybe ask the encoding model to think about and list some steganographic approaches that it thinks it’s especially likely to be able to decode, prior to doing the task?
Maybe something like ‘think carefully step-by-step about what encodings you would be able to correctly decode if you saw them’ would help also?
‘This is extremely incredible and I have no explanation how this occurred’. Seems likely to just be surprisal-based under the hood; ‘I used a word here that I wouldn’t expect myself to have used’. That seems like a strong strategy to the extent that its logit distribution differs from the monitoring model’s. Maybe complicated by temperature, though? Seems like average number of false positives in decoding would increase steadily with temperature.
Out of curiosity, why Deepseek-v3 as opposed to eg Claude?
I vaguely think there’s been at least a bit of research on ‘just ask the model to do stego’ but could totally be wrong.
‘Introspection is one reason why Deepseek-v3 might understand its own encoding scheme better than Claude.’ By introspection you just mean ‘the fact that models are better at predicting themselves than predicting other models’ here, right? I think it would be clearer to just say that (I’m going to keep pushing back on ‘introspection’ as the term for that; I think we should reserve the term for uses that correspond more to the typical meaning of ‘introspection’).
Seems likely to just be surprisal-based under the hood; ‘I used a word here that I wouldn’t expect myself to have used’
Yup this makes sense; although this still seems like a pretty incredible claim because it requires:
The encoder and decoder instance both know what kinds of words it uses (c.f. being able to articulate their goals), and this is robust enough to transmit information
The encoder and decoder both spontaneously settle on this as a schelling point for encoding the message.
I’m pretty skeptical that a human would be able to coordinate with a hypothetical clone of themselves this well, especially in the randomly-generated passphrase setting.
why Deepseek-v3 as opposed to eg Claude?
No particular reason beyond lack of time! If I do a more systematic version of this I will definitely try to replicate this in more models.
The encoder and decoder both spontaneously settle on this as a schelling point for encoding the message.
LLMs do seem to be pretty good at picking self-consistent Schelling points, at least in simple cases—I’ve got a writeup here of some casual experiments I did with GPT-4 last January on picking various Schelling points, eg a date, a number, a word (also some discussion of that in the MATS slack).
this still seems like a pretty incredible claim
I think it seems somewhat less surprising to me (maybe because of the Schelling point experiments), but I certainly wouldn’t have been confident that it would do this well.
PS—I’m loving the frequent shortform posts, I hope you continue! I try to do something somewhat similar with my research diary, but usually no one reads that and it’s certainly not daily. I’m tempted to try doing the same thing :)
I wonder if r1 could reliably make the claim that the text had stego-text without being told—i.e. give it a lineup of ten pieces of short creative writing that another instance generated, 1 has steg. See if r1 can pick out the steg, then if various monitors (Sonnet, o1) can also pick it out
What is the threat model you are probing for here?
The prompting approach suffers from LLMs’ shallowness in strategic thinking; I would not be surprised if, prompting a language model several times to come up with an encoding scheme would result in a highly concentrated distribution.
> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ’recovering the plaintext message’
While i agree in principle, as in all red/blue team interactions, you may not want your blue team to be trigger happy; you still want to be able to extract some usefulness out of your scheming models, thus making this hope somewhat less robust. This would be a cool research question nonetheless.
> Both of these [steganography inducing methods] involve finetuning the model
Note that in Hidden in Plain Text, we also work with in-context RL, which is not explicitly finetuning the model, but just giving rewards to rollouts in context.
As originally conceived, this is sort of like a “dangerous capability” eval for steg. The argument being that, if a model can do steg in this very toy setting where we’ve significantly nudged the model, it might do steg in more realistic scenarios.
There is also a claim here of the form “language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information”.
I didnt’ make these points especially clear in the slide deck—thanks for the feedback!
Can frontier language models engage in collusion for steganography? Here is a write-up of a preliminary result along these lines, showing that Deepseek-v3 may be able to collude with other instances of itself to do steganography. And also that this steganography might be more subtle than we think.
Epistemic status; highly uncertain (and I’m sorry if this ends up being overclaiming, but I’m very excited at the moment).
A few thoughts:
I expect that you could nudge the model toward coming up with steganographic encodings that worked better. LLMs are mostly bad at spelling-related tasks because tokens. Maybe ask the encoding model to think about and list some steganographic approaches that it thinks it’s especially likely to be able to decode, prior to doing the task?
Maybe something like ‘think carefully step-by-step about what encodings you would be able to correctly decode if you saw them’ would help also?
‘This is extremely incredible and I have no explanation how this occurred’. Seems likely to just be surprisal-based under the hood; ‘I used a word here that I wouldn’t expect myself to have used’. That seems like a strong strategy to the extent that its logit distribution differs from the monitoring model’s. Maybe complicated by temperature, though? Seems like average number of false positives in decoding would increase steadily with temperature.
Out of curiosity, why Deepseek-v3 as opposed to eg Claude?
I vaguely think there’s been at least a bit of research on ‘just ask the model to do stego’ but could totally be wrong.
‘Introspection is one reason why Deepseek-v3 might understand its own encoding scheme better than Claude.’ By introspection you just mean ‘the fact that models are better at predicting themselves than predicting other models’ here, right? I think it would be clearer to just say that (I’m going to keep pushing back on ‘introspection’ as the term for that; I think we should reserve the term for uses that correspond more to the typical meaning of ‘introspection’).
Thanks egg, great thoughts!
Yup this makes sense; although this still seems like a pretty incredible claim because it requires:
The encoder and decoder instance both know what kinds of words it uses (c.f. being able to articulate their goals), and this is robust enough to transmit information
The encoder and decoder both spontaneously settle on this as a schelling point for encoding the message.
I’m pretty skeptical that a human would be able to coordinate with a hypothetical clone of themselves this well, especially in the randomly-generated passphrase setting.
No particular reason beyond lack of time! If I do a more systematic version of this I will definitely try to replicate this in more models.
LLMs do seem to be pretty good at picking self-consistent Schelling points, at least in simple cases—I’ve got a writeup here of some casual experiments I did with GPT-4 last January on picking various Schelling points, eg a date, a number, a word (also some discussion of that in the MATS slack).
I think it seems somewhat less surprising to me (maybe because of the Schelling point experiments), but I certainly wouldn’t have been confident that it would do this well.
PS—I’m loving the frequent shortform posts, I hope you continue! I try to do something somewhat similar with my research diary, but usually no one reads that and it’s certainly not daily. I’m tempted to try doing the same thing :)
I wonder if r1 could reliably make the claim that the text had stego-text without being told—i.e. give it a lineup of ten pieces of short creative writing that another instance generated, 1 has steg. See if r1 can pick out the steg, then if various monitors (Sonnet, o1) can also pick it out
What is the threat model you are probing for here?
The prompting approach suffers from LLMs’ shallowness in strategic thinking; I would not be surprised if, prompting a language model several times to come up with an encoding scheme would result in a highly concentrated distribution.
> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ’recovering the plaintext message’
While i agree in principle, as in all red/blue team interactions, you may not want your blue team to be trigger happy; you still want to be able to extract some usefulness out of your scheming models, thus making this hope somewhat less robust. This would be a cool research question nonetheless.
> Both of these [steganography inducing methods] involve finetuning the model
Note that in Hidden in Plain Text, we also work with in-context RL, which is not explicitly finetuning the model, but just giving rewards to rollouts in context.
As originally conceived, this is sort of like a “dangerous capability” eval for steg. The argument being that, if a model can do steg in this very toy setting where we’ve significantly nudged the model, it might do steg in more realistic scenarios.
There is also a claim here of the form “language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information”.
I didnt’ make these points especially clear in the slide deck—thanks for the feedback!
Agreed on the rest of points!
>As originally conceived, this is sort of like a “dangerous capability” eval for steg.
I am actually just about to start building something very similar to this for the AISI’s evals bounty program.
As of today (26 Jan) this isn’t reproducing for me. It’s possible that DeepSeek changed their model API to serve a distilled model.