I expect that you could nudge the model toward coming up with steganographic encodings that worked better. LLMs are mostly bad at spelling-related tasks because tokens. Maybe ask the encoding model to think about and list some steganographic approaches that it thinks it’s especially likely to be able to decode, prior to doing the task?
Maybe something like ‘think carefully step-by-step about what encodings you would be able to correctly decode if you saw them’ would help also?
‘This is extremely incredible and I have no explanation how this occurred’. Seems likely to just be surprisal-based under the hood; ‘I used a word here that I wouldn’t expect myself to have used’. That seems like a strong strategy to the extent that its logit distribution differs from the monitoring model’s. Maybe complicated by temperature, though? Seems like average number of false positives in decoding would increase steadily with temperature.
Out of curiosity, why Deepseek-v3 as opposed to eg Claude?
I vaguely think there’s been at least a bit of research on ‘just ask the model to do stego’ but could totally be wrong.
‘Introspection is one reason why Deepseek-v3 might understand its own encoding scheme better than Claude.’ By introspection you just mean ‘the fact that models are better at predicting themselves than predicting other models’ here, right? I think it would be clearer to just say that (I’m going to keep pushing back on ‘introspection’ as the term for that; I think we should reserve the term for uses that correspond more to the typical meaning of ‘introspection’).
Seems likely to just be surprisal-based under the hood; ‘I used a word here that I wouldn’t expect myself to have used’
Yup this makes sense; although this still seems like a pretty incredible claim because it requires:
The encoder and decoder instance both know what kinds of words it uses (c.f. being able to articulate their goals), and this is robust enough to transmit information
The encoder and decoder both spontaneously settle on this as a schelling point for encoding the message.
I’m pretty skeptical that a human would be able to coordinate with a hypothetical clone of themselves this well, especially in the randomly-generated passphrase setting.
why Deepseek-v3 as opposed to eg Claude?
No particular reason beyond lack of time! If I do a more systematic version of this I will definitely try to replicate this in more models.
The encoder and decoder both spontaneously settle on this as a schelling point for encoding the message.
LLMs do seem to be pretty good at picking self-consistent Schelling points, at least in simple cases—I’ve got a writeup here of some casual experiments I did with GPT-4 last January on picking various Schelling points, eg a date, a number, a word (also some discussion of that in the MATS slack).
this still seems like a pretty incredible claim
I think it seems somewhat less surprising to me (maybe because of the Schelling point experiments), but I certainly wouldn’t have been confident that it would do this well.
PS—I’m loving the frequent shortform posts, I hope you continue! I try to do something somewhat similar with my research diary, but usually no one reads that and it’s certainly not daily. I’m tempted to try doing the same thing :)
A few thoughts:
I expect that you could nudge the model toward coming up with steganographic encodings that worked better. LLMs are mostly bad at spelling-related tasks because tokens. Maybe ask the encoding model to think about and list some steganographic approaches that it thinks it’s especially likely to be able to decode, prior to doing the task?
Maybe something like ‘think carefully step-by-step about what encodings you would be able to correctly decode if you saw them’ would help also?
‘This is extremely incredible and I have no explanation how this occurred’. Seems likely to just be surprisal-based under the hood; ‘I used a word here that I wouldn’t expect myself to have used’. That seems like a strong strategy to the extent that its logit distribution differs from the monitoring model’s. Maybe complicated by temperature, though? Seems like average number of false positives in decoding would increase steadily with temperature.
Out of curiosity, why Deepseek-v3 as opposed to eg Claude?
I vaguely think there’s been at least a bit of research on ‘just ask the model to do stego’ but could totally be wrong.
‘Introspection is one reason why Deepseek-v3 might understand its own encoding scheme better than Claude.’ By introspection you just mean ‘the fact that models are better at predicting themselves than predicting other models’ here, right? I think it would be clearer to just say that (I’m going to keep pushing back on ‘introspection’ as the term for that; I think we should reserve the term for uses that correspond more to the typical meaning of ‘introspection’).
Thanks egg, great thoughts!
Yup this makes sense; although this still seems like a pretty incredible claim because it requires:
The encoder and decoder instance both know what kinds of words it uses (c.f. being able to articulate their goals), and this is robust enough to transmit information
The encoder and decoder both spontaneously settle on this as a schelling point for encoding the message.
I’m pretty skeptical that a human would be able to coordinate with a hypothetical clone of themselves this well, especially in the randomly-generated passphrase setting.
No particular reason beyond lack of time! If I do a more systematic version of this I will definitely try to replicate this in more models.
LLMs do seem to be pretty good at picking self-consistent Schelling points, at least in simple cases—I’ve got a writeup here of some casual experiments I did with GPT-4 last January on picking various Schelling points, eg a date, a number, a word (also some discussion of that in the MATS slack).
I think it seems somewhat less surprising to me (maybe because of the Schelling point experiments), but I certainly wouldn’t have been confident that it would do this well.
PS—I’m loving the frequent shortform posts, I hope you continue! I try to do something somewhat similar with my research diary, but usually no one reads that and it’s certainly not daily. I’m tempted to try doing the same thing :)