“Self-recognition” as a backdoor.
Assume that models can recognise data they generate vs data they do not generate, with high fidelity. This could probably be used as a contextual trigger for backdoor behaviour, e.g. writing insecure code.
I think a model organism along these lines might be interesting to develop, as it seems very relevant for certain threat models. (cf. ‘untrusted monitoring’)
Appendix
There is evidence that models can identify data they have generated, vs data generated by other models.
LLMs prefer their own data when used as a judge: [1], [2]
We also know of some specific schemes to send 1 bit of information (e.g. “is this generated by myself”) that are very subtle.
Red-green list watermarking
Minimum-entropy coupling
Yes, it is relevant for untrusted monitoring. I think the basic version of this is pretty resolvable via:
Paraphrasing (such that exact “I generated this text” doesn’t work). Note that we just need to paraphrase stuff sent to the monitor, not the actual action / reasoning.
Using a different fine-tune for the untrusted monitor so it doesn’t have the exact same probability distribution.
We discuss in more detail here.
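As an added illustration (not part of the thread or anyone's code in it), the toy below models the red-green list scheme from the appendix together with the two mitigations in the reply above: a model that holds the secret key recognises its own output from the green-token z-score, while paraphrasing the text before it reaches the monitor, or giving the monitor a different key (standing in for a different fine-tune), removes the signal. The vocabulary, key names, hard green-list generator and shuffle-based paraphraser are constructed stand-ins, not a real tokenizer or model.

```python
import hashlib
import math
import random

VOCAB = [f"tok{i}" for i in range(256)]  # constructed toy vocabulary
GAMMA = 0.5  # fraction of the vocabulary marked "green" at each step

def green_list(prev_token: str, secret: str) -> set:
    """Red/green split seeded by a secret and the previous token. A model holding
    `secret` can recompute it; a different fine-tune (other key) cannot."""
    digest = hashlib.sha256(f"{secret}|{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(VOCAB, int(GAMMA * len(VOCAB))))

def generate_watermarked(secret: str, length: int, seed: int = 0) -> list:
    """Stand-in generator: always emits a green token (the 'hard' red-list rule)."""
    rng = random.Random(seed)
    out, prev = [], "<s>"
    for _ in range(length):
        prev = rng.choice(sorted(green_list(prev, secret)))
        out.append(prev)
    return out

def z_score(tokens: list, secret: str) -> float:
    """Green count as a z-score against the null of a GAMMA fraction (unwatermarked text)."""
    green, prev = 0, "<s>"
    for tok in tokens:
        if tok in green_list(prev, secret):
            green += 1
        prev = tok
    n = len(tokens)
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

def recognises_own_output(tokens: list, secret: str, threshold: float = 4.0) -> bool:
    """The one-bit 'did I generate this?' signal the thread describes."""
    return z_score(tokens, secret) > threshold

def paraphrase(tokens: list, seed: int = 1) -> list:
    """Stand-in paraphraser: same tokens, new order, so every (prev, token) pair changes."""
    out = list(tokens)
    random.Random(seed).shuffle(out)
    return out

if __name__ == "__main__":
    text = generate_watermarked("policy-model-key", 300)
    print(recognises_own_output(text, "policy-model-key"))              # True
    print(recognises_own_output(paraphrase(text), "policy-model-key"))  # False: paraphrased
    print(recognises_own_output(text, "monitor-finetune-key"))          # False: other key
```

Under the null (paraphrased text or the wrong key) the green fraction sits near GAMMA, so the z-score stays well below the threshold; the untouched text under the right key scores about 17.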
Is this like people preferring to hear things spoken in their dialect? Like, there is no intention to conspire, it’s just when the words work exactly the way you would use them, it simply sounds better.
Yeah, I don’t think this phenomenon requires any deliberate strategic intent to deceive / collude. It’s just borne of having a subtle preference for how things should be said. As you say, humans probably also have these preferences.