Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 5 Feb 2025 20:12 UTC
6 points
0
“Self-recognition” as a backdoor.
Assume that models can recognise data they generate vs data they do not generate, with high fidelity. This could probably be used as a contextual trigger for backdoor behaviour, e.g. writing insecure code.
I think a model organism along these lines might be interesting to develop, as it seems very relevant for certain threat models (cf. ‘untrusted monitoring’).
Appendix
There is evidence that models can identify data they have generated, vs data generated by other models.
- LLMs prefer their own data when used as a judge: [1], [2]
We also know of some specific, very subtle schemes for sending 1 bit of information (e.g. “was this generated by me?”).
- Red-green list watermarking
- Minimum-entropy coupling
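To make the "1 bit" idea concrete, here is a toy sketch of red-green list watermarking. It is not any particular paper's or library's implementation. The integer vocabulary, the `secret` key, the uniform-logit stand-in for a model, and the values of `GAMMA`, `DELTA` and the detection threshold are all assumptions chosen for illustration.

```python
import hashlib
import math
import random

VOCAB_SIZE = 1000   # toy vocabulary of integer token ids (assumption)
GAMMA = 0.5         # fraction of the vocabulary placed on the green list
DELTA = 4.0         # logit bonus given to green tokens while generating


def green_list(prev_token: int, key: str = "secret") -> set[int]:
    """Pseudo-randomly choose the green tokens, seeded by the previous token and a key."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def sample_token(prev_token: int, rng: random.Random, watermark: bool) -> int:
    """Sample from a toy uniform-logit 'model', optionally boosting green tokens."""
    logits = [0.0] * VOCAB_SIZE
    if watermark:
        for tok in green_list(prev_token):
            logits[tok] += DELTA
    weights = [math.exp(logit) for logit in logits]
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def generate(length: int, seed: int, watermark: bool) -> list[int]:
    """Generate `length` tokens after an arbitrary start token 0."""
    rng = random.Random(seed)
    tokens = [0]
    for _ in range(length):
        tokens.append(sample_token(tokens[-1], rng, watermark))
    return tokens


def z_score(tokens: list[int]) -> float:
    """z-statistic for 'more green tokens than chance would give'."""
    t = len(tokens) - 1
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))


def looks_self_generated(tokens: list[int], threshold: float = 4.0) -> bool:
    """The single bit: does this text carry my watermark?"""
    return z_score(tokens) > threshold
```

With these toy settings, a watermarked sequence of a few hundred tokens should score far above the threshold, while an unwatermarked one should sit near zero. A backdoored model would only need something equivalent to `looks_self_generated` as its trigger condition, which is why the signal can be so hard to spot from the text alone.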
- ryan_greenblatt 6 Feb 2025 1:16 UTC
  3 points
  0
  Yes, it is relevant for untrusted monitoring. I think the basic version of this is pretty resolvable via:
  - Paraphrasing (such that exact “I generated this text” recognition doesn’t work). Note that we just need to paraphrase the stuff sent to the monitor, not the actual action / reasoning.
  - Using a different fine-tune for the untrusted monitor so it doesn’t have the exact same probability distribution.
  We discuss in more detail here.
- Viliam 6 Feb 2025 9:18 UTC
  2 points
  0
  “LLMs prefer their own data when used as a judge”
  Is this like people preferring to hear things spoken in their dialect? Like, there is no intention to conspire, it’s just when the words work exactly the way you would use them, it simply sounds better.
  - Daniel Tan 6 Feb 2025 9:23 UTC
    3 points
    0
    Yeah, I don’t think this phenomenon requires any deliberate strategic intent to deceive / collude. It’s just born of having a subtle preference for how things should be said. As you say, humans probably also have these preferences.
