Rohin Shah comments on What’s up with LLMs representing XORs of arbitrary features?

Rohin Shah 12 Jan 2024 17:20 UTC
I think you mostly need to hope that it doesn’t matter (because the crazy XOR directions aren’t too salient) or come up with some new idea.
Yeah certainly I’d expect the crazy XOR directions aren’t too salient.
I’ll note that if it turns out that these XOR directions don’t matter for generalization in practice, then I start to feel better about CCS (along with other linear probing techniques). I know that for CCS you’re more worried about issues around correlations with features like true_according_to_Alice, but my feeling is that we might be able to handle spurious features that are that crazy and numerous, but not spurious features as crazy and numerous as these XORs.
Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
I’m not against linear probing techniques in general. I like linear probes, they seem like a very useful tool. I also like contrast pairs. But I would basically always use these techniques in a supervised way, because I don’t see a great reason to expect unsupervised methods to work better.
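To make concrete what it means for an XOR to be "linearly represented", here is a minimal sketch of a supervised linear probe on toy data. Everything in it is an assumption for illustration: the four "activations" are synthetic binary vectors rather than real LLM activations, and the names `train_linear_probe` and `probe_accuracy` are hypothetical helpers, not part of CCS or any probing library. It trains a plain logistic-regression probe to read out `a XOR b`, once from a representation that only encodes `a` and `b`, and once from a representation that also has a dedicated XOR coordinate.

```python
import math

def train_linear_probe(inputs, labels, steps=500, lr=0.5):
    """Fit logistic regression with full-batch gradient descent, starting from zero weights."""
    dim = len(inputs[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(inputs)
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(inputs, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-logit))
            err = p - y  # gradient of the log loss with respect to the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def probe_accuracy(weights, bias, inputs, labels):
    """Fraction of points where the sign of the logit matches the label."""
    correct = 0
    for x, y in zip(inputs, labels):
        logit = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((logit > 0) == bool(y))
    return correct / len(inputs)

# Toy stand-ins for two binary features (e.g. has_true, has_banana).
pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
labels = [a ^ b for a, b in pairs]

# Representation 1: only the two features are linearly encoded.
plain_inputs = [(float(a), float(b)) for a, b in pairs]
# Representation 2: an extra coordinate carries the XOR itself.
xor_inputs = [(float(a), float(b), float(a ^ b)) for a, b in pairs]

w_plain, b_plain = train_linear_probe(plain_inputs, labels)
acc_plain = probe_accuracy(w_plain, b_plain, plain_inputs, labels)

w_xor, b_xor = train_linear_probe(xor_inputs, labels)
acc_with_xor = probe_accuracy(w_xor, b_xor, xor_inputs, labels)

print("accuracy without XOR direction:", acc_plain)
print("accuracy with XOR direction:", acc_with_xor)
```

The first probe cannot reach perfect accuracy, because XOR is not linearly separable in the two original features. The second can, because the label is available along a single coordinate. That is the sense in which finding a working linear probe for an XOR suggests the model is representing the XOR itself along some direction.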
If I had to articulate my reason for being surprised here, it’d be something like:
1. I didn’t expect LLMs to compute many XORs incidentally
2. I didn’t expect LLMs to compute many XORs because they are useful
but lots of XORs seem to get computed anyway.
This is reasonable. My disagreement is mostly that I think LLMs are complicated things and do lots of incidental stuff we don’t yet understand. So I shouldn’t feel too surprised by any given observation that could be explained by an incidental hypothesis. But idk it doesn’t seem like an important point.
- Sam Marks 12 Jan 2024 19:07 UTC
  Imo “true according to Alice” is nowhere near as “crazy” a feature as “has_true XOR has_banana”. It seems useful for the LLM to model what is true according to Alice! (Possibly I’m misunderstanding what you mean by “crazy” here.)
  I agree with this! (And it’s what I was trying to say; sorry if I was unclear.) My point is that
  { features which are as crazy as “true according to Alice” (i.e., not too crazy)}
  seems potentially manageable, whereas
  { features which are as crazy as arbitrary boolean functions of other features }
  seems totally unmanageable.
  Thanks, as always, for the thoughtful replies.