I’ve come up with my own explanation for why this happens: https://www.lesswrong.com/posts/QpbdkECXAdLFThhGg/computational-superposition-in-a-toy-model-of-the-u-and#XOR_Circuits
In short, XOR representations are naturally learnt when a model is targeting some other boolean operation, because the same circuitry makes all boolean operations linearly representable. But XOR requires different probe weights than the identity direction does, so linear probes will still tend to learn the generalizable solution.
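
Here's a minimal sketch of the idea (my own illustration, not code from the linked post): a tiny ReLU MLP is trained only on AND(a, b), and afterwards a least-squares linear probe reads XOR(a, b) off the hidden layer, even though XOR was never a training target. The architecture, hidden width, and training details are arbitrary choices for demonstration.

```python
# Sketch: train on AND, then probe for XOR in the hidden activations.
import numpy as np

rng = np.random.default_rng(0)

# All four boolean inputs; AND is the only training target.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)

# One-hidden-layer ReLU network trained with plain gradient descent.
d_hidden = 8
W1 = rng.normal(scale=1.0, size=(2, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(scale=1.0, size=d_hidden)
b2 = 0.0
lr = 0.1

for _ in range(5000):
    h = np.maximum(X @ W1 + b1, 0.0)        # hidden activations
    pred = h @ w2 + b2                      # linear readout for AND
    err = pred - y_and                      # squared-error gradient
    grad_w2 = h.T @ err / len(X)
    grad_b2 = err.mean()
    dh = np.outer(err, w2) * (h > 0)        # backprop through ReLU
    grad_W1 = X.T @ dh / len(X)
    grad_b1 = dh.mean(axis=0)
    w2 -= lr * grad_w2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

# Linear probe: least-squares readout of XOR from the trained hidden layer.
H = np.maximum(X @ W1 + b1, 0.0)
H_aug = np.hstack([H, np.ones((4, 1))])     # append a bias column
probe, *_ = np.linalg.lstsq(H_aug, y_xor, rcond=None)
xor_pred = (H_aug @ probe > 0.5).astype(int)

print("AND predictions:", np.round(H @ w2 + b2, 2))
print("XOR probe predictions:", xor_pred, "targets:", y_xor.astype(int))
```

The point the sketch makes is just the one above: the nonlinear circuitry built for AND leaves the hidden layer expressive enough that XOR is also linearly decodable, without XOR ever being optimised for.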