Maybe a⊕b is represented “incidentally” because NN representations are high-dimensional with lots of stuff represented by chance
This would be my first guess, conditioned on the observation being real, except strike “by chance”. The model likely wants to form representations that can serve to solve a very wide class of prediction tasks over the data using as few non-linearities as possible, ideally none, so that a linear probe suffices. That’s pretty much the hallmark of a good general representation you can use for many tasks.
I thus don’t think that comparing to a model with randomized weights is a good falsification. I wouldn’t expect a randomly initialized model to have nice general representations.
My stated hypothesis would then predict that the linear probes for XOR features get progressively worse when applied to earlier layers, because the model hasn’t yet had time to make the representation that general early in the computation. So accuracy should start to drop as you look at layers before layer fourteen.
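To make the prediction concrete, here is a minimal synthetic sketch of the layer-probing logic. I don’t have the actual model activations, so the “late layer” and “early layer” below are hypothetical stand-ins: the late one carries a dedicated a·b direction (making XOR linearly decodable), the early one carries only a and b. The names and construction are my assumptions, not anything from the original experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64

# Two binary features a, b and their XOR label.
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
y = a ^ b

# Hypothetical "late layer": besides a and b, a dedicated direction
# carries a*b, which makes XOR linearly decodable.
W = rng.normal(size=(3, d))
late = np.stack([a, b, a * b], axis=1) @ W + 0.1 * rng.normal(size=(n, d))

# Hypothetical "early layer": only a and b are (noisily) represented,
# so no linear readout of it can compute XOR.
early = np.stack([a, b], axis=1) @ W[:2] + 0.1 * rng.normal(size=(n, d))

def probe_acc(X, y):
    # Fit a linear probe on a train split, score on a held-out split.
    Xtr, Xte, ytr, yte = X[:1500], X[1500:], y[:1500], y[1500:]
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return clf.score(Xte, yte)

acc_late = probe_acc(late, y)
acc_early = probe_acc(early, y)
print("late layer probe accuracy:", acc_late)
print("early layer probe accuracy:", acc_early)
```

If the hypothesis holds, real activations should interpolate between these two regimes: near-chance XOR probe accuracy at early layers, rising toward the layer where the representation becomes general.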
I’ll also say that if you can figure out a pattern in how particular directions get used as components across many different boolean classification tasks, that seems like the kind of thing that could sharpen our understanding of what exactly these directions encode. What does the layer representation contain, in actual practice, that allows it to do this?
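One way to look for such a pattern, sketched here on the same kind of synthetic stand-in representation as above (the construction is my assumption, not the original setup): fit linear probes for several boolean functions of a and b, then compare the probes’ weight directions by cosine similarity to see which directions are shared across tasks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d = 2000, 64
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)

# Hypothetical representation carrying a, b, and a*b along fixed directions.
W = rng.normal(size=(3, d))
X = np.stack([a, b, a * b], axis=1) @ W + 0.1 * rng.normal(size=(n, d))

labels = {"AND": a & b, "OR": a | b, "XOR": a ^ b}
dirs, accs = {}, {}
for name, y in labels.items():
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    accs[name] = clf.score(X, y)
    w = clf.coef_[0]
    dirs[name] = w / np.linalg.norm(w)  # unit probe direction

# Pairwise overlap of probe directions across boolean tasks.
for t1 in sorted(labels):
    for t2 in sorted(labels):
        if t1 < t2:
            print(t1, t2, round(float(dirs[t1] @ dirs[t2]), 2))
```

High overlaps would suggest the tasks reuse a small set of shared components, which is exactly the kind of structure you’d want to interpret further.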