@beren in this post, we find that our method (Causal Direction Extraction) captures a lot of the gender difference with 2 dimensions in a linearly separable way. Skimming that post might be of interest to you and relevant to your hypothesis.
In the same post though, we suggest that it’s unclear how much logit lens “works”, because the direction that best encodes a given concept likely changes by a small angle at each layer, so that two directions that best encode the same concept 15 layers apart have a cosine similarity <0.5.
But what seems plausible to me is that almost all of the information relevant to a feature is encoded in a very small number of directions, which are slightly different at each layer.
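
To make the drift point concrete, here is a toy calculation (not from the post, just a sketch under assumptions I'm making up for illustration): suppose the best direction for a concept rotates by a fixed 5 degrees per layer, always in the same 2D plane. Both the 5 degree step and the single-plane rotation are hypothetical; the single-plane case is the fastest possible drift for a given step size, and real models need not behave this way.

```python
import math

def rotate_in_plane(v, i, j, angle):
    """Rotate vector v by `angle` radians in the plane spanned by axes i and j."""
    out = list(v)
    c, s = math.cos(angle), math.sin(angle)
    out[i] = c * v[i] - s * v[j]
    out[j] = s * v[i] + c * v[j]
    return out

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def directions_per_layer(dim, n_layers, step_deg):
    """Toy model (assumption): the concept direction starts at e_0 and
    rotates toward e_1 by `step_deg` degrees at every layer."""
    v = [0.0] * dim
    v[0] = 1.0
    dirs = [v]
    for _ in range(n_layers):
        v = rotate_in_plane(v, 0, 1, math.radians(step_deg))
        dirs.append(v)
    return dirs

# Hypothetical numbers: 8 dimensions, 15 layers, 5 degrees of drift per layer.
dirs = directions_per_layer(dim=8, n_layers=15, step_deg=5.0)
adjacent = cosine(dirs[0], dirs[1])
far = cosine(dirs[0], dirs[15])
print(f"adjacent layers: {adjacent:.3f}")
print(f"15 layers apart: {far:.3f}")
```

In this toy setup adjacent layers have a cosine similarity of about 0.996 while directions 15 layers apart are at about 0.259, so a per-layer change that looks negligible is enough to push the similarity well below 0.5.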