Thanks for these experiments, really interesting. I am however a bit perplexed about this part of the “background”:
Modern LLMs have vocabulary size v on the order of tens of thousands, much larger than their hidden size d on the order of thousands. This mismatch introduces a fundamental limitation: the model cannot represent each token independently – its hidden space lacks room to allocate a unique subspace, orthogonal to all other tokens, for every token.
While it’s true that you can’t have more than d perfectly orthogonal vectors, in high-dimensional spaces like those used in LLMs the cosine similarity between random vectors concentrates towards 0 — one manifestation of the curse of dimensionality. If the token representations were spread out roughly homogeneously, their “overlap” (the projection of one onto another) would be extremely small. So I don’t think what you are seeing is a limitation of the embedding space dimensionality per se; rather, it may come from how training ends up distributing the representations within the d-dimensional space. It would therefore be good (as suggested in Fabien Roger’s comment) to empirically compute the cosine similarity between the ‘owl’ and ‘o87’ token embeddings. [There may be existing work analysing the geometry of LLM representation spaces, though I am not familiar with that literature.]
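To illustrate the concentration point: for unit vectors drawn uniformly at random in d dimensions, the expected absolute cosine similarity scales like sqrt(2/(pi*d)), so at d = 4096 even thousands of random vectors are all nearly orthogonal. A minimal numpy sketch (the hidden size and sample count are illustrative, not taken from any particular model; checking the actual ‘owl’ vs ‘o87’ embeddings would instead require loading the model’s embedding matrix and comparing those two rows the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096          # illustrative hidden size, on the order of real LLMs
n = 10_000        # number of random "token" directions

# Sample random directions on the unit sphere.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Cosine similarity of one fixed direction against all the others.
v = X[0]
cos = X[1:] @ v

print(f"mean |cos sim|: {np.abs(cos).mean():.4f}")  # close to sqrt(2/(pi*d)) ~ 0.0125
print(f"max  |cos sim|: {np.abs(cos).max():.4f}")
```

If the trained ‘owl’ and ‘o87’ embeddings show a cosine similarity well above this random-vector baseline, that would support the training-dynamics explanation over the dimensionality one.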