If there is some internal gradient-descent reason why it's easier to learn to read off orthogonal vectors, then I take it back. I feel like I'm being too pedantic here, in any case.
An intuition pump: Imagine the case of two scalar features $c_1, c_2$ being embedded along vectors $f_1, f_2$. If you consider a series that starts with $f_1, f_2$ being orthogonal, then gives them ever higher cosine similarity, I'd expect the network to have ever more trouble learning to read out $c_1, c_2$, until we hit $f_1 = f_2$, at which point the network definitely cannot learn to read the features out at all. I don't know how the learning difficulty behaves over this series exactly, but it sure seems to me like it ought to go up monotonically at least.
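Here's a toy numpy sketch of what I mean (my own illustration: plain gradient descent on a noiseless linear read-out, with "learning difficulty" proxied by the loss left after a fixed step budget; the learning rate, step count, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def readout_loss_after_training(cos_sim, steps=500, lr=0.1, n=1024):
    """Embed scalar features c1, c2 along unit vectors f1, f2 with the given
    cosine similarity, train a linear read-out W by plain gradient descent,
    and report the MSE remaining after a fixed number of steps."""
    f1 = np.array([1.0, 0.0])
    f2 = np.array([cos_sim, np.sqrt(1.0 - cos_sim**2)])
    C = rng.normal(size=(n, 2))          # true feature values (c1, c2)
    E = C @ np.stack([f1, f2])           # embeddings e = c1*f1 + c2*f2
    W = np.zeros((2, 2))                 # linear read-out, initialised at zero
    for _ in range(steps):
        pred = E @ W.T                   # predicted (c1, c2)
        grad = 2.0 / n * (pred - C).T @ E
        W -= lr * grad
    return np.mean((E @ W.T - C) ** 2)

for cs in [0.0, 0.9, 0.99, 0.999]:
    print(f"cos_sim={cs}:  MSE after training = {readout_loss_after_training(cs):.2e}")
```

The slow mode of gradient descent here decays at a rate set by $1 - \cos$, so the residual loss after a fixed budget climbs as the cosine similarity approaches one, matching the monotonic picture above.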
Another intuition pump: The higher the cosine similarity between the features, the larger the norm of the rows of $V^{-1}$ will be (where $V$ is the matrix whose columns are the feature vectors $f_i$, so the rows of $V^{-1}$ are the read-out directions), with norm infinity in the limit of cosine similarity going to one.
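Concretely, a quick numpy check (assuming, as above, that $V$ has $f_1, f_2$ as columns):

```python
import numpy as np

# Sweep the cosine similarity between two unit feature vectors toward 1
# and watch the row norms of V^{-1} (the read-out map) blow up.
for cos_sim in [0.0, 0.5, 0.9, 0.99, 0.999]:
    f1 = np.array([1.0, 0.0])
    f2 = np.array([cos_sim, np.sqrt(1 - cos_sim**2)])  # unit vector at the given angle
    V = np.column_stack([f1, f2])        # columns are the feature directions
    V_inv = np.linalg.inv(V)             # rows of V^{-1} read the features back out
    max_row_norm = np.linalg.norm(V_inv, axis=1).max()
    print(f"cos_sim={cos_sim:.3f}  max row norm of V^-1 = {max_row_norm:.1f}")
```

For this 2D case the row norms work out to exactly $1/\sin\theta$, where $\cos\theta$ is the cosine similarity, so they diverge as the features align.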
I agree that at cosine similarity $O(1/\sqrt{1000})$, it's very unlikely to be a big deal yet.
Yeah that makes sense, thanks.