Sure, yes, that’s right. But I still wouldn’t take this to be equivalent to our $v_i$ literally being orthogonal, because the trained network itself might not perfectly learn this transformation.
If there is some internal gradient-descent reason why orthogonal vectors are easier to learn to read off, then I take it back. I feel like I am being too pedantic here, in any case.
An intuition pump: Imagine the case of two scalar features $c_1, c_2$ being embedded along vectors $f_1, f_2$. If you consider a series that starts with $f_1, f_2$ being orthogonal, then gives them ever higher cosine similarity, I’d expect the network to have ever more trouble learning to read out $c_1, c_2$, until we hit $f_1 = f_2$, at which point the network definitely cannot learn to read the features out at all. I don’t know exactly how the learning difficulty behaves over this series, but it sure seems to me like it ought to go up monotonically at least.
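To make that intuition pump concrete, here is a small numerical sketch of my own (not from the original discussion; the dimension, noise level, and the use of a plain least-squares readout as a stand-in for whatever readout the network would learn are all assumptions): embed two scalar features along unit directions with a chosen cosine similarity, add a little noise, and fit the best linear readout. The recovery error grows as the cosine similarity approaches one.

```python
import numpy as np

rng = np.random.default_rng(0)

def readout_error(cos_sim, d=50, n_samples=1000, noise=0.01):
    # Two unit feature directions f1, f2 with the requested cosine similarity.
    f1 = np.zeros(d); f1[0] = 1.0
    f2 = np.zeros(d); f2[0] = cos_sim; f2[1] = np.sqrt(1.0 - cos_sim**2)
    F = np.stack([f1, f2], axis=1)                 # d x 2 embedding matrix

    c = rng.normal(size=(n_samples, 2))            # true scalar features c1, c2
    x = c @ F.T + noise * rng.normal(size=(n_samples, d))  # noisy embedded activations

    # Best linear readout via least squares (an idealized linear probe).
    W, *_ = np.linalg.lstsq(x, c, rcond=None)
    c_hat = x @ W
    return np.mean((c_hat - c) ** 2)

for cos_sim in [0.0, 0.9, 0.99, 0.999, 0.9999]:
    print(f"cos_sim={cos_sim:<7} readout MSE={readout_error(cos_sim):.2e}")
```

Even this idealized readout degrades smoothly as the two directions collapse onto each other, because the noise along the direction that distinguishes them gets amplified more and more.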
Another intuition pump: The higher the cosine similarity between the features, the larger the norms of the rows of $V^{-1}$ will be, with the norm diverging to infinity in the limit of the cosine similarity going to one.
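Spelling out the two-feature case explicitly (a standard calculation, written out here for concreteness; I’m taking $V$ to be the matrix with the feature vectors as its columns): with unit vectors $f_1 = (1, 0)^\top$ and $f_2 = (\rho, \sqrt{1-\rho^2})^\top$, so that $f_1 \cdot f_2 = \rho$,

$$V = \begin{pmatrix} 1 & \rho \\ 0 & \sqrt{1-\rho^2} \end{pmatrix}, \qquad V^{-1} = \frac{1}{\sqrt{1-\rho^2}} \begin{pmatrix} \sqrt{1-\rho^2} & -\rho \\ 0 & 1 \end{pmatrix},$$

and both rows of $V^{-1}$ have norm $1/\sqrt{1-\rho^2}$, which diverges as $\rho \to 1$.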
I agree that at cosine similarity $O(1/\sqrt{1000})$, it’s very unlikely to be a big deal yet.
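(As a back-of-the-envelope check using the two-feature formula above: $\rho \approx 1/\sqrt{1000} \approx 0.032$ gives a row norm of $1/\sqrt{1-\rho^2} \approx 1.0005$, so the readout weights are barely amplified at that point.)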
Yeah that makes sense, thanks.