If there is some internal gradient-descent reason why it's easier to learn to read off orthogonal vectors, then I take it back. I feel like I'm being too pedantic here, in any case.
An intuition pump: Imagine the case of two scalar features $c_1, c_2$ being embedded along vectors $f_1, f_2$. If you consider a series that starts with $f_1, f_2$ being orthogonal, then gives them ever higher cosine similarity, I'd expect the network to have ever more trouble learning to read out $c_1, c_2$, until we hit $f_1 = f_2$, at which point the network definitely cannot learn to read the features out at all. I don't know how the learning difficulty behaves over this series exactly, but it sure seems to me like it ought to go up monotonically at least.
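Here's a toy numpy sketch of what I mean (my own illustration: plain gradient descent on a noiseless linear read-out, with "learning difficulty" proxied by the loss left after a fixed step budget; the learning rate, step count, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def readout_loss_after_training(cos_sim, steps=500, lr=0.1, n=1024):
    """Embed scalar features c1, c2 along unit vectors f1, f2 with the given
    cosine similarity, train a linear read-out W by plain gradient descent,
    and report the MSE remaining after a fixed number of steps."""
    f1 = np.array([1.0, 0.0])
    f2 = np.array([cos_sim, np.sqrt(1.0 - cos_sim**2)])
    C = rng.normal(size=(n, 2))          # true feature values (c1, c2)
    E = C @ np.stack([f1, f2])           # embeddings e = c1*f1 + c2*f2
    W = np.zeros((2, 2))                 # linear read-out, initialised at zero
    for _ in range(steps):
        pred = E @ W.T                   # predicted (c1, c2)
        grad = 2.0 / n * (pred - C).T @ E
        W -= lr * grad
    return np.mean((E @ W.T - C) ** 2)

for cs in [0.0, 0.9, 0.99, 0.999]:
    print(f"cos_sim={cs}:  MSE after training = {readout_loss_after_training(cs):.2e}")
```

The slow mode of gradient descent here decays at a rate set by $1 - \cos$, so the residual loss after a fixed budget climbs as the cosine similarity approaches one, matching the monotonic picture above.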
Another intuition pump: The higher the cosine similarity between the features, the larger the norm of the rows of $V^{-1}$ will be (where $V$ is the matrix whose columns are the feature vectors $f_i$, so the rows of $V^{-1}$ are the read-out directions), with norm infinity in the limit of cosine similarity going to one.
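Concretely, a quick numpy check (assuming, as above, that $V$ has $f_1, f_2$ as columns):

```python
import numpy as np

# Sweep the cosine similarity between two unit feature vectors toward 1
# and watch the row norms of V^{-1} (the read-out map) blow up.
for cos_sim in [0.0, 0.5, 0.9, 0.99, 0.999]:
    f1 = np.array([1.0, 0.0])
    f2 = np.array([cos_sim, np.sqrt(1 - cos_sim**2)])  # unit vector at the given angle
    V = np.column_stack([f1, f2])        # columns are the feature directions
    V_inv = np.linalg.inv(V)             # rows of V^{-1} read the features back out
    max_row_norm = np.linalg.norm(V_inv, axis=1).max()
    print(f"cos_sim={cos_sim:.3f}  max row norm of V^-1 = {max_row_norm:.1f}")
```

For this 2D case the row norms work out to exactly $1/\sin\theta$, where $\cos\theta$ is the cosine similarity, so they diverge as the features align.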
I agree that at cosine similarity $O(1/\sqrt{1000})$, it's very unlikely to be a big deal yet.
Yeah that makes sense, thanks.