Sure, yes, that’s right. But I still wouldn’t take this to be equivalent to our $v_i$ literally being orthogonal, because the trained network itself might not perfectly learn this transformation.
If there is some internal gradient-descent reason why orthogonal vectors are easier to learn to read off, then I take it back. I feel like I am being too pedantic here, in any case.
An intuition pump: Imagine the case of two scalar features $c_1, c_2$ being embedded along vectors $f_1, f_2$. If you consider a series that starts with $f_1, f_2$ being orthogonal, then gives them ever higher cosine similarity, I’d expect the network to have ever more trouble learning to read out $c_1, c_2$, until we hit $f_1 = f_2$, at which point the network definitely cannot learn to read the features out at all. I don’t know exactly how the learning difficulty behaves over this series, but it sure seems to me like it ought to go up monotonically at least.
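To make that intuition pump concrete, here is a small numerical sketch of my own (not from the original discussion; the dimension, noise level, and the use of a plain least-squares readout as a stand-in for whatever readout the network would learn are all assumptions): embed two scalar features along unit directions with a chosen cosine similarity, add a little noise, and fit the best linear readout. The recovery error grows as the cosine similarity approaches one.

```python
import numpy as np

rng = np.random.default_rng(0)

def readout_error(cos_sim, d=50, n_samples=1000, noise=0.01):
    # Two unit feature directions f1, f2 with the requested cosine similarity.
    f1 = np.zeros(d); f1[0] = 1.0
    f2 = np.zeros(d); f2[0] = cos_sim; f2[1] = np.sqrt(1.0 - cos_sim**2)
    F = np.stack([f1, f2], axis=1)                 # d x 2 embedding matrix

    c = rng.normal(size=(n_samples, 2))            # true scalar features c1, c2
    x = c @ F.T + noise * rng.normal(size=(n_samples, d))  # noisy embedded activations

    # Best linear readout via least squares (an idealized linear probe).
    W, *_ = np.linalg.lstsq(x, c, rcond=None)
    c_hat = x @ W
    return np.mean((c_hat - c) ** 2)

for cos_sim in [0.0, 0.9, 0.99, 0.999, 0.9999]:
    print(f"cos_sim={cos_sim:<7} readout MSE={readout_error(cos_sim):.2e}")
```

Even this idealized readout degrades smoothly as the two directions collapse onto each other, because the noise along the direction that distinguishes them gets amplified more and more.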
Another intuition pump: The higher the cosine similarity between the features, the larger the norms of the rows of $V^{-1}$ will be, with the norm diverging to infinity in the limit of the cosine similarity going to one.
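Spelling out the two-feature case explicitly (a standard calculation, written out here for concreteness; I’m taking $V$ to be the matrix with the feature vectors as its columns): with unit vectors $f_1 = (1, 0)^\top$ and $f_2 = (\rho, \sqrt{1-\rho^2})^\top$, so that $f_1 \cdot f_2 = \rho$,

$$V = \begin{pmatrix} 1 & \rho \\ 0 & \sqrt{1-\rho^2} \end{pmatrix}, \qquad V^{-1} = \frac{1}{\sqrt{1-\rho^2}} \begin{pmatrix} \sqrt{1-\rho^2} & -\rho \\ 0 & 1 \end{pmatrix},$$

and both rows of $V^{-1}$ have norm $1/\sqrt{1-\rho^2}$, which diverges as $\rho \to 1$.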
I agree that at cosine similarity $O(1/\sqrt{1000})$, it’s very unlikely to be a big deal yet.
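(As a back-of-the-envelope check using the two-feature formula above: $\rho \approx 1/\sqrt{1000} \approx 0.032$ gives a row norm of $1/\sqrt{1-\rho^2} \approx 1.0005$, so the readout weights are barely amplified at that point.)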
Yeah that makes sense, thanks.