Being linearly independent is sufficient in this case to read off each x_i with zero interference. Rank-one matrices are equivalent to (linear functional) * vector, so we just pick the dual basis as our linear functionals and extend them to the whole space.
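Spelling that out, just to fix notation (this is the standard dual-basis fact): for linearly independent v_i, the dual functionals g_i are defined by

$$
g_i(v_j) = \delta_{ij}, \qquad\text{so}\qquad g_i\Big(\sum_j x_j v_j\Big) = x_i ,
$$

and a rank-one component of the form (output vector) · g_i therefore reads off x_i and nothing else from the span of the v_j.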
If you have 100 orthogonal linear probes to read with, yes. But since there are only 50 neurons, the actual circuits for different input features in the network will have interference to deal with.
Edit: To be clear, of course the network itself cannot perform the task precisely. I’m simply claiming that you can precisely mimic the behaviour of W_in with 100 rank-1 components, by just reading off the basis, as SPD does in this case. The fact that the v_i themselves are not orthogonal is irrelevant.
To be concrete: if we have 100 linearly independent vectors v_i, we can extend them to a basis of the whole 1000-dimensional space. Let V be the change of basis matrix from the standard basis to this basis (so the rows of V are the dual-basis functionals, and the columns of V^{-1} are the v_i). Then we can write W_in = (W_in V^{-1})V, where we are free to choose how to split W_in V^{-1} into rank-one pieces.
If we write W_in V^{-1} as a sum of rank-one matrices W_c, then the W_c V will sum to W_in, and each W_c V is still rank one, since the image of W_c V equals the image of W_c (V being invertible).
So we can assume WLOG that our v_i lie along the standard basis, i.e. that they are orthogonal with respect to the standard inner product.
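As a quick numerical sketch of the construction above (NumPy, with the dimensions from this thread; the random W_in and random feature vectors are stand-ins, not any actual trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_feats = 1000, 50, 100              # dimensions from the thread

W_in = rng.normal(size=(d_hidden, d_in))             # stand-in for the matrix to decompose
feats = rng.normal(size=(d_in, n_feats))             # 100 linearly independent (non-orthogonal) v_i
B = np.hstack([feats, rng.normal(size=(d_in, d_in - n_feats))])  # extend to a basis of R^1000
V = np.linalg.inv(B)                                 # rows of V = dual functionals g_i

def component(i):
    # rank-one component: (W_in v_i) outer g_i
    return np.outer(W_in @ B[:, i], V[i])

# The 1000 components sum exactly to W_in.
assert np.allclose(sum(component(i) for i in range(d_in)), W_in, atol=1e-6)

# Zero-interference readout: on an input built from the first 100 features,
# component i responds only to coefficient x_i, even though the v_i aren't orthogonal.
x_coeffs = rng.normal(size=n_feats)
x = feats @ x_coeffs
for i in range(5):                                   # spot-check a few components
    assert np.allclose(component(i) @ x, x_coeffs[i] * (W_in @ feats[:, i]))
```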
As a more productive question: say we had an LLM which, amongst other things, does the following. Whenever a known bigram is encoded in the residual stream in the form t_1 A + t_2 B (corresponding to the known bigram t_1 t_2), potentially with interference from other aspects, an MLP layer outputs a consistent vector v_{t_1 t_2} into the residual stream. This is how GPT-2 encodes known bigrams, hence the relevance.
And say that there are quadratically many known bigrams as a function of hidden neuron size, so that in particular there are more bigrams than the residual stream dimension. As far as I know, an appropriately randomly initialized network should be able to accomplish this task (or at least with W_in random).
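To have something concrete to reason about, here is a minimal sketch of that toy setup. Everything in it is a made-up stand-in (the sizes, the token direction matrices A and B, the target vectors v_{t1t2}): freeze a random W_in, fit only the read-out by least squares, and see how well the bigram table is reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d_resid, d_mlp, n_vocab = 64, 256, 40                        # hypothetical sizes
bigrams = [(t1, t2) for t1 in range(n_vocab) for t2 in range(n_vocab)]  # 1600 bigrams > d_resid, d_mlp

A = rng.normal(size=(n_vocab, d_resid)) / np.sqrt(d_resid)   # "previous token" directions
B = rng.normal(size=(n_vocab, d_resid)) / np.sqrt(d_resid)   # "current token" directions
targets = rng.normal(size=(len(bigrams), d_resid)) / np.sqrt(d_resid)  # desired outputs v_{t1 t2}

W_in = rng.normal(size=(d_mlp, d_resid)) / np.sqrt(d_resid)  # frozen random read-in
X = np.stack([A[t1] + B[t2] for t1, t2 in bigrams])          # residual stream input per bigram
H = np.maximum(X @ W_in.T, 0.0)                              # ReLU hidden activations

# Fit only the read-out; how well this works with W_in random is part of
# the question being asked here.
W_out, *_ = np.linalg.lstsq(H, targets, rcond=None)
err = np.linalg.norm(H @ W_out - targets) / np.linalg.norm(targets)
print(f"relative error of the bigram lookup: {err:.3f}")
```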
Is the goal for SPD to learn components for W_in such that any given component only fires non-negligibly on a single bigram? Or is it ok if components fire on multiple different bigrams? I am trying to reason through how SPD would act in this case.
I’d have to think about the exact setup here to make sure there’s no weird caveats, but my first thought is that for W_in, this ought to be one component per bigram, firing exclusively for that bigram.
Sure, yes, that’s right. But I still wouldn’t take this to be equivalent to our v_i literally being orthogonal, because the trained network itself might not perfectly learn this transformation.
If there is some internal gradient descent reason for it being easier to learn to read off orthogonal vectors, then I take it back. I feel like I am being too pedantic here, in any case.
An intuition pump: Imagine the case of two scalar features c_1, c_2 being embedded along vectors f_1, f_2. If you consider a series that starts with f_1, f_2 being orthogonal, then gives them ever higher cosine similarity, I’d expect the network to have ever more trouble learning to read out c_1, c_2, until we hit f_1 = f_2, at which point the network definitely cannot learn to read the features out at all. I don’t know how the learning difficulty behaves over this series exactly, but it sure seems to me like it ought to go up monotonically at least.
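One way to probe how the difficulty behaves over that series, as a toy experiment (the dimensions, learning rate, and tolerance below are arbitrary choices, not a claim about any particular training setup): count how many gradient steps a linear read-out needs to recover c_1, c_2 at a given cosine similarity.

```python
import numpy as np

def steps_to_learn_readout(cos_sim, d=20, n=1024, lr=0.1, tol=1e-3, max_steps=100_000, seed=0):
    """Full-batch GD steps needed to learn to read (c_1, c_2) back out of
    x = c_1 f_1 + c_2 f_2, for unit features f_1, f_2 at the given cosine similarity."""
    rng = np.random.default_rng(seed)
    f1 = np.zeros(d)
    f1[0] = 1.0
    f2 = np.zeros(d)
    f2[0], f2[1] = cos_sim, np.sqrt(1.0 - cos_sim**2)
    F = np.stack([f1, f2])            # 2 x d embedding of the two features
    C = rng.normal(size=(n, 2))       # true feature values c_1, c_2
    X = C @ F                         # observed inputs
    W = np.zeros((d, 2))              # linear read-out to be learned
    for step in range(max_steps):
        err = X @ W - C
        if np.mean(err**2) < tol:
            return step
        W -= lr * X.T @ err / n       # plain full-batch gradient step
    return max_steps

for rho in [0.0, 0.5, 0.9, 0.99, 0.999]:
    print(rho, steps_to_learn_readout(rho))
```

For plain gradient descent on this quadratic the step count scales roughly like 1/(1 − ρ), since the smallest relevant eigenvalue of the input covariance is 1 − ρ, so at least in this toy version the difficulty does go up monotonically.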
Another intuition pump: The higher the cosine similarity between the features, the larger the norms of the dual readout vectors (the rows of V above) will be, with the norm going to infinity in the limit of cosine similarity going to one.
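To make that quantitative in the two-feature case: for unit-norm f_1, f_2 with cosine similarity ρ, the dual readout vectors g_i (defined by g_i · f_j = δ_ij) satisfy

$$
G = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad
g_i = \sum_j (G^{-1})_{ij}\, f_j, \qquad
\|g_i\|^2 = (G^{-1})_{ii} = \frac{1}{1-\rho^2} \;\xrightarrow{\;\rho \to 1\;}\; \infty .
$$

At ρ ≈ 1/√1000 this factor is about 1.001, consistent with the point below.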
I agree that at cosine similarity O(1/√1000), it’s very unlikely to be a big deal yet.
Yeah that makes sense, thanks.