Alex Gibson comments on [Paper] Stochastic Parameter Decomposition

Alex Gibson 30 Jun 2025 6:22 UTC
3 points
0
As a more productive question: say we had an LLM that, amongst other things, does the following. Whenever a known bigram $t_{1} t_{2}$ is encoded in the residual stream in the form $t_{1} A + t_{2} B$, potentially with interference from other aspects of the residual stream, an MLP layer writes a consistent vector $v_{t_{1} t_{2}}$ into the residual stream. This is how GPT-2 encodes known bigrams, hence the relevance.
And say that the number of known bigrams is quadratic in the number of hidden neurons, so that in particular there are more bigrams than residual stream dimensions. As far as I know, an appropriately randomly initialized network should be able to accomplish this task (or at least one with $W_{in}$ random).
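To make the setup concrete, here is a minimal NumPy sketch of a toy version of it. Everything in it is an assumption for illustration: the sizes, the Gaussian embeddings `A` and `B`, the ReLU with a fixed negative bias, and fitting `W_out` by least squares. It only shows that a frozen random $W_{in}$ can map more known bigrams than there are residual dimensions to arbitrary target vectors. It uses fewer bigrams than hidden neurons, so it does not test the quadratic regime, and it ignores interference and inputs that are not known bigrams.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen so that n_bigrams > d_resid.
vocab, d_resid, d_hidden = 50, 32, 512
n_bigrams = 200

# Assumed embeddings: row t of A (resp. B) is the residual-stream
# direction for token t in the first (resp. second) position.
A = rng.standard_normal((vocab, d_resid)) / np.sqrt(d_resid)
B = rng.standard_normal((vocab, d_resid)) / np.sqrt(d_resid)

# Sample distinct known bigrams (t1, t2) and a random target v_{t1 t2} for each.
pairs = rng.choice(vocab * vocab, size=n_bigrams, replace=False)
t1, t2 = np.divmod(pairs, vocab)
targets = rng.standard_normal((n_bigrams, d_resid))

# Residual-stream inputs of the form t1 A + t2 B.
X = A[t1] + B[t2]

# Frozen random W_in; the negative bias makes each neuron fire on only
# a fraction of the inputs.
W_in = rng.standard_normal((d_resid, d_hidden))
b_in = -1.0 * np.ones(d_hidden)
H = np.maximum(X @ W_in + b_in, 0.0)  # ReLU hidden activations

# Only W_out is fitted, by ordinary least squares.
W_out, *_ = np.linalg.lstsq(H, targets, rcond=None)
rel_err = np.linalg.norm(H @ W_out - targets) / np.linalg.norm(targets)
print(f"{n_bigrams} bigrams, d_resid={d_resid}, relative error {rel_err:.2e}")
```

In this sketch each hidden neuron fires on many bigrams, so the neuron basis of $W_{in}$ is not one-neuron-per-bigram even though the input-output behaviour is.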
Is the goal for SPD to learn components for $W_{in}$ such that any given component only fires non-negligibly on a single bigram? Or is it ok if components fire on multiple different bigrams? I am trying to reason through how SPD would act in this case.
- Lucius Bushnaq 30 Jun 2025 7:03 UTC
  4 points
  0
  I’d have to think about the exact setup here to make sure there are no weird caveats, but my first thought is that for $W_{in}$, this ought to be one component per bigram, firing exclusively for that bigram.