I don’t understand what you mean by “you can always scale the input instead”. And since the input to the MLP is the LayerNorm of the residual stream up to that point, the magnitude of the input is always the same.
I might be misremembering the GPT2 architecture, but I thought the output of the MLP layer was something like W_O ⋅ ReLU(W_in ⋅ LN(x_residual))? So you can just scale W_in up when you scale W_O down. Assuming I’m remembering the architecture correctly, if you’re concerned about the scale of the output, then I think it makes sense to look at the scale of W_in as well.
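A toy numeric check of that rescaling claim (not GPT-2 itself; the shapes and weights here are made-up stand-ins): with ReLU, multiplying W_in by a positive factor and dividing W_O by the same factor leaves the MLP’s output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 4, 8
W_in = rng.normal(size=(d_mlp, d_model))   # hypothetical MLP input weights
W_O = rng.normal(size=(d_model, d_mlp))    # hypothetical MLP output weights
x = rng.normal(size=d_model)               # stands in for LN(x_residual)

relu = lambda z: np.maximum(z, 0.0)
a = 5.0  # any positive rescaling factor

out = W_O @ relu(W_in @ x)
out_rescaled = (W_O / a) @ relu((a * W_in) @ x)
# out and out_rescaled agree, because ReLU(a*z) = a*ReLU(z) for a > 0
```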
We took the dot product over cosine similarity because the dot product is the neuron’s effect on the logits (since unembedding takes the dot product of the residual stream with the embedding matrix).
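A small sketch of that point, with a toy tied embedding/unembedding matrix (all shapes and values here are illustrative assumptions): if a neuron with activation c adds c · w_out to the residual stream, the change in each token’s logit is exactly c times the dot product of w_out with that token’s embedding; cosine similarity would rescale this by the norms.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d_model = 10, 16
E = rng.normal(size=(vocab, d_model))   # toy (un)embedding matrix
w_out = rng.normal(size=d_model)        # one neuron's output direction
resid = rng.normal(size=d_model)        # residual stream before the neuron writes
c = 2.0                                 # the neuron's activation

logits_before = E @ resid
logits_after = E @ (resid + c * w_out)
congruence = E @ w_out                  # dot product with each token's embedding
# logits_after - logits_before equals c * congruence, token by token
```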
I think your point about using the scale of W_in if we are concerned about the scale of W_O is fair — we didn’t really look at how the rest of the network interacts with this neuron through its input weights, but perhaps an input-scaled congruence score (e.g. output congruence × average squared input weight) could give us a better representation of a neuron’s relevance for a token.
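One way the hypothetical input-scaled score from the comment might look (this is a sketch of the suggestion, not an implemented metric; the weight vectors are toy placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, d_model = 10, 16
E = rng.normal(size=(vocab, d_model))   # toy embedding matrix
w_out = rng.normal(size=d_model)        # the neuron's output weights
w_in = rng.normal(size=d_model)         # the neuron's input weights

congruence = E @ w_out                              # plain output congruence per token
input_scaled_congruence = congruence * np.mean(w_in ** 2)
# scaling w_in up while scaling w_out down now moves the score,
# instead of leaving it sensitive only to the output side
```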
I do agree that looking at W_O alone seems a bit misguided (unless we’re normalizing by looking at cosine similarity instead of the dot product). However, the extent to which this is true is a bit unclear. Here are a few considerations:
At first blush, what you said is exactly right: scaling W_in up and W_O down by the same factor will leave the implemented function unchanged.
However, this’ll affect the L2 regularization penalty. All else equal, we’d expect to see ‖W_in‖ = ‖W_O‖, since rescaling by a factor a turns the penalty into a²‖W_in‖² + a⁻²‖W_O‖², which is minimized when the two norms are equal.
However, this is all complicated by the fact that you can alternatively scale the LayerNorm’s gain parameter, which (I think) isn’t regularized.
Lastly, I believe GPT2 uses GELU, not ReLU? This is significant: unlike ReLU, GELU is not positively homogeneous, so you can no longer scale W_in and W_O in opposite directions without changing the implemented function.
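The considerations above can be checked numerically. Below is a small sketch comparing ReLU with the exact (erf-based) GELU: ReLU satisfies ReLU(a·z) = a·ReLU(z) for a > 0, so the W_in/W_O rescaling cancels, while GELU does not satisfy this identity.

```python
import math
import numpy as np

def gelu(z):
    # exact GELU: 0.5 * z * (1 + erf(z / sqrt(2)))
    return 0.5 * z * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in z]))

relu = lambda z: np.maximum(z, 0.0)

z = np.array([-1.0, 0.5, 2.0])
a = 3.0  # a positive rescaling factor

relu_homogeneous = np.allclose(relu(a * z), a * relu(z))  # holds for ReLU
gelu_homogeneous = np.allclose(gelu(a * z), a * gelu(z))  # fails for GELU
```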
Thanks for clarifying the scales!