I don’t understand what you mean by “you can always scale the input instead”. And since the input to the MLP is the LayerNorm of the residual stream up to that point, the magnitude of the input is always the same.
I might be misremembering the GPT2 architecture, but I thought the output of the MLP layer was something like W_O ⋅ ReLU(W_in ⋅ LN(x_residual))? So you can just scale W_in up when you scale W_O down. Assuming I’m remembering the architecture correctly, if you’re concerned about the scale of the output, then I think it makes sense to look at the scale of W_in as well.
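A toy numeric check of that rescaling claim (not GPT-2 itself; the shapes and weights here are made-up stand-ins): with ReLU, multiplying W_in by a positive factor and dividing W_O by the same factor leaves the MLP’s output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp = 4, 8
W_in = rng.normal(size=(d_mlp, d_model))   # hypothetical MLP input weights
W_O = rng.normal(size=(d_model, d_mlp))    # hypothetical MLP output weights
x = rng.normal(size=d_model)               # stands in for LN(x_residual)

relu = lambda z: np.maximum(z, 0.0)
a = 5.0  # any positive rescaling factor

out = W_O @ relu(W_in @ x)
out_rescaled = (W_O / a) @ relu((a * W_in) @ x)
# out and out_rescaled agree, because ReLU(a*z) = a*ReLU(z) for a > 0
```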
We took the dot product over cosine similarity because the dot product is the neuron’s effect on the logits (since unembedding takes the dot product of the residual stream with the embedding matrix).
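A small sketch of that point, with a toy tied embedding/unembedding matrix (all shapes and values here are illustrative assumptions): if a neuron with activation c adds c · w_out to the residual stream, the change in each token’s logit is exactly c times the dot product of w_out with that token’s embedding; cosine similarity would rescale this by the norms.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d_model = 10, 16
E = rng.normal(size=(vocab, d_model))   # toy (un)embedding matrix
w_out = rng.normal(size=d_model)        # one neuron's output direction
resid = rng.normal(size=d_model)        # residual stream before the neuron writes
c = 2.0                                 # the neuron's activation

logits_before = E @ resid
logits_after = E @ (resid + c * w_out)
congruence = E @ w_out                  # dot product with each token's embedding
# logits_after - logits_before equals c * congruence, token by token
```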
I think your point about using the scale of W_in if we are concerned about the scale of W_O is fair — we didn’t really look at how the rest of the network interacts with this neuron through its input weights, but perhaps an input-scaled congruence score (e.g. output congruence × average squared input weight) could give us a better representation of a neuron’s relevance for a token.
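One way the hypothetical input-scaled score from the comment might look (this is a sketch of the suggestion, not an implemented metric; the weight vectors are toy placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, d_model = 10, 16
E = rng.normal(size=(vocab, d_model))   # toy embedding matrix
w_out = rng.normal(size=d_model)        # the neuron's output weights
w_in = rng.normal(size=d_model)         # the neuron's input weights

congruence = E @ w_out                              # plain output congruence per token
input_scaled_congruence = congruence * np.mean(w_in ** 2)
# scaling w_in up while scaling w_out down now moves the score,
# instead of leaving it sensitive only to the output side
```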
I do agree that looking at W_O alone seems a bit misguided (unless we’re normalizing by looking at cosine similarity instead of the dot product). However, the extent to which this is true is a bit unclear. Here are a few considerations:
At first blush, what you said is exactly right: scaling W_in up and W_O down by the same factor will leave the implemented function unchanged.
However, this’ll affect the L2 regularization penalty. All else equal, we’d expect to see ‖W_in‖ = ‖W_O‖, since rescaling by a factor a turns the penalty into a²‖W_in‖² + a⁻²‖W_O‖², which is minimized when the two norms are equal.
However, this is all complicated by the fact that you can alternatively scale the LayerNorm’s gain parameter, which (I think) isn’t regularized.
Lastly, I believe GPT2 uses GELU, not ReLU? This is significant: unlike ReLU, GELU is not positively homogeneous, so you can no longer scale W_in and W_O in opposite directions without changing the implemented function.
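The considerations above can be checked numerically. Below is a small sketch comparing ReLU with the exact (erf-based) GELU: ReLU satisfies ReLU(a·z) = a·ReLU(z) for a > 0, so the W_in/W_O rescaling cancels, while GELU does not satisfy this identity.

```python
import math
import numpy as np

def gelu(z):
    # exact GELU: 0.5 * z * (1 + erf(z / sqrt(2)))
    return 0.5 * z * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in z]))

relu = lambda z: np.maximum(z, 0.0)

z = np.array([-1.0, 0.5, 2.0])
a = 3.0  # a positive rescaling factor

relu_homogeneous = np.allclose(relu(a * z), a * relu(z))  # holds for ReLU
gelu_homogeneous = np.allclose(gelu(a * z), a * gelu(z))  # fails for GELU
```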
Thanks for clarifying the scales!