Exploring vocabulary alignment of neurons in Llama-3.2-1B
(This is cross-posted from my blog at https://grgv.xyz/blog/neurons1/. I’m looking for feedback: does it make sense at all, and is there any novelty? Also, do the follow-up questions/directions make sense?)
While applying logit attribution analysis to transformer outputs, I have noticed that in many cases the generated token can be attributed to the output of a single neuron.
One way to analyze neuron activations is to collect activations over a dataset of text snippets, as in “Exploring Llama-3-8B MLP Neurons” [1]. This shows that some neurons are strongly activated by a specific token from the model’s vocabulary; for example, see the “Android” neuron: https://neuralblog.github.io/llama3-neurons/neuron_viewer.html#0,2
Another way to analyze neurons is to apply logit lens to the MLP weights, similar to “Analyzing Transformers in Embedding Space” [2], where model parameters are projected into the embedding space for interpretation.
Projecting neurons into vocabulary space
Let’s apply logit lens to a sample of MLP output weights for layer 13 of Llama-3.2-1B:
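Concretely, this amounts to taking the dot product of each MLP output weight column with the token embedding matrix (the post measures proximity to the token embeddings, so embed_tokens is used directly). A minimal sketch of this, assuming the standard Hugging Face module layout for Llama; the exact code is in the notebook linked at the end:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

layer = 13
# Each column of down_proj is the direction one MLP neuron writes into
# the residual stream: shape (d_model, d_mlp) = (2048, 8192) for this model.
W_out = model.model.layers[layer].mlp.down_proj.weight.detach()
E = model.model.embed_tokens.weight.detach()  # (vocab_size, d_model)

# Logit-lens-style projection for a single (arbitrary, illustrative) neuron:
# dot products with every vocabulary embedding, then the closest tokens.
neuron = 123
scores = E @ W_out[:, neuron]  # (vocab_size,)
vals, ids = scores.topk(10)
for v, i in zip(vals.tolist(), ids.tolist()):
    print(f"{v:+.3f}  {tok.decode([i])!r}")
```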
It’s easy to spot a pattern – some neurons are more closely aligned to a cluster of semantically-similar tokens, like:
Other neurons look much more random in their proximity to vocabulary embeddings, roughly equally dissimilar to various unrelated tokens:
Quantifying vocabulary alignment
The minimal distance (i.e., maximum dot product) between a neuron’s output direction and any vocabulary-token embedding looks like a good measure of how vocabulary-aligned the neuron is.
In the previous example, this is the first number in each row:
Plotting these values for all neurons of layer 13:
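A sketch of computing this score for every neuron in the layer, continuing from the snippet above (chunked so the full vocab × d_mlp score matrix never has to be held in memory; the variable names are mine, not necessarily the notebook’s):

```python
import matplotlib.pyplot as plt

d_mlp = W_out.shape[1]
max_dot = torch.empty(d_mlp)
with torch.no_grad():
    for s in range(0, d_mlp, 1024):
        # Dot products of a chunk of neurons with all vocabulary embeddings
        chunk_scores = E @ W_out[:, s:s + 1024]  # (vocab_size, <=1024)
        max_dot[s:s + 1024] = chunk_scores.max(dim=0).values

plt.figure(figsize=(9, 3))
plt.plot(max_dot.numpy(), lw=0.5)
plt.xlabel("neuron index")
plt.ylabel("max dot product with vocabulary embeddings")
plt.show()
```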
This plot is not very informative. Let’s look at the distribution instead:
The distribution is non-symmetric: there is a long tail of neurons that are close to vocabulary tokens.
Sorting the neurons by max dot product highlights the shape of the distribution even better: there is a significant number of neurons whose outputs are aligned with vocabulary embeddings.
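Under the same assumptions as above, the histogram and the sorted view are each a couple of lines:

```python
# Distribution of alignment scores: asymmetric, with a long right tail
plt.hist(max_dot.numpy(), bins=100)
plt.xlabel("max dot product")
plt.ylabel("neuron count")
plt.show()

# Neurons ranked by alignment score
plt.plot(max_dot.sort(descending=True).values.numpy())
plt.xlabel("neuron rank")
plt.ylabel("max dot product")
plt.show()
```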
Extending to other layers
This visualization can be repeated for the MLPs in all other layers. Looking at all the distributions, the majority of neurons that are strongly aligned with the vocabulary are in the later blocks:
It’s easier to see the difference with separate plots:
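A sketch of the per-layer comparison, wrapping the computation above into a helper and drawing one histogram per layer (Llama-3.2-1B has 16 decoder layers, hence the 4×4 grid; again an illustration, not the notebook’s exact code):

```python
def layer_max_dot(layer_idx, chunk=1024):
    """Max dot product with any vocabulary embedding, per neuron of one MLP."""
    W = model.model.layers[layer_idx].mlp.down_proj.weight.detach()
    out = torch.empty(W.shape[1])
    with torch.no_grad():
        for s in range(0, W.shape[1], chunk):
            out[s:s + chunk] = (E @ W[:, s:s + chunk]).max(dim=0).values
    return out

per_layer = [layer_max_dot(i) for i in range(len(model.model.layers))]

fig, axes = plt.subplots(4, 4, figsize=(12, 8), sharex=True)  # 16 layers
for i, ax in enumerate(axes.flat):
    ax.hist(per_layer[i].numpy(), bins=60)
    ax.set_title(f"layer {i}")
fig.tight_layout()
plt.show()
```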
In summary, strong vocabulary alignment is clearly visible in a subset of neurons – especially in later layers. This opens up several follow-up questions:
Do neurons that are close to a vocabulary embedding represent only one specific token, or do they represent a more abstract concept that just happens to be near a token’s embedding? Does a small distance to the vocabulary correlate with monosemanticity?
What is the functional role of strong vocabulary alignment? Are these neurons a mechanism for translating concepts from the model’s internal representation back into token space, or do they play some other role?
What is the coverage of this representation? Do all important tokens have a corresponding “vocabulary neuron”, or is this specialization reserved for only a subset of tokens? If so, why?
Code
The notebook with the code is on github: https://github.com/coolvision/interp/blob/main/LLaMA_jun_4_2025_neurons.ipynb
References
[1] Nguyễn, Thông. 2024. “Llama-3-8B MLP Neurons.” https://neuralblog.github.io/llama3-neurons.
[2] Dar, G., Geva, M., Gupta, A., and Berant, J. 2022. “Analyzing Transformers in Embedding Space.” arXiv preprint arXiv:2209.02535.
[3] nostalgebraist. 2020. “interpreting GPT: the logit lens.” https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens