Noa Nabeshima

Karma: 61

Noa Nabeshima 23 Sep 2022 21:53 UTC
LW: 13 AF: 7
4
AF
in reply to: Neel Nanda’s comment on: Interpreting Neural Networks through the Polytope Lens
I think at least some GPT2 models have a really high-magnitude direction in their residual stream that might be used to preserve some scale information after LayerNorm. [I think Adam Scherlis originally mentioned or showed the direction to me, but maybe someone else?]. It’s maybe akin to the water-droplet artifacts in StyleGAN touched on here: https://arxiv.org/pdf/1912.04958.pdf
We begin by observing that most images generated by StyleGAN exhibit characteristic blob-shaped artifacts that resemble water droplets. As shown in Figure 1, even when the droplet may not be obvious in the final image, it is present in the intermediate feature maps of the generator.1 The anomaly starts to appear around 64×64 resolution, is present in all feature maps, and becomes progressively stronger at higher resolutions. The existence of such a consistent artifact is puzzling, as the discriminator should be able to detect it. We pinpoint the problem to the AdaIN operation that normalizes the mean and variance of each feature map separately, thereby potentially destroying any information found in the magnitudes of the features relative to each other. We hypothesize that the droplet artifact is a result of the generator intentionally sneaking signal strength information past instance normalization: by creating a strong, localized spike that dominates the statistics, the generator can effectively scale the signal as it likes elsewhere. Our hypothesis is supported by the finding that when the normalization step is removed from the generator, as detailed below, the droplet artifacts disappear completely.
What links here?
- cherrvak's comment on Basic Facts about Language Model Internals by beren (23 Feb 2023 19:58 UTC; 1 point)

Noa Nabeshima 15 Jan 2024 22:52 UTC
11 points
2
on: Sparse MLP Distillation
I’ve trained some sparse MLPs with 20K neurons on a 4L TinyStories model with ReLU activations and no layernorm and I took a look at them after reading this post. For varying integer $S$ , I applied an L1 penalty of $2^{S}$ on the average of the activations per token, which seems pretty close to doing an L1 of $2^{S} / 20, 000$ on the sum of the activations per token. Your L1 of $2 \times 10^{- 4}$ with 12K neurons is sort of like $S = 2$ in my setup. After reading your post, I checked out the cosine similarity between encoder/decoder of original mlp neurons and sparse mlp neurons for varying values of $S$ (make sure to scroll down once you click one of the links!):

S=3
https://plotly-figs.s3.amazonaws.com/sparse_mlp_L1_2exp3
S=4
https://plotly-figs.s3.amazonaws.com/sparse_mlp_L1_2exp4

S=5
https://plotly-figs.s3.amazonaws.com/sparse_mlp_L1_2exp5
S=6
https://plotly-figs.s3.amazonaws.com/sparse_mlp_L1_2exp6
I think the behavior you’re pointing at is clearly there at lower L1s on layers other than layer 0 (? what’s up with that?) and sort of decreases with higher L1 values, to the point that the behavior is there a bit at S=5 and almost not there at S=6. I think the non-dead sparse neurons are almost all interpretable at S=5 and S=6.

Original val loss of model: 1.128 ~= 1.13.
Zero ablation of MLP loss values per layer: [3.72, 1.84, 1.56, 2.07].

S=6 loss recovered per layer

Layer 0: 1-(1.24-1.13)/(3.72-1.13): 96% of loss recovered
Layer 1: 1-(1.18-1.13)/(1.84-1.13): 93% of loss recovered
Layer 2: 1-(1.21-1.13)/(1.56-1.13): 81% of loss recovered
Layer 3: 1-(1.26-1.13)/(2.07-1.13): 86% of loss recovered

Compare to 79% of loss-recovered from Anthropic’s A/1 autoencoder with 4K features and a pretty different setup.

(Also, I was going to focus on S=5 MLPs for layers 1 and 2, but now I think I might instead stick with S=6. This is a little tricky because I wouldn’t be surprised if tiny-stories MLP neurons are interpretable at higher rates than other models.)

Basically I think sparse MLPs aren’t a dead end and that you probably just want a higher L1.

Noa Nabeshima 16 May 2022 21:48 UTC
6 points
in reply to: Aiyen’s comment on: Deepmind’s Gato: Generalist Agent
No, I’m pretty confident every expert is a neural network policy trained on the task. See “F. Data Collection Details” and the second paragraph of “3.3. Robotics—RGB Stacking Benchmark (real and sim)”

Noa Nabeshima 13 Dec 2022 6:39 UTC
LW: 5 AF: 3
0
AF
on: An exploration of GPT-2′s embedding weights
I think this post is great and I’m really happy that it’s published.

Noa Nabeshima 2 Jul 2023 18:59 UTC
2 points
0
on: faster latent diffusion
seems related to: https://arxiv.org/pdf/2303.01469.pdf

Noa Nabeshima 25 Jan 2021 21:27 UTC
LW: 2 AF: 2
AF
on: Clarifying some key hypotheses in AI alignment
What software did you use to produce this diagram?

Noa Nabeshima 2 Mar 2024 1:58 UTC
LW: 1 AF: 1
0
AF
on: Sparse Autoencoders Work on Attention Layer Outputs
I wonder if multiple heads having the same activation pattern in a context is related to the limited rank per head; once the VO subspace of each head is saturated with meaningful directions/features maybe the model uses multiple heads to write out features that can’t be placed in the subspace of any one head.

Noa Nabeshima 18 Jan 2024 1:41 UTC
1 point
in reply to: Richard Korzekwa ’s comment on: Why indoor lighting is hard to get right and how to fix it
Do you have any updates on this? I’m interested in this.

Noa Nabeshima 21 Dec 2023 7:31 UTC
1 point
0
on: My best guess at the important tricks for training 1L SAEs
This link is a broken

Noa Nabeshima 21 Dec 2023 7:28 UTC
1 point
6
on: My best guess at the important tricks for training 1L SAEs
This is great.

Noa Nabeshima 13 Jul 2023 17:08 UTC
LW: 1 AF: 1
0
AF
on: Really Strong Features Found in Residual Stream
[word] and [word]
can be thought of as “the previous token is ′ and’.”

It might just be one of a family of linear features or ?? aspect of some other representation ?? corresponding to what the previous token is, to be used for at least induction head.

Maybe the reason you found ′ and’ first is because ′ and’ is an especially frequent word. If you train on the normal document distribution, you’ll find the most frequent features first.

Noa Nabeshima 3 Jul 2023 0:19 UTC
1 point
0
in reply to: bhauth’s comment on: faster latent diffusion
Consistency models are trained from scratch in the paper in addition to distilled from diffusion models. I think it’ll probably just work with text-conditioned generation, but unclear to me w/o much thought how to do the equivalent of classifier-free guidance.

Noa Nabeshima 29 Nov 2022 21:14 UTC
LW: 1 AF: 1
0
AF
on: The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
I really appreciate this work!

I wonder if the reason MLPs are more polysemantic isn’t because there are fewer MLPs than heads but because the MLP matrices are larger—

Suppose the model is storing information as sparse [rays or directions]. Then SVD on large matrices like the token embeddings can misunderstand the model in different ways:

- Many of the sparse rays/directions won’t be picked up by SVD. If there are 10,000 rays/directions used by the model and the model dimension is 768, SVD can only pick 768 directions.
- If the model natively stores information as rays, then SVD is looking for the wrong thing: directions instead of rays. If you think of SVD as a greedy search for the most important directions, the error might increase as the importance of the direction decreases.
- Because the model is storing things sparsely, it can squeeze in far more meaningful directions than the model dimension. But these directions can’t be perfectly orthogonal, they have to interfere with each other at least a bit. This noise could make SVD with large matrices worse and also means that the assumptions involved in SVD are wrong.

As evidence for the above story, I notice that the earliest PCA directions on the token embeddings are interpretable, but they quickly become less interpretable?

Maybe because the QK/OV matrices have low rank they specialize in a small number of the sparse directions (possibly greater than their rank) and have less interference noise. These could contribute to interpretability of SVD directions.

You might expect in this world that the QK/OV SVD directions would be more interpretable than the MLP matrices which would in turn be more interpretable than the token embedding SVD.