Aryaman Arora

Karma: 11

Aryaman Arora 8 Nov 2022 14:54 UTC
LW: 3 AF: 1
0
AF
on: A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Understand IOI in GPT-Neo: it’s a same size model but does IOI via composition of MLPs

GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.

Aryaman Arora 8 Nov 2022 19:58 UTC
1 point
0
AF
in reply to: Neel Nanda’s comment on: A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)
Huh interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have that—some ideas to consider:
- the backup heads could have other main functions but incidentally are useful for the specific task we’re looking at, so they end up taking the place of the main heads
- thinking of virtual attention heads, the computations performed are not easily interpretable at the individual head-level once you have a lot of layers, sort of like how neurons aren’t interpretable in big models due to superposition
Re: GPT-Neo being weird, one of the colabs in the original logit lens post shows that logit lens is pretty decent for standard GPT-2 of varying sizes but basically useless for GPT-Neo, i.e. it outputs some extremely unlikely tokens for every layer before the last one. The bigger GPT-Neos are a bit better (some layers are kinda interpretable with logit lens) but still bad. Basically, the residual stream is just in a totally wacky basis until the last layer’s computations, unlike GPT-2, which shows more stability (the whole reason logit lens works).

One weird thing I noticed with GPT-Neo 125M’s embedding matrix is that the input static embeddings are super concentrated in vector space, avg. pairwise cosine similarity is 0.960 compared to GPT-2 small’s 0.225.
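
For anyone who wants to reproduce that kind of number, here is a minimal sketch of the metric as I understand it: the mean cosine similarity over all unordered pairs of embedding rows. It runs on a random toy matrix rather than real model weights; `toy_embed` and the helper names are made up for illustration, and a real check would load the model's input embedding matrix instead.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(rows):
    """Average cosine similarity over all unordered pairs of distinct rows."""
    sims = [cosine(u, v) for u, v in combinations(rows, 2)]
    return sum(sims) / len(sims)


# Toy stand-in for an embedding matrix (assumption: NOT real model weights).
rng = random.Random(0)
toy_embed = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(40)]
toy_mean = mean_pairwise_cosine(toy_embed)
print(f"mean pairwise cosine: {toy_mean:.3f}")
```

This is the quadratic, loop-over-pairs version; for a full vocabulary you would probably want to normalise the rows once and use a matrix product instead.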

On the later layers not doing much, I saw some discussion on the EleutherAI discord that probes can recover really good logit distributions from the middle layers of the big GPT-Neo models. I haven’t looked into this more myself so I don’t know how it compares to GPT-2. Just seems to be an overall profoundly strange model.

Aryaman Arora 8 Nov 2022 20:20 UTC
LW: 3 AF: 1
0
AF
in reply to: Neel Nanda’s comment on: A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)
I’m pretty sure! I don’t think I messed up anywhere in my code (just nested for loop lol). An interesting consequence of this is that for GPT-2, applying logit lens to the embedding matrix (i.e. $\mathrm{softmax}(W_E W_U) = \mathrm{softmax}(W_E W_E^T)$, with tied weights so that $W_U = W_E^T$) gives us a near-perfect autoencoder (the top output is the token fed in itself), but for GPT-Neo it always gets us the token whose embedding has the largest magnitude, since in the dot product $x \cdot y = \|x\| \|y\| \cos(\theta)$ the cosine term is nearly the same for every pair and so contributes almost nothing to the ranking.
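
A toy illustration of that argument, under loud assumptions: the two matrices below are hand-built stand-ins, not GPT-2 or GPT-Neo weights, and `top_token`, `spread`, and `bunched` are names I made up. Since softmax is monotonic, the top output is just the row-wise argmax of $W_E W_E^T$. Unit-norm random rows recover themselves; rows that all point in nearly the same direction but differ in norm all map to the largest-norm row.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_token(W_E):
    """For each row i, the index j maximising (W_E W_E^T)[i][j]."""
    return [max(range(len(W_E)), key=lambda j: dot(x, W_E[j])) for x in W_E]


rng = random.Random(0)

# Case 1 (toy): spread-out directions, all with unit norm.
raw = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]
spread = [[a / math.sqrt(dot(v, v)) for a in v] for v in raw]

# Case 2 (toy): one shared direction plus tiny noise, scaled to different norms.
scales = [2.0, 1.0, 3.5, 1.5, 3.0, 2.5]  # hypothetical norms; index 2 is largest
shared = [1.0] + [0.0] * 7
bunched = [[s * (d + rng.uniform(-0.005, 0.005)) for d in shared] for s in scales]

spread_top = top_token(spread)    # each token's top output is itself
bunched_top = top_token(bunched)  # every token's top output is token 2
print(spread_top, bunched_top)
```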

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

Aryaman Arora 9 Nov 2022 0:32 UTC
1 point
0
AF
in reply to: Neel Nanda’s comment on: A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)
Cool that you figured that out, easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since that means GPT-Neo’s later layers have to do computation taking that into account, which seems not at all like an efficient decision. I will def poke around more.
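
As a sanity check on why a shared offset alone would produce numbers like that, here is a small sketch on synthetic vectors. Everything in it is an assumption for illustration: the Gaussian `base` vectors, the `offset` of 10 per coordinate, and the idea of subtracting the mean row are mine, not anything measured on GPT-Neo.

```python
import math
import random
from itertools import combinations


def mean_pairwise_cosine(rows):
    """Average cosine similarity over all unordered pairs of rows."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den
    sims = [cos(u, v) for u, v in combinations(rows, 2)]
    return sum(sims) / len(sims)


rng = random.Random(0)
base = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(30)]

# Add the same large vector to every row (hypothetical offset).
offset = [10.0] * 16
shifted = [[x + o for x, o in zip(row, offset)] for row in base]

# Remove the shared component again by subtracting the mean row.
mean_vec = [sum(col) / len(shifted) for col in zip(*shifted)]
centred = [[x - m for x, m in zip(row, mean_vec)] for row in shifted]

sim_base = mean_pairwise_cosine(base)        # near 0: directions are spread out
sim_shifted = mean_pairwise_cosine(shifted)  # near 1: the offset dominates
sim_centred = mean_pairwise_cosine(centred)  # near 0 again after centring
print(sim_base, sim_shifted, sim_centred)
```

If the real embeddings behave like `shifted`, then computing the cosine similarity after centring would be the more informative comparison with GPT-2.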

Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?

Aryaman Arora 6 Feb 2023 0:25 UTC
6 points
0
on: SolidGoldMagikarp (plus, prompt generation)
I’ll just preregister that I bet these weird tokens have very large norms in the embedding space.

Aryaman Arora 29 Mar 2023 20:19 UTC
3 points
2
on: Some common confusion about induction heads
Really nice summarisation of the confusion. Re: your point 3, it makes “induction heads” as a class of things feel a lot less coherent :( I had also not considered that the behaviour on random sequences could be showing induction as a fallback. Do you think there may be induction-y heads that simply don’t activate on random sequences because those sequences are so out-of-distribution?