# Neel Nanda

Karma: 3,378
• Fyi, the final past upvotes link is to 2020, not 2021

• Huh, I had the mildly surprising (and depressing) experience of reading through all the posts with >100 karma in 2021, and observing that I just didn’t feel excited about the vast majority of them in hindsight. Solid data!

• 1 Dec 2022 22:43 UTC
LW: 5 AF: 2
4 ∶ 0
AF

Before we even start a training run, we should try to have *actually good* abstract arguments about alignment properties of the AI. Interpretability work is easier if you’re just trying to check details relevant to those arguments, rather than trying to figure out the whole AI.

Thanks for the post! I particularly appreciated this point

• 1 Dec 2022 12:28 UTC
LW: 7 AF: 4
0 ∶ 0
AF

Thanks a lot for writing up this post! This felt much clearer and more compelling to me than the earlier versions I’d heard, and I broadly buy that this is a lot of what was going on with the phase transitions in my grokking work.

The algebra in the rank-1 learning section was pretty dense and not how I would have phrased it, so here’s my attempt to put it in my own language:

We want to fit to some fixed rank 1 matrix $C = ab^T$, with two learned vectors $x$, $y$, forming $Z = xy^T$. Our objective function is $L = \|C - Z\|_F^2$. Rank one matrix facts: $\|ab^T\|_F^2 = \|a\|^2\|b\|^2$ and $\langle ab^T, xy^T \rangle = (a \cdot x)(b \cdot y)$.

So our loss function is now $L = \|a\|^2\|b\|^2 - 2(a \cdot x)(b \cdot y) + \|x\|^2\|y\|^2$. So what’s the derivative with respect to $x$? This is the same question as “what’s the best linear approximation to how this function changes when $x \to x + dx$?”. Here we can just directly read this off as $\frac{\partial L}{\partial x} = -2(b \cdot y)\,a + 2\|y\|^2\,x$

The second term is an exponential decay term, assuming the size of y is constant (in practice this is probably a good enough assumption). The first term is the actual signal, moving x along the correct direction, but it is proportional to how well the other part is doing, which starts bad and then improves, creating the self-reinforcing dynamics that make learning start slow and then accelerate.

Another rephrasing—x consists of a component in the correct direction (a), and the rest of x is irrelevant. Ditto y. The components in the correct directions reinforce each other, and all components experience exponential-ish decay, because MSE loss wants everything not actively contributing to be small. At the start, the irrelevant components are way bigger (because they’re in the rank 99 orthogonal subspace to a), and they rapidly decay, while the correct component slowly grows. This is a slight decrease in loss, but mostly a plateau. Then once the irrelevant component is small and the correct component has gotten bigger, the correct signal dominates. Eventually, the exponential decay is strong enough in the correct direction to balance out the incentive for future growth.

Generalising to higher dimensional subspaces, the “correct” and “incorrect” components correspond to the restriction to the subspace spanned by the a terms, and to the complement of that, respectively; but so long as the subspace is low rank, “the irrelevant component is bigger, so it initially dominates” still holds.
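For concreteness, here is a quick numerical sketch of these rank-1 dynamics (the dimension, learning rate, and init scale are all made up for illustration): plain gradient descent on $L = \|ab^T - xy^T\|_F^2$, tracking the correct component of x and the irrelevant rank-99 remainder.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
a = np.zeros(d); a[0] = 1.0  # target direction for x (unit vector)
b = np.zeros(d); b[0] = 1.0  # target direction for y
# Small random init: the "correct" components a.x, b.y start tiny
# compared to the rank-99 irrelevant rest
x = rng.normal(scale=0.1, size=d)
y = rng.normal(scale=0.1, size=d)

lr = 0.05
signal, noise = [], []
for _ in range(2000):
    # gradients of L = |a|^2 |b|^2 - 2 (a.x)(b.y) + |x|^2 |y|^2
    gx = -2 * (b @ y) * a + 2 * (y @ y) * x
    gy = -2 * (a @ x) * b + 2 * (x @ x) * y
    x = x - lr * gx
    y = y - lr * gy
    signal.append(a @ x)                           # correct component of x
    noise.append(np.linalg.norm(x - (a @ x) * a))  # irrelevant component of x

# The irrelevant component decays quickly, while the correct component
# plateaus and then grows in an S-curve until |a.x| is roughly 1
print(round(abs(signal[-1]), 3), round(noise[-1], 8))
```

Plotting `signal` and `noise` against step count shows the shape described above: a loss plateau while the irrelevant components decay, then a rapid rise as the correct components reinforce each other.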

My remaining questions—I’d love to hear takes:

• The rank 2 case feels qualitatively different from the rank 1 case because there’s now a symmetry to break—will the first component of Z match the first or second component of C? Intuitively, breaking symmetries will create another S-shaped vibe, because the signal for getting close to the midpoint is high, while the signal to favour either specific component is lower.

• What happens in a cross-entropy loss style setup, rather than MSE loss? IMO cross-entropy loss is a better analogue to real networks. Though I’m confused about the right way to model an internal sub-circuit of the model. I think the exponential decay term just isn’t there?

• How does this interact with weight decay? This seems to give an intrinsic exponential decay to everything

• How does this interact with softmax? Intuitively, softmax feels “S-curve-ey”

• How does this interact with Adam? In particular, Adam gets super messy because you can’t just disentangle things

• Even worse, how does it interact with AdamW?

• Thanks for sharing this! I’m excited to see more interpretability posts. (Though this felt far too high production value—more posts, shorter posts and lower effort per post plz)

If we plot the distribution of the singular vectors, we can see that the rank only slowly decreases until 64 then rapidly decreases. This is because, fundamentally, the OV matrix is only of rank 64. The singular value distribution of the meaningful ranks, however, declines slowly in log-space, giving at least some evidence towards the idea that the network is utilizing most of the ‘space’ available in this OV circuit head.

Quick feedback that the graph after this paragraph feels sketchy to me—obviously the singular values are zero beyond 64, and they’re so far down that all the singular values above them look identical. But the y axis is screwed up, so you can’t really see this. What does the graph look like if you fix it? To me, it looks like there’s actually some sparsity and the early singular values are far larger (there looks to be a big kink at the start, though it looks tiny because we’re so zoomed out).

I also personally think that a linear scale is often more principled for a spectrum graph, but not confident in that take.
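As a toy illustration of the rank point (with made-up dimensions, not the actual model’s): an OV matrix factors through the head dimension, so it has at most 64 nonzero singular values, and everything past index 64 is pure floating-point noise pinned to the bottom of a log-scale axis.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 256, 64  # hypothetical sizes for illustration
# The OV circuit is a product through the 64-dim head, so rank(OV) <= 64
W_V = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_O = rng.normal(size=(d_head, d_model)) / np.sqrt(d_head)
OV = W_V @ W_O

s = np.linalg.svd(OV, compute_uv=False)
n_meaningful = int((s > 1e-10 * s[0]).sum())
print(n_meaningful)          # 64: everything beyond is numerical noise
print(s[0] / s[d_head - 1])  # spread across the meaningful singular values
```

The last ratio is what actually carries information in such a plot: how the 64 meaningful singular values decay, which the broken y axis in the original graph hides.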

• 24 Nov 2022 15:11 UTC
LW: 37 AF: 17
13 ∶ 1
AF
in reply to: Ben Pace’s comment

I’ll mention here that from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like “come in, the water is fine, don’t worry, you won’t end up with people criticizing you for maybe ending civilization or self-deceiving along the way or call you unethical”. While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trade, I care more about being frank and honest in my interactions, and a route must be found where communicating such things (insofar as that’s what someone believes) isn’t going to destroy or end the coordination/​trade agreement. Speaking the truth is not something to be traded away, however costly it may be.

I can’t comment on Conjecture’s coordination efforts specifically, but I fairly strongly disagree with this as a philosophy of coordination. There exist a lot of people in the world who have massive empirical or ethical disagreements with me that lead to them taking actions I think range from misguided to actively harmful to extremely dangerous. But I think that this often is either logical or understandable from their perspective. I think that being able to communicate productively with these people, see things from their point of view, and work towards common ground is a valuable skill, and an important part of the spirit of cooperation. For example, I think that Leah Garces’s work cooperating with chicken farmers to reduce factory farming is admirable and worthwhile, and I imagine she isn’t always frank and honest with people.

In particular, I think that being frank and honest in this context can basically kill possible cooperation. And good cooperation can lead to things being better by everyone’s lights, so this is a large and important cost not worth taking lightly. Not everyone has to strive for cooperation, but I think it’s very important that at least some people do! I do think that being so cooperative that you lose track of what you personally believe can be misguided and corrosive, but that there’s a big difference between having clear internal beliefs and needing to express all of those beliefs.

• 23 Nov 2022 23:52 UTC
LW: 35 AF: 13
6 ∶ 0
AF

Another reason for publishing more quickly is that conversations with many interpretability researchers have led us to believe that there is a wealth of knowledge in short experiments and unpublished research that really should be shared. We’d encourage other organizations who think similarly to post frequently, and share results even if they’re not completely polished.

Extremely strong +1! There is so much low-hanging fruit in mechanistic interpretability (of the flavour “I could probably spend a full-time day working on this and find something mildly cool worth writing up into a rough blog post”). I would love a wealth of these posts to exist that I could point people to and read myself! I’ve tried to set myself a much lower bar for this, and have still mostly procrastinated. I would love to see others do the same.

EDIT: This is also a comparative advantage of being an org outside academia whose employees mostly aren’t aiming for a future career in academia. I gather that in standard academic incentives, being scooped on your research makes the work much less impressive and publishable and can be bad for your career, disincentivising discussing partial results, especially in public. This seems pretty crippling to having healthy and collaborative discourse, but it’s also hard to fault people for following their incentives!

More generally, I really appreciate the reflective tone and candour of this post! I broadly agree with the main themes, including that Conjecture hasn’t really taken actions that cut at the hard core of alignment, and these reflections seem plausible to me, both re the concrete but fixable mistakes and re the deeper and more difficult problems. I look forward to seeing what you do next!

# A Walk­through of In-Con­text Learn­ing and In­duc­tion Heads (w/​ Charles Frye) Part 1 of 2

22 Nov 2022 17:12 UTC
15 points

# Re­sults from the in­ter­pretabil­ity hackathon

17 Nov 2022 14:51 UTC
80 points
• Haven’t checked lol

• I do not think that that link is a helpful resource for figuring out the implications of the news right now. I would be very surprised if Bloomberg were that on the ball!

• See my other comment—it turns out to be the boring fact that there’s a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed it looks normal. (Though the fact that this exists is interesting! I wonder what that direction is used for?)

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

I mean that, as far as I can tell (medium confidence), attn0 in GPT-2 isn’t used for much, and MLP0 contains most of the information about the value of the token at each position. Eg, ablating MLP0 completely kills performance, while ablating other MLPs doesn’t. And generally, the kinds of tasks that I’d expect to depend on the token value depend substantially on MLP0.

• Thanks for clarifying your position, that all makes sense.

I’d argue that most of the updating should already have been done already, not even based on Chris Olah’s work, but on neuroscientists working out things like the toad’s prey-detection circuits.

Huh, can you say more about this? I’m not familiar with that example (though have a fairly strong prior on there being at best a weak association between specific neuroscience results + specific AI interp results)

• Just dug into it more, the GPT-Neo embed just has a large constant offset. Average norm is 11.4, norm of mean is 11. Avg cosine sim is 0.93 before, after subtracting the mean it’s 0.0024 (avg absolute value of cosine sim is 0.1831)
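This “constant offset dominates cosine sim” effect is easy to reproduce with synthetic data. Below is a sketch with made-up embeddings whose norms are chosen to roughly match the numbers above (shared offset of norm 11, residual noise of norm ~2.8):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 768  # number of fake embedding vectors, embedding dim (made up)
offset = rng.normal(size=d)
offset *= 11.0 / np.linalg.norm(offset)          # shared offset, norm 11
E = offset + rng.normal(scale=0.1, size=(n, d))  # per-vector noise, norm ~2.8

def avg_cos(X):
    """Mean off-diagonal pairwise cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = Xn @ Xn.T
    m = len(X)
    return (G.sum() - m) / (m * (m - 1))

print(avg_cos(E))               # ~0.94: dominated by the shared offset
print(avg_cos(E - E.mean(0)))   # ~0: near-orthogonal once the mean is removed
```

The raw average cosine sim is high purely because every vector shares the large offset direction; subtracting the mean recovers the expected near-zero similarity of high-dimensional noise.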

• avg. pairwise cosine similarity is 0.960

Wait, WTF? Are you sure? 0.96 is super high. The only explanation I can see for that is a massive constant offset dominating the cosine sim (which isn’t crazy tbh).

The Colab claims that the logit lens doesn’t work for GPT-Neo, but does work if you include the final block, which seems sane to me. I think that in GPT-2 the MLP0 is basically part of the embed, so it doesn’t seem crazy for the inverse to be true (esp if you do the dumb thing of making your embedding + unembedding matrix the same)

• 8 Nov 2022 19:38 UTC
LW: 4 AF: 2
1 ∶ 0
AF

Really interesting, thanks for sharing!

I find it super surprising that the tasks worked up until Gopher, but stopped working at PaLM. That’s such a narrow gap! That alone suggests some kind of interesting meta-level point re inverse scaling being rare, and that in fact the prize mostly picked up on the adverse selection of “the tasks that were inverse-y enough to not have issues on the models used”.

One prediction this hypothesis makes is that people were overfitting to “what can GPT-3 not do”, and thus that there are a bunch of submitted tasks that were U-shaped by Gopher, and the winning ones were just the ones that were U-shaped a bit beyond Gopher?

I’m also v curious how well these work on Chinchilla.