I’ll mention here that from what little I saw, your coordination efforts seemed a bit misguided and harmful. I am left with a sense that you wanted everyone to be friends and not feel threatened by interacting with others, to feel like “come in, the water is fine, don’t worry, you won’t end up with people criticizing you for maybe ending civilization or self-deceiving along the way or calling you unethical”. While I am pretty open to fairly respectful coordination and am a strong fan of finding positive-sum trade, I care more about being frank and honest in my interactions, and a route must be found where communicating such things (insofar as that’s what someone believes) isn’t going to destroy or end the coordination/trade agreement. Speaking the truth is not something to be traded away, however costly it may be.
I can’t comment on Conjecture’s coordination efforts specifically, but I fairly strongly disagree with this as a philosophy of coordination. There exist a lot of people in the world who have massive empirical or ethical disagreements with me that lead them to take actions I think range from misguided to actively harmful to extremely dangerous. But I think that this is often either logical or understandable from their perspective. I think that being able to communicate productively with these people, see things from their point of view, and work towards common ground is a valuable skill, and an important part of the spirit of cooperation. For example, I think that Leah Garces’s work cooperating with chicken farmers to reduce factory farming is admirable and worthwhile, and I imagine she isn’t always frank and honest with people.
In particular, I think that being frank and honest in this context can basically kill possible cooperation. And good cooperation can lead to things being better by everyone’s lights, so this is a large and important cost not worth taking lightly. Not everyone has to strive for cooperation, but I think it’s very important that at least some people do! I do think that being so cooperative that you lose track of what you personally believe can be misguided and corrosive, but that there’s a big difference between having clear internal beliefs and needing to express all of those beliefs.
TLDR: The model ignores weird tokens when learning the embedding, and never predicts them in the output. In GPT-3 this means the model breaks a bit when a weird token is in the input, and will refuse to ever output it, because it has hard-coded the frequency statistics, and its “repeat this token” circuits don’t work on tokens it never needed them for. In GPT-2, unlike GPT-3, embeddings are tied, meaning W_U = W_E.T, which explains much of the weird shit you see, because this is actually behaviour in the unembedding, not the embedding (weird tokens never come up in the text and should never be predicted, so there’s a lot of gradient signal in the unembed, zero in the embed).

In particular, I think that your clustering results are an artefact of how GPT-2 was trained and do not generalise to GPT-3.
Fun results! A key detail that helps explain these results is that in GPT-2 the embedding and unembedding are tied, meaning that the linear map from the final residual stream to the output logits, logits = final_residual @ W_U, is the transpose of the embedding matrix, ie W_U = W_E.T, where W_E[token_index] is the embedding of that token. But I believe that GPT-3 was not trained with tied embeddings, so it will have very different phenomena here.
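As a quick sanity check of the tied-weights claim, here's a minimal sketch assuming the HuggingFace transformers GPT-2 checkpoint (just an illustration, not part of the original analysis):

```python
# Minimal sketch (assumes HuggingFace `transformers`): in GPT-2 the unembedding
# reuses the embedding matrix, ie W_U = W_E.T in the notation above.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
W_E = model.transformer.wte.weight  # [vocab_size, d_model] token embedding matrix
lm_head_W = model.lm_head.weight    # lm_head computes logits = final_residual @ lm_head_W.T
print(W_E.data_ptr() == lm_head_W.data_ptr())  # expect True: the same underlying tensor, ie tied
```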
My mental model for what’s going on:

Let’s consider the case of untied embeddings first, so GPT-3:
For some stupid reason, the tokenizer has some irrelevant tokens that never occur in the training data. Your guesses seem reasonable here.
In OpenWebText, there are 99 tokens in GPT-2’s tokenizer that never occur, and a bunch that are crazy niche, like ' petertodd'.
Embed: Because these are never in the training data, the model completely doesn’t care about their embedding, and never changes them (or, if they occur very rarely, it does some random jank). This means they remain close to their random initialisation
Models are trained with weight decay, which incentivises these to be set to zero, but I believe that weight decay doesn’t apply to the embeddings
Models are not used to having tokens deleted from their inputs, and so deleting this breaks things, which isn’t that surprising.
OTOH, if they genuinely do normalise to norm 1 (for some reason), the tokens are probably just embedding to a weird bit of embedding space that the model doesn’t expect. I imagine this will still break things, but it might just let the model confuse it with a token that happens to be nearby? I don’t have great intuitions here
Unembed: Because these are never in the training data, the model wants to never predict them, ie have big negative logits. The two easiest ways to do this are to give them trivial weights and a big negative bias term, or big weights and align them with a bias direction in final residual stream space (ie, a direction that always has a high positive component, so it can be treated as approx constant).
Either way, the observed effect is that the model will never predict them, which totally matches what you see.
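A quick way to eyeball the "never predicts them" claim on GPT-2 itself is to check where a rare token ranks among the next-token logits. This is a rough sketch, not from the original discussion: it assumes HuggingFace transformers, the prompts are arbitrary, and it assumes ' petertodd' is still a single token in the tokenizer.

```python
# Rough sketch: check where GPT-2 ranks the rare token ' petertodd' among its
# next-token logits for a few arbitrary prompts. If the model has learned to
# never predict it, the rank should be near the bottom of the ~50k vocab.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

rare_ids = tokenizer.encode(" petertodd")
assert len(rare_ids) == 1, "assumed to be a single token in GPT-2's tokenizer"
rare_id = rare_ids[0]

for prompt in ["I just spoke to", "The author of the post is", "My favourite word is"]:
    input_ids = torch.tensor([tokenizer.encode(prompt)])
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    rank = int((logits > logits[rare_id]).sum())  # number of tokens with a higher logit
    print(f"{prompt!r}: ' petertodd' is ranked {rank} out of {logits.shape[0]}")
```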
As a cute demonstration of this, we can plot a scatter graph of log(freq_in_openwebtext+1) against unembed bias (which comes from the folded layernorm bias), coloured by the centered norm of the token embedding. We see that the unembed bias is mostly used to encode token frequency, but that at the tail end of rare tokens, some have tiny unembed norm and a big negative bias, and others have high unembed norm and a less negative bias.

The case of tied embeddings is messier, because the model wants to do these two very different things at once! But since, again, it doesn't care about the embedding at all (it's not that it wants the token's embedding to be close to zero, it's that there's never any gradient signal to update it), the effect will be dominated by what the unembed wants, which is getting their logits close to zero.
The unembed doesn’t care about the average token embedding, since adding a constant to every logit does nothing. The model wants a non-trivial average token embedding to use as a bias term (probably), so there’ll be a non-trivial average token embedding (as we see), but it’s boring and not relevant.
So the model’s embedding for the weird tokens will be optimised for giving a big negative logit in the unembedding, which is a weird and unnatural thing to do, and I expect is the seed of your weird results.
One important-ish caveat is that the unembed isn't quite the transpose of the embed. There's a LayerNorm immediately before the unembed, whose scale weights get folded into W_E.T to create an effective unembed (ie W_U_effective = w[:, None] * W_E.T), which breaks symmetry a bit. Hilariously, the model is totally accounting for this: if you plot the norm of the unembed and the norm of the embed against each other for each token, they track each other pretty well, except for the stupid rare tokens, which go wildly off to the side.
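To make the folding concrete, here's a rough sketch of computing the effective unembed, the per-token unembed bias used in the scatter plot above, and the two norms being compared. It assumes HuggingFace transformers; the OpenWebText frequency counts would be a separate array you'd have to compute yourself.

```python
# Rough sketch (assumes HuggingFace `transformers`): fold GPT-2's final LayerNorm
# into the tied unembedding, giving an effective unembed and a per-token logit bias.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
W_E = model.transformer.wte.weight.detach()   # [vocab, d_model] token embeddings
ln_f = model.transformer.ln_f                 # final LayerNorm before the unembed

# logits = LN(x) @ W_E.T = normalize(x) @ (w[:, None] * W_E.T) + b @ W_E.T
W_U_effective = ln_f.weight.detach()[:, None] * W_E.T   # [d_model, vocab] effective unembed
unembed_bias = ln_f.bias.detach() @ W_E.T                # [vocab] per-token logit bias

embed_norm = (W_E - W_E.mean(dim=0)).norm(dim=-1)        # centered embedding norm per token
unembed_norm = W_U_effective.norm(dim=0)                 # effective unembed norm per token
# Plotting embed_norm against unembed_norm (or unembed_bias against log token
# frequency, if you have OpenWebText counts) reproduces the plots described above.
```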
Honestly, I'm most surprised that GPT-3 uses the same tokenizer as GPT-2! There's a lot of random jank in there, and I'm surprised they didn't change it.

Another fun fact about tokenizers (god I hate tokenizers) is that they're formed recursively by finding the most common pair of existing tokens and merging those into a new token. This means that if you get, eg, common triples like ABC but never AB followed by anything other than C, you'll add in token AB, then token ABC, and retain the vestigial token AB, which could also create the stupid token behaviour. Eg " The Nitrome" is token 42,089 in GPT-2 and " TheNitromeFan" is token 42,090, not that either actually comes up in OpenWebText!
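Here's a toy sketch of that merging process, just to illustrate the vestigial-token effect. This is not the real GPT-2 BPE training (no byte-level details, no merge file), and the corpus is made up so that "AB" only ever appears inside "ABC".

```python
# Toy BPE-style merging: repeatedly merge the most common adjacent pair.
# Because "AB" only ever occurs inside "ABC", the intermediate token "AB"
# survives in the vocab even though it never appears in the tokenised corpus.
from collections import Counter

def most_common_pair(corpus):
    pairs = Counter()
    for seq in corpus:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair, new_token):
    merged_corpus = []
    for seq in corpus:
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(new_token)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        merged_corpus.append(merged)
    return merged_corpus

corpus = [list("ABCD"), list("ABCE"), list("ABCF")]  # "AB" only ever occurs inside "ABC"
vocab = set("ABCDEF")
for _ in range(2):
    pair = most_common_pair(corpus)
    new_token = "".join(pair)
    vocab.add(new_token)
    corpus = merge_pair(corpus, pair, new_token)

print(sorted(vocab, key=len))  # "AB" survives as a vestigial token alongside "ABC"
print(corpus)                  # but no standalone "AB" remains in the tokenised corpus
```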
To check how much of this depends on tied embeddings, you'd want to look at a model trained with untied embeddings. Sadly, all the ones I'm aware of (Eleuther's Pythia, and my interpretability-friendly models) were trained on the GPT-NeoX tokenizer or variants, which doesn't seem to have stupid tokens in the same way.