+1, I started reading this because I thought it was about RadVac
Thanks, I really enjoyed this post—this was a novel but persuasive argument for not using binary predictions, and I now feel excited to try it out!
One quibble: when you discuss calculating your calibration, doesn’t this implicitly assume that your mean was accurate? If my mean is very off but my standard deviation is correct, then this method says my standard deviation is way too low. But maybe this is fine: if I have a history of getting the mean wrong, I should have a wider distribution?
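To make the worry concrete, here's a minimal sketch (Python, with made-up numbers — the setup and all parameters are just my illustration, not anything from the post) of what a coverage-style calibration check reports when the mean is biased but the standard deviation is right:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy setup (all numbers made up): for each question I predict N(0, 1).
# My SD of 1 matches the true spread, but my mean is off by `bias`.
n, bias, sd = 100_000, 1.0, 1.0
truth = rng.normal(bias, sd, n)

# Where does each outcome land within my predicted distribution?
pct = stats.norm.cdf(truth, loc=0.0, scale=sd)

# A standard calibration check: how often does the truth fall inside my
# central 50% and 90% intervals?
for level in (0.5, 0.9):
    lo = (1 - level) / 2
    coverage = np.mean((pct > lo) & (pct < 1 - lo))
    print(f"nominal {level:.0%} interval -> actual coverage {coverage:.0%}")

# Prints roughly 33% and 74% coverage: the check reads this as "your SD
# is too low", even though only the mean was biased.
```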
Thanks for the feedback! That makes sense, I’ve updated the intro paragraph to that section to:
There are a range of agendas proposed for how we might build safe AGI, though note that each agenda is far from a complete and concrete plan. I think of them more as a series of confusions to explore and assumptions to test, with the eventual goal of making a concrete plan. I focus on three agendas here; these are just the three I know the most about, have seen the most work on, and, in my subjective judgement, think are most worth newcomers to the field learning about. This is not intended to be comprehensive; see eg Evan Hubinger’s Overview of 11 proposals for building safe advanced AI for more.
Does that seem better?
For what it’s worth, my main bar was a combination of ‘do I understand this agenda well enough to write a summary’ and ‘do I associate at least one researcher and some concrete work with this agenda’. I wouldn’t think of corrigibility as passing the second bar, since I’ve only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems. It’s very possible I’ve missed out on some important work though, and I’d love to hear pushback on this.
Thanks a lot for the feedback, and the Anki cards! Appreciated. I definitely find that level of feedback motivating :)
These categories were formed by a vague combination of “what things do I hear people talking about/researching” and “what do I understand well enough that I can write intelligent summaries of it”—this is heavily constrained by what I have and have not read! (I am nowhere near as good as Rohin Shah at reading everything in Alignment :’( )
Eg, Steve Byrnes does a bunch of research that seems potentially cool, but I haven’t read much of it and don’t have a good sense of what it’s actually about, so I didn’t talk about it. And this is not expressing an opinion that, eg, his research is bad.
I’ve updated towards including a section at the end of each post/section with “stuff that seems maybe relevant that I haven’t read enough to feel comfortable summarising”.
Thanks for the appreciation!
If you’re trying to make it more legible to outsiders, you should consider defining AGI at the top.
Good idea, I just added this note to the top:
Terminology note: There is a lot of disagreement about what “intelligence”, “human-level”, “transformative” or AGI even mean. For simplicity, I will use AGI as a catch-all term for ‘the kind of powerful AI that we care about’. If you find this unsatisfyingly vague, OpenPhil’s definition of Transformative AI is my favourite precise alternative.
Thanks! I’m probably not going to have time to write a top-level post myself, but I liked Evan Hubinger’s post about it.
I do wonder if vision problems are unusually tractable here; would it be so easy to visualise what individual neurons mean in a language model?
We actually released our first paper trying to extend Circuits from vision to language models yesterday! You can’t quite interpret individual neurons, but we’ve found some cases where we can interpret what an individual attention head is doing.
I really love the essay Visual Information Theory.
Self review: I’m very flattered by the nomination!
Reflecting back on this post, a few quick thoughts:
I put a lot of effort into getting better at teaching, especially during my undergrad (publishing notes, mentoring, running lectures, etc). In hindsight, this was an amazing use of time, and has been shockingly useful in a range of areas. It makes me much better at field-building, facilitating fellowships, and writing up thoughts. Recently I’ve been reworking the pedagogy for explaining transformer interpretability work at Anthropic, and I’ve been shocked at how relevant all of this is.
A related idea is that of the Pareto Frontier. Most people are bad at teaching; this leads to, eg, Research Debt in academia. I’m a pretty great teacher, but not exactly world-class. But I’m a great mathematician, and trying to become a great AI Safety researcher, and there are very, very few people who are great at both—this gives me a lot of room to explore my comparative advantage by eg writing field-building docs.
I wish I’d better emphasised just how useful a skill this is.
A lot of the post centres on teaching in specific contexts. This is reasonable, since it’s what I know, but I wish I’d better clarified what would and would not generalise—I’m afraid people who see this post will bounce off because it doesn’t seem relevant to them.
I wish I’d given more caveats about teaching gone wrong. My experience teaching younger people who view me as high-status is that it’s very easy to appear over-confident. I try to caveat what I say, but I tend to present as fairly confident, and people often take me way too seriously. While the techniques I present here are very effective at teaching, they have the flipside of better inserting my knowledge into the student’s system 1 and bypassing some of their mental filters, which can be bad and eg lead to groupthink and lowered agency.
Some, such as the Socratic method, are better on this front, by at least giving me chances to notice if what I’m teaching is wrong.
Sometimes it may be good to deliberately be a bad teacher, to teach the students agency and give them room to grow on their own and to form their own ideas. It’s worth checking for this—I just reflexively use good teaching technique nowadays, and it’s hard to suppress.
Some ideas, such as the knowledge graph, are vague intuitions that it would have been good to operationalise more.
With all that said, I’d only been blogging for 3 weeks when I wrote this post, and I wrote it in an afternoon, so I’m really happy with this as an artefact to come out of that! I am so, so happy I decided to do a month of daily blogging
What fraction of these fizzled out because they were displaced by a fitter variant vs just not spreading further? That seems very important for figuring out how much to freak out.
+1, I was pretty surprised and confused by the 37% stat. If basically all of the labour here comes from taxpayer funded science, where on earth is 63% of the revenue going?!
Thanks for the post! I love a lot of these, and hadn’t come across some of them before :)
Google Docs quick create. Shortcut key or single click to automatically create a new Google document or spreadsheet. Saves a ton of time.
The URLs doc.new and sheet.new also do this, and are pretty low friction (though not quite single click!). They work on any computer, though.
Quickcompose. You know how easy it is to get distracted by your inbox when you need to send an email? Quick compose makes it so that you can open up a window that’s just a compose window so you can’t get distracted by new emails.
I really like the extension Inbox When Ready - it hides your inbox by default, unless you click on the ‘show inbox’ button. This is enough to reduce ‘compulsively open email and check things’, as well as giving this functionality.
I feel like I make enough minor edits to my comments (typos etc) that this would be really annoying—I’d feel significantly more constrained in my ability to make edits, because I’d know it would spam people with notifications. Maybe having a “send notifications?” toggle would help.
As a counter-point, my day was made significantly better by the front page being nuked in 2020 - it was exciting, novel, hilarious (by my lights—clearly not to some people), made some excellent points about phishing and security, and gave me opportunities to dissect why people oriented to this event differently from me. I expect my experience would have been less good last year had the phishing attempt not happened and we had all simply coordinated. More generally, when a website does something unusual and novel like this, I feel like the value of novelty and interestingness can outweigh the costs of a single day of disrupted use?
I’d further argue that the people highly invested in this seem much more invested in the abstract ideas of trust, community, shared ritual and cohesion than in the object level of the frontpage being down (besides, people can always use greaterwrong.com).
If it helps, here’s a comment I wrote last year trying to narrate my internal experience of reading the email (I then read the 2019 threads and eventually twigged how seriously people took it, but that was strongly not my prior—it wouldn’t even have occurred to me to ask the question ‘do people take this more seriously than a game?’).
I was one of the 270 last year and am one of the 100 this year, and I did not understand the context last year. Empirically, neither did Chris. Multiple people on the EA Forum have commented about not understanding the context.