Thanks!
Neel Nanda
BatchTopK: A Simple Improvement for TopK-SAEs
JumpReLU SAEs + Early Access to Gemma 2 SAEs
Fair point, I’ve been procrastinating on putting out an updated version (and don’t have anything else I back enough to want to recommend in its place; I haven’t read this post closely enough yet), but adding that note to the top seems reasonable
Thanks for writing this post! For the avoidance of confusion, my MATS stream has a very different admissions process that is heavily based on a work task and doesn’t have interviews (and weights quite different things). See more details here: https://tinyurl.com/neel-mats-app
Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.
Can confirm, that list is SO out of date and does not represent the current frontiers. Zero offence taken. Thanks for publishing this list!
SAEs (usually) Transfer Between Base and Chat Models
Imo they’re just completely different techniques, which aren’t really comparable. Activation patching is about understanding the difference between two activations by patching one to replace the other and seeing what happens. SAEs are a technique for decomposing an activation into interpretable pieces
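To make the contrast concrete, here is a minimal toy sketch in plain Python. Everything in it is an assumption for illustration: the two-layer "model", the hand-picked weights, and the helper names (`run`, `sae_encode`, `sae_decode`) are hypothetical and not taken from any real model or SAE library. The first half patches a clean activation into a corrupted run and looks at the output; the second half decomposes that same activation into sparse features and reconstructs it.

```python
# Toy illustration only: hand-picked weights, no real model or SAE library.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# --- Activation patching on a hypothetical two-layer toy model ---
W1 = [[1.0, 0.0], [0.0, 1.0]]   # input -> hidden
W2 = [[1.0, -1.0]]              # hidden -> scalar output

def run(x, patched_hidden=None):
    """Run the toy model; optionally overwrite the hidden activation."""
    hidden = relu(matvec(W1, x))
    if patched_hidden is not None:
        hidden = patched_hidden  # the "patch": swap in another run's activation
    return hidden, matvec(W2, hidden)[0]

clean_hidden, clean_out = run([2.0, 0.0])
_, corrupt_out = run([0.0, 3.0])
# Patch the clean hidden activation into the corrupted run and see what changes.
_, patched_out = run([0.0, 3.0], patched_hidden=clean_hidden)

# --- SAE-style decomposition of a single activation ---
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 features from a 2-d activation
b_enc = [0.0, 0.0, -4.0]
W_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]     # columns are feature directions

def sae_encode(act):
    """Simple encoder: one linear map, a bias, and a ReLU."""
    return relu([p + b for p, b in zip(matvec(W_enc, act), b_enc)])

def sae_decode(feats):
    """Reconstruct the activation as a weighted sum of feature directions."""
    return matvec(W_dec, feats)

features = sae_encode(clean_hidden)
reconstruction = sae_decode(features)

print(clean_out, corrupt_out, patched_out)
print(features, reconstruction)
```

Patching answers a question about the difference between two runs (here, restoring the clean hidden activation restores the clean output), while the SAE half only ever looks at one activation and asks which features it is made of.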
Interesting! You might be interested in a post from my team on inference-time optimization
It’s not clear to me what the right call here is though, because you want f to be something the model could extract. The encoder being so simple is in some ways a feature, not a bug—I wouldn’t want it to be eg a deep model, because the LLM can’t easily extract that!
Stitching SAEs of different sizes
I’m pleasantly surprised by how short the Google DeepMind section is. How much do you think readers should read into that, vs eg “you’re in the Bay and hear more about Bay Area drama” or “you didn’t try very hard for GDM”
I don’t quite understand the question. I’ve heard various bits of gossip, both as an employee and now. I wouldn’t say I’m confident in my understanding of any of it. I was somewhat sad about Jack and Dario’s public comments about thinking it’s too early to regulate (if I understood them correctly), which I also found surprising as I thought they had fairly short timelines, but policy is not at all my area of expertise so I am not confident in this take.
I think it’s totally plausible Anthropic has net negative impact, but the same is true for almost any significant actor in a complex situation. I agree that policy is one such way that their impact could be negative, though I’d generally bet Anthropic will push more for policies I personally support than any other lab, even if they may not push as much as I want them to.
This is true. I signed a concealed non-disparagement when I left Anthropic in mid 2022. I don’t have clear evidence this happened to anyone else (but that’s not strong evidence of absence). More details here
EDIT: I should also clarify that I personally don’t think Anthropic acted that badly, and recommend reading about what actually happened before forming judgements. I do not think I am the person referred to in Habryka’s comment.
I can confirm that my concealed non-disparagement was very explicit that I could not discuss the existence or terms of the agreement; I don’t see any way I could be misinterpreting this. (But I have now kindly been released from it!)
EDIT: It wouldn’t massively surprise me if Sam just wasn’t aware of its existence though
In response to Habryka’s shortform, I can confirm that I signed a concealed non-disparagement as part of my Anthropic separation agreement. I worked there for 6 months and left in mid 2022. I received a cash payment as part of that agreement, with nothing shady going on a la threatening previous compensation (though I had no equity to threaten). In hindsight I undervalued my ability to speak freely, and didn’t seriously enough consider that I could just decline to sign the separation agreement and walk away. I’m not sure what I would do if I were doing it again.
I asked Anthropic to release me from this after the comment thread started, and they have now released me from both the non-disparagement clause, and the non-disclosure part, which was very nice of them—I would encourage anyone in a similar situation to reach out to hr[at]anthropic.com and legal[at]anthropic.com, though obviously can’t guarantee that they’ll release everyone
I’ll take advantage of my newfound freedoms to say that...
Idk, I don’t really have anything too disparaging to say (though I dislike the use of concealed non-disparagements in general and am glad they say they’re stopping!). I’m broadly a fan of Anthropic, think their heart is likely in the right place and they’re trying to do what’s best for the world (though could easily be making the wrong calls) and would seriously consider returning in the right circumstances. I’ve recommended that several friends of mine accept offers to do safety and interp work there, and feel good about this (though would feel much more hesitant about recommending someone join a pure capabilities team there). My biggest critique is that I have concerns about their willingness to push the capabilities frontier and worsen race dynamics and, while I can imagine reasonable justifications, I think they’re undervaluing the importance of at least having clear public positions and rationales for this kind of thing and their clear shift in policies since Claude 1.0
EDIT: An additional detail that I genuinely appreciate is that Anthropic paid for me to have an independent lawyer to help explain the separation agreement and negotiate some changes on my behalf (I didn’t push back on the concealed non-disparagement, but did alter some other parts). They recommended an independent lawyer, who I used, but were also happy to pay for a lawyer of my choice. As far as I’m aware, this was quite a non-standard thing for a company to do, and I appreciate it and think this was good and ethical in a way that wasn’t obligatory.
EDIT 2: Someone asked that I share the terms of the agreement.
The non-disparagement clause:
Without prejudice to clause 6.3 [referring to my farewell letter to Anthropic staff, which I don’t think was disparaging or untrue, but to be safe], each party agrees that it will not make or publish or cause to be made or published any disparaging or untrue remark about the other party or, as the case may be, its directors, officers or employees. However, nothing in this clause or agreement will prevent any party to this agreement from (i) making a protected disclosure pursuant to Part IVA of the Employment Rights Act 1996 and/or (ii) reporting a criminal offence to any law enforcement agency and/or a regulatory breach to a regulatory authority and/or participating in any investigation or proceedings in either respect.
The non-disclosure clause:
Without prejudice to clause 6.3 [referring to my farewell letter to Anthropic staff] and 7 [about what kind of references Anthropic could provide for me], both Parties agree to keep the terms and existence of this agreement and the circumstances leading up to the termination of the Consultant’s engagement and the completion of this agreement confidential save as [a bunch of legal boilerplate, and two bounded exceptions I asked for but would rather not publicly share. I don’t think these change anything, but feel free to DM if you want to know]
Neel Nanda’s Shortform
Thanks! That was copied from the previous post, and I think this is fair pushback, so I’ve hedged the claim to “one of the most”. Does that seem reasonable?
I haven’t deeply engaged enough with those three papers to know if they meet my bar for recommendation, so I’ve instead linked to your comment from the post
I found this comment very helpful, and also expected probing to be about as good, thank you!
Totally different