It was added recently and just included in a new release, so pip install transformer_lens should work now/soon (you want v1.16.0, I think); otherwise you can install from the GitHub repo
Neel Nanda
There’s been a fair amount of work on activation steering and similar techniques, with bearing on eg sycophancy and truthfulness, where you find the vector and inject it, eg Rimsky et al and Zou et al. It seems to work decently well. We found it hard to bypass refusal by steering and instead got it to work by ablation, which I haven’t seen much elsewhere, but I could easily be missing references
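The steering-vs-ablation distinction can be sketched numerically. This is my own toy illustration, not the paper's code: `refusal_dir` is a stand-in for a direction found by eg difference-of-means between contrasting prompts, and the dimensions and coefficient are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
act = rng.normal(size=512)                  # a residual-stream activation (illustrative)
refusal_dir = rng.normal(size=512)
refusal_dir /= np.linalg.norm(refusal_dir)  # unit vector for the direction of interest

# Activation steering: inject the direction with some chosen coefficient
steered = act + 4.0 * refusal_dir

# Directional ablation: project out the component along the direction,
# so the activation carries no information along it at this site
ablated = act - (act @ refusal_dir) * refusal_dir
```

After ablation the component along `refusal_dir` is exactly zero, whereas steering just shifts it; the comment above is saying the latter was not enough to bypass refusal, while the former was.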
First and foremost, this is interpretability work, not directly safety work. Our goal was to see if insights about model internals could be applied to do anything useful on a real world task, as validation that our techniques and models of interpretability were correct. I would tentatively say that we succeeded here, though less than I would have liked. We are not making a strong statement that addressing refusals is a high importance safety problem.
I do want to push back on the broader point though, I think getting refusals right does matter. I think a lot of the corporate censorship stuff is dumb, and I could not care less about whether GPT4 says naughty words. And IMO it’s not very relevant to deceptive alignment threat models, which I care a lot about. But I think it’s quite important for minimising misuse of models, which is also important: we will eventually get models capable of eg helping terrorists make better bioweapons (though I don’t think we currently have such), and people will want to deploy those behind an API. I would like them to be as jailbreak proof as possible!
Refusal in LLMs is mediated by a single direction
Re dictionary width: 2**17 (~131K) for most Gated SAEs, 3*(2**16) for baseline SAEs, except for the (Pythia-2.8B, Residual Stream) sites, where we used 2**15 for Gated and 3*(2**14) for baseline, since early runs of these had lots of feature death. (This’ll be added to the paper soon, sorry!) I’ll leave the other Qs for my co-authors
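Written out explicitly, the widths quoted above come to:

```python
# Dictionary widths from the comment above, as plain numbers
gated_width = 2**17              # 131072 (~131K), most Gated SAE runs
baseline_width = 3 * (2**16)     # 196608, baseline SAE runs
# smaller (Pythia-2.8B, residual stream) runs, due to early feature death:
gated_width_small = 2**15        # 32768
baseline_width_small = 3 * (2**14)  # 49152
```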
I haven’t fully worked through the maths, but I think both IG and attribution patching break down here? The fundamental problem is that the discontinuity is invisible to IG because it only takes derivatives. Eg the ReLU and Jump ReLU below look identical from the perspective of IG, but not from the perspective of activation patching, I think.
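A toy numerical sketch of this point (my own illustration, with made-up functions and thresholds, not anything from the paper): a shifted ReLU and a JumpReLU share the same derivative almost everywhere, so a gradient-only method like Integrated Gradients (IG) cannot tell them apart, while activation patching, which compares outputs directly, can.

```python
import numpy as np

theta = 1.0  # jump threshold / kink location (illustrative)

def shifted_relu(x):
    return np.maximum(0.0, x - theta)

def jump_relu(x):
    return np.where(x > theta, x, 0.0)

def grad_ae(x):
    # a.e. derivative of BOTH functions: 1 above theta, 0 below.
    # The Dirac mass at the jump discontinuity is invisible here.
    return (x > theta).astype(float)

def integrated_gradients(grad, baseline, x, steps=10_000):
    # IG = (x - baseline) * average gradient along the straight-line path
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas * (x - baseline)
    return (x - baseline) * grad(path).mean()

baseline, x = 0.0, 2.0
ig = integrated_gradients(grad_ae, baseline, x)         # 1.0 for both functions
patch_shift = shifted_relu(x) - shifted_relu(baseline)  # 1.0
patch_jump = jump_relu(x) - jump_relu(baseline)         # 2.0: sees the jump
```

IG assigns the same attribution (1.0) to both functions, but patching the input from baseline to x gives effects of 1.0 vs 2.0: the jump contributes to the patching effect but never to any gradient, which is the sense in which the discontinuity is invisible to IG.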
From the title I expected this to be embarrassing for Eliezer, but that was actually extremely sweet, and good advice!
Great work! Obviously the results here speak for themselves, but I especially wanted to compliment the authors on the writing. I thought this paper was a pleasure to read, and easily a top 5% exemplar of clear technical writing. Thanks for putting in the effort on that.
<3 Thanks so much, that’s extremely kind. Credit entirely goes to Sen and Arthur, which is even more impressive given that they somehow took this from a blog post to a paper in a two week sprint! (including re-running all the experiments!!)
Improving Dictionary Learning with Gated Sparse Autoencoders
How to use and interpret activation patching
[Full Post] Progress Update #1 from the GDM Mech Interp Team
[Summary] Progress Update #1 from the GDM Mech Interp Team
It seems like we have a significant need for orgs like METR or the DeepMind dangerous capabilities evals team trying to operationalise these evals, but also for regulators with authority building on that work to set them as explicit and objective standards. The latter feels maybe more practical for NIST to do, especially under Paul?
Thanks for the clear explanation, Mamba is more cursed and less Transformer like than I realised! And thanks for creating and open sourcing Mamba Lens, it looks like a very useful tool for anyone wanting to build on this stuff
Each element of the matrix, denoted as $A_{ij}$, is constrained to the interval $[0, 1)$; that is, $0 \le A_{ij} < 1$ for all $i, j$, where $i$ indexes the query positions and $j$ indexes the key positions.
Why is this strictly less than 1? Surely if the dot product is 1.1 and you clamp, it gets clamped to exactly 1
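A quick numeric check of this point (the bounds here are illustrative, chosen to match the example in the question):

```python
import numpy as np

# Clamping a dot product of 1.1 to an upper bound of 1 yields exactly 1,
# so the bound is attained and the interval should be closed at 1
clamped = np.clip(1.1, -1.0, 1.0)
```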
Oh nice, I didn’t know Evan had a YouTube channel. He’s one of the most renowned olympiad coaches and seems highly competent
Thanks! I read and enjoyed the book based on this recommendation
I’m in favour of people having hobbies and fun projects to do in their downtime! That seems good and valuable for impact over the longterm, rather than thinking that every last moment needs to be productive
Thanks! Broadly agreed
I’d be curious to hear more about what you meant by this