Arthur Conmy

Karma: 1,064

Interpretability

Views my own

Arthur Conmy 1 May 2024 1:06 UTC
LW: 2 AF: 1
0
AF
in reply to: leogao’s comment on: Improving Dictionary Learning with Gated Sparse Autoencoders
Ah yeah, Neel’s comment makes no claims about feature death beyond Pythia 2.8B residual streams. I trained 524K width Pythia-2.8B MLP SAEs with <5% feature death (not in paper), and Anthropic’s work gets to >1M live features (with no claims about interpretability) which together would make me surprised if 131K was near the max of possible numbers of live features even in small models.

Arthur Conmy 1 May 2024 1:02 UTC
LW: 2 AF: 1
0
AF
in reply to: leogao’s comment on: Improving Dictionary Learning with Gated Sparse Autoencoders
I don’t think zero ablation is that great a baseline. We’re mostly using it for continuity’s sake with Anthropic’s prior work (and also it’s a bit easier to explain than a mean ablation baseline which requires specifying where the mean is calculated from). In the updated paper https://arxiv.org/pdf/2404.16014v2 (up in a few hours) we show all the CE loss numbers for anyone to scale how they wish.
I don’t think compute efficiency hit^[1] is ideal. It’s really expensive to compute, since you can’t just calculate it from an SAE alone as you need to know facts about smaller LLMs. It also doesn’t transfer as well between sites (splicing in an attention layer SAE doesn’t impact loss much, splicing in an MLP SAE impacts loss more, and residual stream SAEs impact loss the most). Overall I expect it’s a useful expensive alternative to loss recovered, not a replacement.

EDIT: on consideration of Leo’s reply, I think my point about transfer is wrong; a metric like “compute efficiency recovered” could always be created by rescaling the compute efficiency number.
[1] What I understand “compute efficiency hit” to mean is: for a given (SAE, $LM_{1}$) pair, how much less compute you’d need (as a multiplier) to train a different LM, $LM_{2}$, such that $LM_{2}$ gets the same loss as $LM_{1}$-with-the-SAE-spliced-in.
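To make the baseline question concrete, here is a minimal sketch of how a "loss recovered" score depends on the ablation baseline it is scaled against. It assumes the commonly used definition (the fraction of the gap between the ablated and clean CE loss that the SAE closes), which may differ in detail from the paper's. The function name and all numbers are hypothetical and are not taken from the paper.

```python
def loss_recovered(ce_clean, ce_spliced, ce_ablated):
    """Fraction of the ablation-induced loss gap that the SAE recovers.

    1.0 means splicing in the SAE costs nothing; 0.0 means it is as bad
    as the ablation baseline. (Assumed definition, see lead-in.)
    """
    gap = ce_ablated - ce_clean
    if gap <= 0:
        raise ValueError("ablation baseline must have higher loss than the clean model")
    return (ce_ablated - ce_spliced) / gap


# Illustrative CE losses only (hypothetical, not from the paper).
ce_clean, ce_spliced = 2.0, 2.2
ce_zero_ablation, ce_mean_ablation = 10.0, 4.0

score_vs_zero = loss_recovered(ce_clean, ce_spliced, ce_zero_ablation)
score_vs_mean = loss_recovered(ce_clean, ce_spliced, ce_mean_ablation)
print(score_vs_zero, score_vs_mean)
```

In this toy case the same SAE scores higher against zero ablation than against mean ablation, which is why having the raw CE loss numbers lets readers rescale however they prefer.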

Arthur Conmy 1 May 2024 0:49 UTC
LW: 2 AF: 1
0
AF
in reply to: leogao’s comment on: Improving Dictionary Learning with Gated Sparse Autoencoders
I’m not sure what you mean by “the reinitialization approach” but feature death doesn’t seem to be a major issue at the moment. At all sites besides L27, our Gemma-7B SAEs didn’t have much feature death at all (stats at https://arxiv.org/pdf/2404.16014v2 up in a few hours), and also the Anthropic update suggests even in small models the problem can be addressed.

Arthur Conmy 30 Apr 2024 22:58 UTC
LW: 24 AF: 16
25
AF
in reply to: Arthur Conmy’s comment on: Refusal in LLMs is mediated by a single direction
The “This should be cited” part of Dan H’s comment was edited in after the author’s reply. I think this is in bad faith since it masks an accusation of duplicate work as a request for work to be cited.

On the other hand the post’s authors did not act in bad faith since they were responding to an accusation of duplicate work (they were not responding to a request to improve the work).

(The authors made me aware of this fact)

Arthur Conmy 29 Apr 2024 21:25 UTC
11 points
7
on: Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
Awesome work! I notice I am surprised that this just worked given just 1M datapoints (we use 1000x this with LMs, even small ones), and not needing any new techniques, and producing subjectively extremely abstract features (IMO).
It would be nice if the “guess the image” game was presented as a result rather than a fun side thing in this post. AFAICT that’s the only interpretability result that can’t be critiqued as cherry-picked. You should state front and center that the top features for arbitrary images are basically interpretable, it’s a great result!

Arthur Conmy 29 Apr 2024 18:17 UTC
3 points
0
in reply to: Dan Braun’s comment on: Improving Dictionary Learning with Gated Sparse Autoencoders
Thanks for the feedback. We will put up an update to the paper with all these numbers in tables tomorrow night. For now I have sent them to you (and can send them to anyone else who wants them in the next 24 hours).

Arthur Conmy 28 Apr 2024 23:57 UTC
2 points
0
in reply to: magnetoid’s comment on: Refusal in LLMs is mediated by a single direction
+1 to Neel. We just fixed a release bug and now pip install transformer-lens should install 1.16.0 (worked in a colab for me)

Arthur Conmy 28 Apr 2024 16:59 UTC
LW: 30 AF: 16
7
AF
in reply to: Dan H’s comment on: Refusal in LLMs is mediated by a single direction
I think this discussion is sad, since it seems both sides assume bad faith from the other side. On one hand, I think Dan H and Andy Zou have improved the post by suggesting writing about related work, and signal-boosting the bypassing refusal result, so should be acknowledged in the post (IMO) rather than downvoted for some reason. I think that credit assignment was originally done poorly here (see e.g. “Citing others” from this Chris Olah blog post), but the authors resolved this when pushed.
But on the other hand, “Section 6.2 of the RepE paper shows exactly this” and accusations of plagiarism seem wrong @Dan H. Changing experimental setups and scaling them to larger models is valuable original work.

(Disclosure: I know all authors of the post, but wasn’t involved in this project)

(ETA: I added the word “bypassing”. Typo.)

Arthur Conmy 27 Apr 2024 0:03 UTC
LW: 5 AF: 4
0
AF
in reply to: leogao’s comment on: Improving Dictionary Learning with Gated Sparse Autoencoders
We use learning rate 0.0003 for all Gated SAE experiments, and also the GELU-1L baseline experiment. We swept for optimal baseline learning rates on GELU-1L for the baseline SAE to generate this value.
For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by $E\|x\|_{2}$, motivated by wanting better hyperparameter transfer, and so changed the learning rate to 0.001 or 0.00075 for all the runs (currently in Figure 1, only attention output pre-linear uses 0.00075; in the rerelease we’ll state all the values used). We didn’t see a noticeable difference in the Pareto frontier when changing between 0.001 and 0.00075, so did not sweep the baseline hyperparameter further than this.

Arthur Conmy 25 Apr 2024 21:29 UTC
LW: 3 AF: 2
2
AF
in reply to: Sam Marks’s comment on: Improving Dictionary Learning with Gated Sparse Autoencoders
Oh oops, thanks so much. We’ll update the paper accordingly. Nit: it’s actually
$\frac{E_{x \sim D}[\hat{x} \cdot x]}{E_{x \sim D}[\|x\|_{2}^{2}]}$

(it’s just minimizing a quadratic)

ETA: the reason we have complicated equations is that we didn’t compute $E_{x \sim D}[\hat{x} \cdot x]$ during training (this quantity is kinda weird). However, you can compute $\gamma$ from quantities that are usually tracked in SAE training. Specifically, $\gamma = \frac{1}{2} \left(1 + \frac{E[\|\hat{x}\|_{2}^{2}] - E[\|x - \hat{x}\|_{2}^{2}]}{E[\|x\|_{2}^{2}]}\right)$ and all terms here are clearly helpful to track in SAE training.
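The identity follows from expanding $\|x - \hat{x}\|_{2}^{2} = \|x\|_{2}^{2} - 2\,\hat{x} \cdot x + \|\hat{x}\|_{2}^{2}$. As a quick numerical sanity check, the sketch below compares the direct ratio with the tracked-quantities form. The "reconstructions" are hypothetical stand-ins (a shrunk, noisy copy of $x$), not real SAE outputs, and all variable names are illustrative.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


random.seed(0)
dim, n = 8, 1000
xs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
# Hypothetical reconstructions: x shrunk by 0.8 plus small noise.
x_hats = [[0.8 * v + random.gauss(0, 0.1) for v in x] for x in xs]
errs = [[a - b for a, b in zip(x, h)] for x, h in zip(xs, x_hats)]

e_x_sq = mean(dot(x, x) for x in xs)           # E[||x||^2]
e_xhat_sq = mean(dot(h, h) for h in x_hats)    # E[||x_hat||^2]
e_err_sq = mean(dot(e, e) for e in errs)       # E[||x - x_hat||^2]

# Direct form: E[x_hat . x] / E[||x||^2]
gamma_direct = mean(dot(h, x) for x, h in zip(xs, x_hats)) / e_x_sq
# Tracked-quantities form from the comment above.
gamma_tracked = 0.5 * (1 + (e_xhat_sq - e_err_sq) / e_x_sq)

print(gamma_direct, gamma_tracked)
```

The two values agree up to floating-point error, and in this toy setup both land close to the 0.8 shrinkage factor that was baked into the fake reconstructions.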

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda

25 Apr 2024 18:43 UTC

61 points

35 comments · 1 min read · LW link

(arxiv.org)

Arthur Conmy 21 Apr 2024 16:24 UTC
1 point
0
in reply to: Nina Rimsky’s comment on: [Full Post] Progress Update #1 from the GDM Mech Interp Team
We haven’t tried this yet. Thanks, that’s a good hypothesis.
I suspect that the mean centering paper https://arxiv.org/abs/2312.03813 is just cancelling the high frequency features, and if so this is a good explanation for why taking differences is important in activation steering.
(Though it doesn’t explain why the SAEs learn several high frequency features when trained on the residual stream)
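A toy illustration of the arithmetic behind this hypothesis, under loudly stated assumptions: every activation is modelled as a large shared offset (a stand-in for always-on, high-frequency features) plus a small class-specific signal along one axis. The setup, names, and numbers are all hypothetical and are not drawn from the linked paper or from real model activations.

```python
import random

random.seed(0)
dim, n = 16, 500
# Assumed shared component present in every activation.
common = [5.0] * dim


def sample(sign):
    """One fake activation: shared offset + noise + class signal on axis 0."""
    return [
        c + random.gauss(0, 0.1) + (sign if i == 0 else 0.0)
        for i, c in enumerate(common)
    ]


def mean_vec(rows):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


pos = [sample(+1.0) for _ in range(n)]
neg = [sample(-1.0) for _ in range(n)]

mu_pos, mu_neg = mean_vec(pos), mean_vec(neg)
# Difference of class means: the shared offset cancels, the signal remains.
diff = [p - q for p, q in zip(mu_pos, mu_neg)]

print([round(v, 2) for v in mu_pos[:3]])
print([round(v, 2) for v in diff[:3]])
```

Here the raw class mean is dominated by the shared offset on every coordinate, while the difference of means keeps only the class-specific axis. Whether real high-frequency SAE features behave like this shared offset is exactly the open question.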

Arthur Conmy 21 Apr 2024 16:22 UTC
2 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: [Full Post] Progress Update #1 from the GDM Mech Interp Team
Yes, pretty much.
There’s some work on transferring steering vecs, e.g. the Llama-2 steering paper (https://arxiv.org/abs/2312.06681) shows that you can transfer steering vecs from base to chat model, and I saw results at a Hackathon once that suggested you can train resid stream SAEs on early layers and transfer them to some later layers, too. But retraining is likely what our follow up work will do (this post only used two different SAEs)

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lsgos, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

71 points

8 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lsgos, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

68 points

0 comments · 3 min read · LW link

Arthur Conmy 15 Mar 2024 17:36 UTC
1 point
0
on: Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features
Why is CE loss >= 5.0 everywhere? Looking briefly at GELU-1L over 128 positions (a short sequence length!) I see our models get 4.3 CE loss. 5.0 seems really high?

Ah, I see your section on this, but I doubt that bad data explains all of this. Are you using a very small sequence length, or an odd dataset?

Arthur Conmy 8 Mar 2024 20:36 UTC
12 points
7
on: When and why did ‘training’ become ‘pretraining’?
> From my perspective this term appeared around 2021 and became basically ubiquitous by 2022

I don’t think this is correct. To add to Steven’s answer, in the “GPT-1” paper from 2018 the abstract discusses
> ...generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task

and the assumption at the time was that the finetuning step was necessary for the models to be good at a given task. This assumption persisted for a long time, with academics finetuning BERT on tasks that GPT-3 would eventually significantly outperform them on. You can tell this from how cautious the GPT-1 authors are about claiming the base model could do anything, and they sound very quaint:

> We’d like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

robertzk, Connor Kissane, Arthur Conmy and Neel Nanda

6 Mar 2024 5:03 UTC

56 points

0 comments · 12 min read · LW link

Arthur Conmy 11 Feb 2024 0:25 UTC
3 points
0
in reply to: Sam Marks’s comment on: Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small
The fact that Pythia generalizes to longer sequences but GPT-2 doesn’t isn’t very surprising to me—getting long context generalization to work is a key motivation for rotary, e.g. the original paper https://arxiv.org/abs/2104.09864

Arthur Conmy 7 Feb 2024 22:26 UTC
1 point
0
on: Some open-source dictionaries and dictionary learning infrastructure
Do you apply LR warmup immediately after doing resampling (i.e. immediately reducing the LR, and then slowly increasing it back to the normal value)? In my GELU-1L blog post I found this pretty helpful (in addition to doing LR warmup at the start of training)
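A minimal sketch of the schedule described here, assuming a linear warmup that restarts at every resampling step. The function name, the linear shape, and the step counts are assumptions for illustration, not the exact schedule from the GELU-1L blog post.

```python
def learning_rate(step, base_lr, warmup_steps, resample_steps=()):
    """Linear warmup at the start of training and again after each resample.

    `resample_steps` lists the training steps at which dead features were
    resampled (hypothetical parameter). The LR drops right after each one
    and climbs back to `base_lr` over `warmup_steps` steps.
    """
    if warmup_steps < 1:
        raise ValueError("warmup_steps must be at least 1")
    # Most recent restart: step 0 or the latest resample at or before `step`.
    restarts = [0] + [s for s in resample_steps if s <= step]
    since_restart = step - max(restarts)
    scale = min(1.0, (since_restart + 1) / warmup_steps)
    return base_lr * scale


# Illustrative values only.
base_lr, warmup = 3e-4, 100
schedule = [learning_rate(s, base_lr, warmup, resample_steps=(1000,))
            for s in (0, 99, 999, 1000, 1050, 1099)]
print(schedule)
```

The LR sits at `base_lr` just before the resample at step 1000, drops to a small fraction of it immediately afterwards, and is back to the full value 100 steps later.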
