# Arthur Conmy

Karma: 307
• That’s true. I think the choice to mix in pretraining gradients probably has more effect on the end model than the overfit SFT model they start PPO with.

Huh, but Mysteries of mode collapse (and the update) were published before td-003 was released? How would you have ended up reading a post claiming td-002 was RLHF-trained when td-003 existed?

Meta note: it’s plausibly net positive that all the training details of these models have been obfuscated, but it’s frustrating how much energy has been sunk into speculation on The Way Things Work Inside OpenAI.

• I wasn’t trying to say the mode collapse results were wrong! I collected these results before finding crisper examples of mode collapse that I could build a useful interpretability project on. I also agree with the remarks made about the difficulty of measuring this phenomenon. I did try to use the OpenAI embeddings model to encode the various completions and then hopefully have the Euclidean distance be informative, but it seemed to predict large distances for similar completions, so I gave up. I also made a consistent color scheme and compared code-davinci, thanks for those suggestions.
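
A minimal sketch of that kind of check, assuming the pre-1.0 `openai` Python client and the `text-embedding-ada-002` model (both are assumptions for illustration, not necessarily the exact setup used):

```python
import numpy as np
import openai  # pre-1.0 client assumed; requires openai.api_key to be set

def embed(texts, model="text-embedding-ada-002"):
    """Embed a list of completions with the OpenAI embeddings endpoint."""
    response = openai.Embedding.create(model=model, input=texts)
    data = sorted(response["data"], key=lambda d: d["index"])
    return np.array([d["embedding"] for d in data])

completions = [
    "The answer is 97.",
    "The answer is ninety-seven.",  # similar meaning, different surface form
    "I like croissants.",
]

vectors = embed(completions)
# Pairwise Euclidean distances between completion embeddings.
distances = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
print(np.round(distances, 3))
```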

I don’t get the impression that RLHF needs hacks to prevent mode collapse: the InstructGPT paper reports that overfitting led to better human-rater feedback, and the Anthropic HH paper mentions in passing that the KL penalty may be wholly irrelevant (!).
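
For reference, the KL penalty being discussed is the divergence term in the RLHF objective as written in the InstructGPT paper, with the SFT model as the reference policy (the last term is the pretraining-gradient mix from the PPO-ptx variant):

$$\text{objective}(\phi)=\mathbb{E}_{(x,y)\sim \pi_\phi^{\mathrm{RL}}}\!\left[r_\theta(x,y)-\beta\,\log\frac{\pi_\phi^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)}\right]+\gamma\,\mathbb{E}_{x\sim D_{\text{pretrain}}}\!\left[\log \pi_\phi^{\mathrm{RL}}(x)\right]$$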

I’m not sure how to interpret the evidence from your first paragraph. You suggest that td-003 mode collapses where td-002 is perfectly capable. So you believe that both td-002 and td-003 mode collapse, in disjoint cases (given the examples from the original mode collapse post)?

• Thanks! This doesn’t seem to change the observations much, except that there doesn’t seem to be a case where this model has starkly the lowest entropy, as we found with davinci (see the entropy sketch below).

EDIT: I added code-davinci-002 as the main focus of the post, thanks!
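
A minimal sketch of the kind of entropy comparison discussed above: an empirical estimate over repeated samples, assuming the pre-1.0 `openai` client; the model list and prompt are illustrative rather than the post’s exact setup.

```python
import math
from collections import Counter
import openai  # pre-1.0 client assumed; requires openai.api_key to be set

def completion_entropy(model, prompt, n=100, max_tokens=1, temperature=1.0):
    """Empirical Shannon entropy (in bits) of sampled completions for one prompt."""
    response = openai.Completion.create(
        model=model, prompt=prompt, n=n,
        max_tokens=max_tokens, temperature=temperature,
    )
    counts = Counter(choice["text"] for choice in response["choices"])
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

prompt = "Pick a random number between 1 and 100: "
for model in ["davinci", "code-davinci-002", "text-davinci-002", "text-davinci-003"]:
    print(model, round(completion_entropy(model, prompt), 2))
```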

# RLHF does not appear to differentially cause mode-collapse

20 Mar 2023 15:39 UTC
90 points
• its top singular vector encodes what we think are the *least* frequent tokens.

I spot “GoldMagikarp” and “Skydragon”, and we now know these are indeed very infrequent tokens! This was good evidence for SolidGoldMagikarp lurking in plain sight : )
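
A minimal sketch of how one can reproduce this kind of check, assuming GPT-2’s token embedding matrix via HuggingFace `transformers` as a stand-in (the post’s exact model and singular-vector convention may differ):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Token (input) embedding matrix, shape [vocab_size, d_model].
W_E = model.wte.weight.detach()

# SVD of the embedding matrix; Vh[0] is the top right singular vector.
U, S, Vh = torch.linalg.svd(W_E, full_matrices=False)
projections = W_E @ Vh[0]  # projection of each token embedding onto that direction

# Tokens with the most extreme projections onto the top singular direction.
top = torch.topk(projections.abs(), k=20).indices
print([tokenizer.decode([i]) for i in top.tolist()])
```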

• I think this point was really overstated. I get the impression the rejected papers were basically converted into arXiv format as fast as possible, so it was easy for the mods to tell. However, I’ve seen submissions to cs.LG like this and this that are clearly from the alignment community. These posts are also not stellar by the standards of preprint formatting, and were apparently not rejected.

• I had the same reaction: the statement of this effect seemed like a bizarre framing. @afspies’s comment was helpful, and I don’t think the claim is as bizarre now.

(though overall I don’t think this post is a useful contribution because it is more likely to confuse than to shed light on LMs)

• I meant your first point.

Regarding the claim that finetuning on data with property $P$ will lead models to ‘understand’ (scare-quotes omitted from now on...) both $P$ and not $P$ better, thanks. I see better where the post is coming from.

However, I don’t necessarily think that we get easier elicitation of not $P$. There are reasons to believe finetuning is simply re-steering the base model rather than changing its understanding at all; for example, there are far more training steps in pretraining than in finetuning. Even if finetuning is shaping a model’s understanding of $P$, in an RLHF setup you’re generally seeing two responses, one with less $P$ and one with more $P$, and I’m not sure I buy that the model’s inclination to output not $P$ responses can increase when there are no gradients from not $P$ cases. There are such gradients in red-teaming setups, though. I think the author should register predictions in advance and then blind-test various base models and finetuned models for the Waluigi Effect.
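
For concreteness, the “two responses” step is the pairwise comparison loss used to train the reward model in the InstructGPT-style setup, where $y_w$ and $y_l$ are the preferred and rejected responses:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l)\sim D}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$

The rejected response only enters training through the reward model; during PPO the policy itself only receives gradients through $r_\theta$ evaluated on its own samples.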

• It is open-sourced here, and there is material from REMIX for getting used to the codebase here.

• The Waluigi Effect: After you train an LLM to satisfy a desirable property $P$, then it’s easier to elicit the chatbot into satisfying the exact opposite of property $P$.

I’ve tried several times to engage with this claim, but it remains dubious to me and I didn’t find the croissant example enlightening.

Firstly, I think there is only weak evidence that training on properties makes opposite behavior easier to elicit. I believe this claim is largely based on the Bing Chat story, which may have these properties due to bad finetuning rather than because these finetuning methods cause the Waluigi effect. I think ChatGPT is an example of finetuning making these models more robust to prompt attacks (example).

Secondly (and relatedly), I don’t think this article does enough to disentangle the effect of capability gains from the Waluigi effect. As models become more capable both in pretraining (understanding subtleties in language better) and in finetuning (lowering the barrier of entry for the prompting required to get useful outputs), they will become easier to jailbreak with stranger prompts.

# OpenAI introduce ChatGPT API at 1/10th the previous \$/token

1 Mar 2023 20:48 UTC
28 points