Interpretability
Views my own
I would say a better reference for the limitations of ROME is this paper: https://aclanthology.org/2023.findings-acl.733
Short explanation (Neel’s summary): editing in the Rome fact will also make slightly related prompts, e.g. “The Louvre is cool. Obama was born in” …, be completed with ” Rome” too.
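For concreteness, a rough sketch of that check (hypothetical `edited_model` and `tokenizer` objects, standard HuggingFace generation calls; not the paper’s evaluation code):

```python
# Hypothetical sketch of the "slightly related prompt" failure mode after a ROME edit.
prompts = [
    "The Louvre is cool. Obama was born in",  # unrelated subject, slightly related context
    "Obama was born in",                      # control prompt
]
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = edited_model.generate(ids, max_new_tokens=1)
    # After the edit, the first prompt tends to be completed with " Rome" too.
    print(repr(prompt), "->", tokenizer.decode(out[0, -1].item()))
```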
I agree that twitter is a worse use of time.
Going to posters for works you already know, to talk to the authors, seems a great idea and I do it. Re-reading your OP, you suggest things like checking whether papers are fake or not at poster sessions. Maybe you just meant papers that you already knew about? It sounded as if you were suggesting doing this for random papers, which I’m more skeptical about.
My opinion is that going to poster sessions, orals, pre-researching papers etc. at ICML/ICLR/NeurIPS is pretty valuable for new researchers and I wish I had done this before having any papers (you don’t need to have any papers to go to a conference). See also Thomas Kwa’s comment about random intuitions learnt from going to a conference.
After this, I agree with Leo that it would be a waste of my time to go to posters/orals or pre-research papers. Maybe there’s some value in this for conceptual research, but for most empirical work I’m very skeptical (most papers are not good, and it takes time to figure out whether a paper is good or not, etc.).
If there are some very common features in particular layers (e.g. an ‘attend to BOS’ feature), then restricting one expert to be active at a time will potentially force SAEs to learn common features in every expert.
+1 to similar concerns. I would probably have left one expert always on; this should remove some redundant features (the common ones that would otherwise be duplicated in every expert).
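To make “one expert always on” concrete, here’s a minimal PyTorch sketch (illustrative names, top-1 routing, and sizes; not the post’s architecture): a small shared encoder that is always active alongside the routed experts, so common features don’t have to be relearned inside every expert.

```python
import torch
import torch.nn as nn

class SharedExpertMoESAE(nn.Module):
    """Sketch of an MoE-style SAE with one always-on shared expert.

    The shared expert can absorb very common features (e.g. an 'attend to BOS'
    feature) so the routed experts don't each have to relearn them.
    """

    def __init__(self, d_model: int, d_shared: int, d_expert: int, n_experts: int):
        super().__init__()
        self.shared = nn.Linear(d_model, d_shared)  # always active
        self.experts = nn.ModuleList(nn.Linear(d_model, d_expert) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.dec_shared = nn.Linear(d_shared, d_model, bias=False)
        self.dec_experts = nn.ModuleList(nn.Linear(d_expert, d_model, bias=False) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shared expert: encodes common features for every token.
        recon = self.dec_shared(torch.relu(self.shared(x)))
        # Routed experts: top-1 routing, one expert active per token.
        expert_idx = self.router(x).argmax(dim=-1)
        for i, (enc, dec) in enumerate(zip(self.experts, self.dec_experts)):
            mask = (expert_idx == i).unsqueeze(-1)  # which tokens route to expert i
            recon = recon + mask * dec(torch.relu(enc(x)))  # computes all experts for clarity
        return recon
```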
Relevant further context: Gray Swan’s Cygnet-8B Llama finetune (which uses circuit breakers and probably other safety training too, and had impressive-seeming 0.0 scores in some red-teaming evals in the paper) was jailbroken in 3 hours: https://x.com/elder_plinius/status/1813670661270786137
My takeaway from the blog post was that circuit breakers have fairly simple vulnerabilities. Since circuit breakers are an adversarial robustness method (not a capabilities method), I think you can update on the results of single case studies (i.e. worst-case evaluations rather than average-case evaluations).
Geminis generally search the internet, which is why they show up on LMSYS without a knowledge cutoff date. Even when there’s no source attached, the model still knows information from 4 days ago via the internet (image attached). But I think in your response the [1] shows the model did find an internet source for the GUID anyway??
Unless you’re using the API here and the model is being weird? Without internet access, I expect it’s possible to coax the string out, but the model refuses requests a lot, so I think it would require a bit of elbow grease.
Mistral and Pythia use rotary embeddings and don’t have a positional embedding matrix. Which matrix are you looking at for those two models?
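For reference, here is a quick check using the standard `transformers` API (Pythia shown with a small checkpoint; the same check applies to Mistral, and GPT-2 is included as a contrast case that does have a learned positional embedding):

```python
from transformers import AutoModelForCausalLM

# Models with learned absolute positions (e.g. GPT-2) expose a positional
# embedding parameter; rotary models (Pythia, Mistral) have no such matrix.
for name in ["gpt2", "EleutherAI/pythia-70m"]:
    model = AutoModelForCausalLM.from_pretrained(name)
    pos_params = [n for n, _ in model.named_parameters()
                  if "wpe" in n or "position_embedding" in n]
    print(name, pos_params or "no positional embedding matrix")
```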
They emailed some people about this: https://x.com/brianryhuang/status/1763438814515843119
The reason is that it may allow unembedding matrix weight stealing: https://arxiv.org/abs/2403.06634
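Rough intuition for why, as a toy numpy sketch (made-up sizes, not the paper’s actual attack): every full logit vector the API returns is hidden_state @ W_U, so the collected logits span a subspace whose rank is the hidden size, and SVD recovers that dimension (and W_U up to a linear transform).

```python
import numpy as np

d_model, d_vocab, n_queries = 64, 1000, 200          # toy sizes for the demo
rng = np.random.default_rng(0)
W_U = rng.normal(size=(d_model, d_vocab))             # "secret" unembedding matrix
hidden = rng.normal(size=(n_queries, d_model))        # final hidden states for the queries
logits = hidden @ W_U                                 # what a full-logit API would return

# Rank of the collected logit matrix reveals the hidden size, and its row space
# spans the rows of W_U (i.e. W_U is recoverable up to a linear transform).
singular_values = np.linalg.svd(logits, compute_uv=False)
est_hidden_dim = int((singular_values > 1e-6 * singular_values[0]).sum())
print(est_hidden_dim)  # -> 64
```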
they [transcoders] take as input the pre-MLP activations, and then aim to represent the post-MLP activations of that MLP sublayer
I assumed this meant activations just before the GELU and just after the GELU, but looking at the code I think I was wrong. Could you rephrase to e.g.
they take as input MLP block inputs (just after LayerNorm) and they output MLP block outputs (what is added to the residual stream)
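Or, as a minimal code sketch of what I understand the intended input/output to be (illustrative PyTorch, not the post’s implementation):

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Maps the MLP block's input (residual stream just after LayerNorm) to the
    MLP block's output (what the MLP adds back into the residual stream)."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, mlp_input: torch.Tensor) -> torch.Tensor:
        features = torch.relu(self.encoder(mlp_input))  # sparse feature activations
        return self.decoder(features)                   # approximates the MLP's output

# Training objective, sketched: MSE to the true MLP output plus an L1 penalty
# on the feature activations, i.e.
# loss = (transcoder(ln2_out) - mlp_out).pow(2).mean() + l1_coef * features.abs().sum()
```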
Ah yeah, Neel’s comment makes no claims about feature death beyond Pythia-2.8B residual streams. I trained 524K-width Pythia-2.8B MLP SAEs with <5% feature death (not in the paper), and Anthropic’s work gets to >1M live features (with no claims about interpretability), which together would make me surprised if 131K were near the maximum possible number of live features, even in small models.
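(For concreteness, the feature-death number here is the fraction of features that never fire over a large token sample. A minimal sketch, with placeholder names for the SAE encoder and the streamed activation batches:)

```python
import torch

@torch.no_grad()
def dead_feature_fraction(sae_encode, activation_batches, d_sae: int) -> float:
    """Fraction of SAE features that never fire over a sample of activations.
    `sae_encode` maps [n_tokens, d_model] activations to [n_tokens, d_sae]
    feature activations; the size of the token sample is the main judgement call."""
    fired = torch.zeros(d_sae, dtype=torch.bool)
    for acts in activation_batches:
        fired |= (sae_encode(acts) > 0).any(dim=0)
    return 1.0 - fired.float().mean().item()
```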
I don’t think zero ablation is that great a baseline. We’re mostly using it for continuity with Anthropic’s prior work (and also it’s a bit easier to explain than a mean-ablation baseline, which requires specifying where the mean is calculated from). In the updated paper https://arxiv.org/pdf/2404.16014v2 (up in a few hours) we show all the CE loss numbers, so anyone can rescale them however they wish.
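For reference, one common way to write the loss-recovered metric; the baseline choice (zero vs. mean ablation) only changes the denominator (a sketch, not the exact code from the paper):

```python
def loss_recovered(ce_clean: float, ce_spliced: float, ce_ablated: float) -> float:
    """1.0 means splicing the SAE in doesn't change the CE loss; 0.0 means it's
    as bad as ablating the site entirely. `ce_ablated` is where the baseline
    choice enters: zero-ablation (used for continuity with Anthropic's work) or
    mean-ablation (mean taken over some reference distribution)."""
    return 1.0 - (ce_spliced - ce_clean) / (ce_ablated - ce_clean)
```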
I don’t think compute efficiency hit[1] is ideal. It’s really expensive to compute, since you can’t calculate it from an SAE alone: you need to know facts about smaller LLMs. It also doesn’t transfer as well between sites (splicing in an attention-layer SAE doesn’t impact loss much, splicing in an MLP SAE impacts loss more, and residual-stream SAEs impact loss the most). Overall I expect it’s a useful but expensive alternative to loss recovered, not a replacement.
EDIT: on consideration of Leo’s reply, I think my point about transfer is wrong; a metric like “compute efficiency recovered” could always be created by rescaling the compute efficiency number.
What I understand “compute efficiency hit” to mean is: for a given (SAE, LM) pair, how much less compute you’d need (as a multiplier) to train a different LM, such that it gets the same loss as the original LM-with-the-SAE-spliced-in.
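A sketch of how you’d actually compute it, assuming a Chinchilla-style scaling-law fit L(C) = a·C^(−b) + c obtained from smaller LMs (those fitted constants are exactly the expensive part, and the ratio convention below is just one choice):

```python
def compute_efficiency_hit(ce_spliced: float, a: float, b: float, c: float,
                           compute_original: float) -> float:
    # Assumed scaling law L(C) = a * C**(-b) + c, fit on smaller LMs.
    # Find the compute at which a compute-optimally-trained LM would already
    # match the loss of the original-LM-with-the-SAE-spliced-in.
    compute_needed = ((ce_spliced - c) / a) ** (-1.0 / b)
    # A value of 2.0 means "a model trained with half the compute would already
    # match the spliced loss".
    return compute_original / compute_needed
```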
I’m not sure what you mean by “the reinitialization approach”, but feature death doesn’t seem to be a major issue at the moment. At all sites besides L27, our Gemma-7B SAEs didn’t have much feature death at all (stats in https://arxiv.org/pdf/2404.16014v2, up in a few hours), and the Anthropic update suggests that even in small models the problem can be addressed.
The “This should be cited” part of Dan H’s comment was edited in after the author’s reply. I think this is in bad faith since it masks an accusation of duplicate work as a request for work to be cited.
On the other hand, the post’s authors did not act in bad faith, since they were responding to an accusation of duplicate work (they were not responding to a request to improve the work).
(The authors made me aware of this fact)
Yes. On the AGI safety and alignment team we are working on activation steering: for example, Alex Turner, who invented the technique with collaborators, is working on this, and the first author of “a few tokens deep” is currently interning on the Gemini Safety team mentioned in this post. We don’t have hard and fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety, would see “safety practices we can test right now” as a research goal.