Anthropic’s new paper ‘Mapping the Mind of a Large Language Model’ is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).
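(For readers who haven’t seen one, here’s a minimal sketch of what this kind of sparse autoencoder looks like. The dimensions, the ReLU/L1 choices, and the toy training step are purely illustrative; they are not Anthropic’s actual architecture or hyperparameters.)

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy dictionary-learning SAE: model activations in, an overcomplete set of
    (hopefully sparse, hopefully interpretable) feature activations in the middle,
    reconstructed activations out. Sizes here are illustrative."""

    def __init__(self, d_model: int = 4096, n_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)    # map back into the model's activation space
        return features, reconstruction

def sae_loss(acts, features, reconstruction, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 penalty that pushes toward sparse features."""
    mse = (reconstruction - acts).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Toy usage on random data; real training samples activations from one layer
# (e.g. the residual stream) of the language model being interpreted.
sae = SparseAutoencoder()
acts = torch.randn(8, 4096)
features, reconstruction = sae(acts)
loss = sae_loss(acts, features, reconstruction)
loss.backward()
```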
The paper (which I’m still reading; it’s not short) updates me somewhat toward ‘SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].’ As I read I’m trying to think through what I would have to see to be convinced of that hypothesis. I’m not an expert here! I’m posting my thoughts mostly to ask for feedback about where I’m wrong and/or what I’m missing. Remaining gaps I’ve thought of so far:
What’s lurking in the remaining reconstruction loss? Are there important missing features?
Will SAEs get all meaningful features given adequate dictionary size?
Are there important features which SAEs just won’t find because they’re not that sparse?
The paper points out that they haven’t rigorously investigated the sensitivity of the features, ie whether the feature reliably fires whenever relevant text/image is present; that seems like a small but meaningful gap. (A rough way to measure this is sketched just after this list.)
Is steering on clearly safety-relevant features sufficient, or are there interactions between multiple not-clearly-safety-relevant features that in combination cause problems?
How well do we even think we understand feature compositionality, especially across multiple layers? How would we measure that? I would think the gold standard would be ‘ability to predict model output given context + feature activations’?
Does doing sufficient steering on safety-relevant features cause unacceptable distortions to model outputs? (A rough way to check for this is sketched just after this list.)
eg if steering against scam emails causes the model to see too many emails as scammy and refuse to help you write a billing email
eg if steering against power-seeking causes refusal on legitimate tasks that include resource acquisition
Can we find ways to make SAEs efficient enough to scale to production models with a sufficient number of features?
(as opposed to the paper under discussion, where ‘The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive’)
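On the sensitivity gap above, one crude way to quantify it is just the feature’s firing rate over a hand-labeled set of positive examples. A minimal sketch; get_feature_activation is a hypothetical helper (run the model, run the SAE, take the feature’s max activation over tokens), not anything from the paper:

```python
def feature_sensitivity(positive_texts, feature_id, get_feature_activation, threshold=0.0):
    """Fraction of known-relevant inputs on which the feature fires at all.

    positive_texts: inputs a human has labeled as containing the concept.
    get_feature_activation: hypothetical helper returning the feature's max
        activation over the tokens of a text (model + SAE under the hood).
    """
    fired = sum(
        1 for text in positive_texts
        if get_feature_activation(text, feature_id) > threshold
    )
    return fired / len(positive_texts)

# A sensitivity well below 1.0 means the feature misses many inputs that a
# human would say clearly contain the concept, which is exactly the gap
# the paper says it hasn't rigorously investigated.
```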
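And on the distortion question, the usual approach is to add a scaled copy of a feature’s decoder direction into the residual stream at some layer, then compare behavior on benign prompts with and without the intervention. A rough sketch using a PyTorch forward hook; the model, tokenizer, layer, scale, and looks_like_refusal classifier are all assumptions for illustration, not anything from the paper:

```python
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    """Forward hook that adds scale * direction to a layer's residual-stream output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def refusal_rate(model, tokenizer, prompts, looks_like_refusal, layer,
                 direction=None, scale=0.0):
    """Fraction of benign prompts the model refuses, optionally under steering."""
    handle = None
    if direction is not None:
        handle = layer.register_forward_hook(make_steering_hook(direction, scale))
    try:
        refusals = 0
        for prompt in prompts:
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            out = model.generate(ids, max_new_tokens=64)
            text = tokenizer.decode(out[0], skip_special_tokens=True)
            refusals += int(looks_like_refusal(text))
        return refusals / len(prompts)
    finally:
        if handle is not None:
            handle.remove()

# Usage sketch (all names hypothetical): take the 'scam email' feature's decoder
# column as the direction, steer with a negative scale, and compare refusal rates
# on benign billing-email prompts with scale=0 vs. the steered run. A big jump in
# refusals on legitimate tasks is the kind of distortion worried about above.
# direction = sae.decoder.weight[:, SCAM_FEATURE_ID].detach()
```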
Of course LLM alignment isn’t necessarily sufficient on its own for safety, since eg scaffolded LLM-based agents introduce risk even if the underlying LLM is well-aligned. But I’m just thinking here about what I’d want to see to feel confident that we could use these techniques to do the LLM alignment portion.

[1] I think I’d be pretty surprised if it kept working much past human-level, although I haven’t spent a ton of time thinking that through as yet.
I wrote up a short post with a summary of their results. It doesn’t really answer any of your questions, but I do have thoughts on a couple, even though I’m not an expert on interpretability.
But my main focus is on your footnote: is this going to help much with aligning “real” AGI? (I’ve been looking for a term; maybe REAL stands for Reflective Entities with Agency and Learning? :) I’m of course primarily thinking of foundation models scaffolded to have goals and cognitive routines, and to incorporate multiple AI systems such as an episodic memory system. I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end—and we haven’t really thought through which is which yet.
‘is this going to help much with aligning “real” AGI?’
I think it’s an important foundation but insufficient on its own. I think if you have an LLM that, for example, is routinely deceptive, it’s going to be hard or impossible to build an aligned system on top of that. If you have an LLM that consistently behaves well and is understandable, it’s a great start toward broader aligned systems.
‘I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end’
I think that at least as important as the ability to interpret here is the ability to steer. If, for example, you can cleanly (ie based on features that crisply capture the categories we care about) steer a model away from being deceptive even when we’re handing it goals and memories that would otherwise lead to deception, then the resulting system at least has the potential to be much safer.
Note mostly to myself: I posted this also on the Open Source mech interp slack, and got useful comments from Aidan Stewart, Dan Braun, & Lee Sharkey. Summarizing their points:
Aidan: ‘are the SAE features for deception/sycophancy/etc more robust than other methods of probing for deception/sycophancy/etc?’, and in general evaluating how SAEs behave under significant distributional shifts seems interesting.
Dan: I’m confident that pure steering based on plain SAE features will not be very safety relevant. This isn’t to say I don’t think it will be useful to explore right now; we need to know the limits of these methods... I think that [steering will not be fully reliable], for one or more of reasons 1-3 in your first msg.
Lee: Plain SAEs won’t get all the important features; see recent work on e2e SAEs. Also, there is probably no such thing as ‘all the features’. I view it more as a continuum that we just put into discrete buckets for our convenience.
Also, Stephen Casper feels that this work underperformed his expectations; see also the discussion on that post.