SAE vs. RepE
I recently read a post by Dan Hendrycks (director of the Center for AI Safety and an advisor to xAI) criticizing Anthropic’s focus on sparse autoencoders (SAEs) as a tool for mechanistic interpretability.
You can find that post HERE. Some salient quotes below.
On SAEs:
Another technique initially hailed as a breakthrough is sparse autoencoders (SAEs) [...] getting SAEs to work reliably has proven challenging, possibly because some concepts are distributed too widely across the network, or perhaps because the model’s operations are not founded on a neat set of human-understandable concepts at all. This is the field that DeepMind recently deprioritized, noting that their SAE research had yielded disappointing results. In fact, given the task of detecting harmful intent in user inputs, SAEs underperformed a simple baseline.
[...] despite substantial efforts over the past decade, the returns from interpretability have been roughly nonexistent. To avoid overinvesting in ideas that are unlikely to work, potentially to the neglect of more effective ones, we should be more skeptical in the future about heavily prioritizing mechanistic interpretability over other types of AI research.
On RepE:
Representation engineering (RepE) is a promising emerging field that takes this view. Focusing on representations as the primary units of analysis – as opposed to neurons or circuits – this area finds meaning in the patterns of activity across many neurons.
A strong argument for this approach is the fact that models often largely retain the same overall behaviors even if entire layers of their structure are removed. As well as demonstrating their remarkable flexibility and adaptability – not unlike that of the brain – this indicates that the individual components in isolation offer far fewer insights than the organization between them. In fact, because of emergence, analyzing complex systems at a higher level is often enough to understand or predict their behavior, while detailed lower-level inspection can be unnecessary or even misleading.
RepE can identify, amplify, and suppress characteristics. RepE helps manipulate model internals to control them and make them safer. Since the original RepE paper, we have used RepE to make models unlearn dual-use concepts, be more honest, be more robust to adversarial attacks, edit AIs’ values, and more.
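To make that concrete for myself, here is a minimal sketch of what a RepE-style intervention looks like mechanically: a difference-of-means “reading vector” computed from contrastive prompts, then added back into the residual stream via a forward hook. The toy block, the sizes, and the alpha scale below are placeholders of mine, not anything from the RepE paper.

```python
# Hedged sketch of RepE-style activation steering. Everything here
# (ToyBlock stand-in, sizes, alpha) is illustrative, not the paper's setup.
import torch

torch.manual_seed(0)
d_model = 64

# Stand-in for one transformer block's residual-stream output.
block = torch.nn.Linear(d_model, d_model)

# 1) Collect activations on two contrastive prompt sets (random stand-ins here).
honest_acts = block(torch.randn(32, d_model))     # e.g. "answer honestly" prompts
dishonest_acts = block(torch.randn(32, d_model))  # e.g. "answer deceptively" prompts

# 2) The "reading vector" is the difference of the two activation means.
direction = honest_acts.mean(0) - dishonest_acts.mean(0)
direction = direction / direction.norm()

# 3) At inference time, nudge the block's output along the direction via a hook.
alpha = 4.0  # steering strength; flipping the sign suppresses instead of amplifies

def steer(module, inputs, output):
    return output + alpha * direction

handle = block.register_forward_hook(steer)
steered = block(torch.randn(1, d_model))  # activations now shifted along `direction`
handle.remove()
```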
I’m not sure why the article frames SAEs and RepE as if they cannot coexist as safety methods. Taking the most charitable interpretation of his point, I think Dan is arguing that investment should focus on RepE over SAEs.
However, from my perspective there have been some encouraging results with SAEs. Golden Gate Claude, or even just being able to “clamp” and/or “upweight” circuits like we saw in the Dallas example from “On the Biology of a Large Language Model,” would seem to indicate that SAE features are on the right track to interpretability for at least some concepts. Ultimately I don’t see why RepE and SAEs can’t both be valuable tools.
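“Clamping” an SAE feature just means encoding an activation into the sparse feature basis, pinning one latent to a chosen value, and substituting the decoded reconstruction back into the forward pass. A rough sketch, with random weights and a made-up feature index standing in for a trained SAE:

```python
# Hedged sketch of clamping a single SAE feature, Golden-Gate-Claude style.
# The SAE weights and feature index are random placeholders; a real run would
# load a trained SAE and pick a feature identified with the target concept.
import torch

torch.manual_seed(0)
d_model, d_sae = 64, 512
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5
b_dec = torch.zeros(d_model)

def sae_encode(x):
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def sae_decode(f):
    return f @ W_dec + b_dec

x = torch.randn(1, d_model)          # residual-stream activation at some layer
f = sae_encode(x)                    # sparse feature activations

feature_idx, clamp_value = 42, 10.0  # placeholder "target concept" feature
f_clamped = f.clone()
f_clamped[:, feature_idx] = clamp_value

# The edited reconstruction replaces the original activation downstream.
x_steered = sae_decode(f_clamped)
```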
I would like to hear what other people think about this criticism. Is it valid, such that focus should no longer be put into SAEs, or is there enough “there there” that they’re still an avenue worth pursuing?
“focus should no longer be put into SAEs...?”
I think we should still invest research into them BUT it depends on the research.
Less interesting research:
1. Applying SAEs to [X-model/field] (or Y-problem w/o any baselines)
More interesting research:
1. Problems w/ SAEs & possible solutions
   - Feature suppression (solved by post-training, gated SAEs, & top-k; see the sketch after this list)
   - Feature absorption (possibly solved by Matryoshka SAEs)
   - SAEs don’t find the same features across seeds (maybe solved by constraining latents to the convex hull of the data)
   - Dark matter of SAEs (nothing AFAIK)
   - Many more I’m likely forgetting/haven’t read
2. Comparing SAEs w/ strong baselines for solving specific problems
3. Using SAEs to test how true the linear representation hypothesis is
4. Changing SAE architecture to match the data
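For the top-k point above, here is a minimal sketch of a top-k SAE forward pass: only the k largest pre-activations are kept per token, so the surviving features don’t have to shrink to satisfy an L1 penalty. The weights here are random placeholders rather than a trained SAE.

```python
# Hedged sketch of a top-k SAE forward pass (one proposed fix for feature
# suppression/shrinkage). Random weights stand in for a trained SAE.
import torch

torch.manual_seed(0)
d_model, d_sae, k = 64, 512, 16
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5
b_dec = torch.zeros(d_model)

def topk_sae(x):
    pre = (x - b_dec) @ W_enc                           # pre-activations
    vals, idx = pre.topk(k, dim=-1)                     # keep the k largest per row
    f = torch.zeros_like(pre).scatter_(-1, idx, torch.relu(vals))
    return f @ W_dec + b_dec, f                         # reconstruction, sparse codes

x = torch.randn(8, d_model)
x_hat, f = topk_sae(x)
print((f != 0).sum(-1))                                 # at most k active features per row
```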
In general, I’m still excited about an unsupervised method that finds all the model’s features/functions. SAEs are one possible option, but others are being worked on! (APD & L3D for weight-based methods)
Relatedly, I’m also excited about interpretable-from-scratch architectures that do lend themselves more towards mech-interp (or bottom-up in Dan’s language).
While I don’t have time to address the Dan Hendrycks & RepE questions, I want to link you to this post from the GDM mech interp team. It gives a good critique of SAEs in terms of making downstream progress.
Just on the Dallas example, look at the +8x & −2x below.
So they multiplied all features in the China supernode by 8x and multiplied the Texas supernode (Texas is “under” China, meaning it’s being “replaced”) by −2x. That’s really weird! You’d expect them to multiply the Texas node by 0. If Texas is upweighting “Austin”, then −2x-ing it could be downweighting “Austin”, leading to cleaner top-output results. Notice how all the graphs use different numbers for upweighting & downweighting (it’s good that they include those scalars in the images). This suggests the SAE latents didn’t cleanly separate the features we think exist.
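Mechanically, the intervention amounts to something like the sketch below: scale every feature in a named supernode by one multiplier and re-decode. The feature indices and decoder here are made-up placeholders; the multipliers are the ones in the figure, and the point is that ×0 would simply remove the Texas supernode’s contribution while ×(−2) actively pushes against it.

```python
# Hedged sketch of a supernode-scaling intervention. Feature groupings and the
# decoder are placeholders; only the multipliers mirror the figure.
import torch

torch.manual_seed(0)
d_sae, d_model = 512, 64
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5
f = torch.relu(torch.randn(1, d_sae))                    # stand-in SAE feature activations

supernodes = {"china": [3, 17, 99], "texas": [42, 250]}  # made-up feature groupings
multipliers = {"china": 8.0, "texas": -2.0}              # the values shown in the figure

f_edit = f.clone()
for name, idxs in supernodes.items():
    f_edit[:, idxs] *= multipliers[name]                 # 0.0 would ablate; -2.0 reverses

x_edit = f_edit @ W_dec                                  # edited contribution to the residual stream
```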
(With that said, in their paper itself, they’re very careful & don’t overclaim what their work shows; I believe it’s a great paper overall!)
I will set aside the question of resource allocation for others to decide, and just note that there is another branch of interpretability research that can (at least in principle) be used in conjunction with these approaches while addressing a fundamental limitation of them: work whose focus is deriving robust estimators of predictive uncertainty, conditioned on controlling for the representation space of the models over the available observed data. The following post provides a high-level overview: https://www.lesswrong.com/posts/YxzxzCrdinTzu7dEf/the-determinants-of-controllable-agi-1
The reason this is a unifying method is that once we control for the uncertainty, we have non-vacuous assurances that the inductive biases the semi-supervised methods (SAE, RepE, and related) establish on held-out dev sets will carry over to new, unseen test data.
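As a purely illustrative sketch of what “uncertainty conditioned on the representation space” can look like, here is a split-conformal-style estimator with a nearest-neighbor score over representations. This is my own simplification for illustration, not the specific estimator described in the linked post.

```python
# Hedged sketch: split-conformal-style uncertainty where the nonconformity score
# is a distance in representation space. Simplified for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for penultimate-layer representations and labels.
cal_reps = rng.normal(size=(200, 32))
cal_labels = rng.integers(0, 2, size=200)
test_rep = rng.normal(size=(32,))

def score(rep, reps, labels, y):
    # Distance to the nearest calibration example of class y.
    same = reps[labels == y]
    return np.min(np.linalg.norm(same - rep, axis=1))

# Calibration scores: each calibration point scored against the rest of its class.
cal_scores = np.array([
    score(cal_reps[i], np.delete(cal_reps, i, axis=0),
          np.delete(cal_labels, i), cal_labels[i])
    for i in range(len(cal_reps))
])

# Keep labels whose score is not extreme relative to the calibration distribution.
alpha = 0.1
threshold = np.quantile(cal_scores, 1 - alpha)
prediction_set = [y for y in (0, 1)
                  if score(test_rep, cal_reps, cal_labels, y) <= threshold]
print(prediction_set)  # may contain 0, 1, both, or neither -> abstain
```

The output is a prediction set that can abstain (come back empty) or flag ambiguity (contain both labels), which is the kind of non-vacuous control I mean above.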