EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

Part 13 of 12 in the Engineer’s Interpretability Sequence.


On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it underperformed my expectations. I am beginning to be concerned that Anthropic’s recent approach to interpretability research might be better explained by safety washing than practical safety work.

Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.

Reflecting on predictions

See my original post for 10 specific predictions about what today’s paper would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 and obviously did not do 4, 5, 7, 8, 9, and 10. Meanwhile, I think that their experiments to identify specific and safety-relevant features should count for 3 (proofs of concept for a useful type of task) but definitely do not count for 6 (*competitively* finding and removing a harmful behavior that was represented in the training data).

Thus, my assessment is that Anthropic did 1-3 but not 4-10. I have been wrong with mech interp predictions in the past, but this time, everything I predicted with >50% probability happened, and everything I predicted with <50% probability did not happen.

The predictions were accurate in one sense. But overall, the paper underperformed expectations. If you scored the paper relative to my predictions by giving it (1-p) points when it did something that I predicted it would do with probability p and -p points when it did not, the paper would score −0.74.
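The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function name `score_paper` and the probabilities in the example are hypothetical placeholders, not my actual predictions, so the example total will not reproduce the −0.74 figure.

```python
from typing import Iterable, Tuple


def score_paper(predictions: Iterable[Tuple[float, bool]]) -> float:
    """Score a paper against a list of (p, happened) predictions.

    Each prediction contributes (1 - p) points if the predicted thing
    happened and -p points if it did not.
    """
    total = 0.0
    for p, happened in predictions:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        total += (1.0 - p) if happened else -p
    return total


# Hypothetical placeholder values, NOT the probabilities from my original post.
example_predictions = [
    (0.9, True),   # likely and it happened: small reward
    (0.6, True),   # somewhat likely and it happened: larger reward
    (0.3, False),  # unlikely and it did not happen: small penalty
    (0.1, False),  # very unlikely and it did not happen: tiny penalty
]
example_score = score_paper(example_predictions)
```

Under this rule, a prediction with a calibrated probability p has an expected contribution of p(1 − p) + (1 − p)(−p) = 0, so a negative total means the paper did less than my probabilities implied it would.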

A review + thoughts

I think that Anthropic’s new SAE work has continued to be like lots of prior high-profile work on mechanistic interpretability – it has focused on presenting illustrative examples, streetlight demos, and cherry-picked proofs of concept. This is useful for science, but it does not yet show that SAEs are helpful and competitive for diagnostic and debugging tasks that could improve AI safety.

I feel increasingly worried about how Anthropic motivates and sells its interpretability research in the name of safety. Today’s paper makes some highly generalized motte-and-bailey claims that oversell what was accomplished, such as “Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer,” “Sparse autoencoders produce interpretable features for large models,” and “The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.” The paper also omitted past literature on interpretability illusions (e.g., Bolukbasi et al., 2021), which its methodology seems prone to. Normally, problems like this are mitigated by peer review, which Anthropic does not participate in. Meanwhile, whenever Anthropic puts out new interpretability research, I always see a laundry list of posts from the company and its employees promoting it. They always seem to claim the same things: that some ‘groundbreaking new progress has been made’ and that ‘the model was even more interpretable than they thought’ but that ‘there remains progress to be made before interpretability is solved’. I won’t link to any specific person’s posts, but here are Anthropic’s posts from today and October 2023.

The way that Anthropic and its employees present their interpretability work has real-world consequences. For example, it directly led to this viral claim that interpretability will be solved and that we are bound for safe models. It also seems to have led to at least one claim in a policy memo that advocates of AI safety are being silly because mechanistic interpretability was solved. Meanwhile, today, it seems that Anthropic orchestrated a New York Times article to be released alongside the paper, claiming to the public that exciting progress has been made (although the article also included helpful critical commentary on limitations!).

If interpretability is ever going to be helpful for safety, it will need to be useful and competitive in practical applications. This point has been made consistently for the better part of a decade (e.g. Ananny and Crawford, 2016; Lipton, 2016; Doshi-Velez and Kim, 2017; Miller, 2018; Krishnan, 2020; Rauker et al., 2022). Despite this, it seems to me that Anthropic has so far not applied its interpretability techniques to practical tasks and shown that they are competitive. Instead of testing applications and beating baselines, the recent approach has been to keep focusing on streetlight demos and showing lots of cherry-picked examples. I hope to see this change soon.

I don’t think that SAE research is misguided. In my post, I pointed out 6 things that I think they could be useful for. Meanwhile, some good recent work has demonstrated proofs of concept that SAEs can be useful on certain non-cherry-picked tasks of practical value and interest (Marks et al., 2024). I think that it’s very possible that SAEs and other interpretability techniques can be lenses into models that can help us find useful clues and insights. However, Anthropic’s research on SAEs has yet to demonstrate practical usefulness over baselines that could help engineers in real applications.

I know that members of the Anthropic interpretability team have been aware of this critique. Meanwhile, Anthropic and its employees consistently affirm that their work is motivated by safety in the real world. But is it? I am starting to wonder about the extent to which the interpretability team’s current agenda is better explained by practical safety work versus sophisticated safety washing to score points in social media, news, and government.

Thanks to Ryan Greenblatt and Buck Shlegeris. I did not consult with them on this post, but they pointed out some useful things in a Slack thread that helped me write this.