Sparse Autoencoders (SAEs)

Last edit: 6 Apr 2024 9:14 UTC by Joseph Bloom

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sum of interpretable components (often referred to as features). Sparse Autoencoders may be useful for interpretability and related alignment agendas.
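
As a rough sketch of the standard recipe (assuming PyTorch; the class, function, and hyperparameter names here are illustrative, not taken from any particular post below): a single hidden layer wider than the activation space, a ReLU to keep feature coefficients non-negative, and an L1 penalty on those coefficients to encourage sparsity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decomposes activations into a sparse sum of learned feature directions."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # d_features is typically several times d_model (an overcomplete basis).
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        # Non-negative feature coefficients via ReLU.
        features = F.relu(self.encoder(acts))
        # Reconstruction: a weighted sum of decoder directions, one per feature.
        recon = self.decoder(features)
        return recon, features

def sae_loss(acts, recon, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that induces sparsity.
    mse = (recon - acts).pow(2).mean()
    l1 = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1
```

Real training setups (see e.g. Towards Monosemanticity below) add further details, such as normalizing decoder columns and resampling dead features.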

For more information on SAEs, see:

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds · 5 Oct 2023 21:01 UTC
286 points
21 comments · 2 min read · LW link
(transformer-circuits.pub)

Intro to Superposition & Sparse Autoencoders (Colab exercises)

CallumMcDougall · 29 Nov 2023 12:56 UTC
65 points
8 comments · 3 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
82 points
5 comments · 19 min read · LW link

Attention SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
76 points
4 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
68 points
0 comments · 3 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · 18 Apr 2024 16:17 UTC
93 points
5 comments · 12 min read · LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
56 points
0 comments · 12 min read · LW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

21 Sep 2023 15:30 UTC
156 points
7 comments · 5 min read · LW link

[Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
137 points
22 comments · 22 min read · LW link · 2 reviews

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
53 points
0 comments · 14 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · 3 Apr 2024 12:34 UTC
85 points
20 comments · 22 min read · LW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom · 2 Feb 2024 6:54 UTC
92 points
37 comments · 15 min read · LW link

My best guess at the important tricks for training 1L SAEs

Arthur Conmy · 21 Dec 2023 1:59 UTC
35 points
4 comments · 3 min read · LW link

SAE reconstruction errors are (empirically) pathological

wesg · 29 Mar 2024 16:37 UTC
86 points
15 comments · 8 min read · LW link

SAE-VIS: Announcement Post

31 Mar 2024 15:30 UTC
73 points
8 comments · 1 min read · LW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
80 points
3 comments · 10 min read · LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
70 points
8 comments · 8 min read · LW link

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda · 23 Oct 2023 22:38 UTC
91 points
12 comments · 9 min read · LW link

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
106 points
2 comments · 4 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
86 points
5 comments · 7 min read · LW link

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart · 23 Apr 2024 14:09 UTC
29 points
2 comments · 8 min read · LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
41 points
19 comments · 1 min read · LW link
(arxiv.org)

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill · 24 Mar 2024 20:05 UTC
28 points
3 comments · 24 min read · LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy · 17 Jul 2023 1:41 UTC
54 points
1 comment · 7 min read · LW link

Some open-source dictionaries and dictionary learning infrastructure

Sam Marks · 5 Dec 2023 6:05 UTC
45 points
7 comments · 5 min read · LW link

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders

Logan Riggs · 5 Jul 2023 16:49 UTC
58 points
1 comment · 7 min read · LW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI · 5 Mar 2024 13:55 UTC
52 points
24 comments · 10 min read · LW link
(aizi.substack.com)

Sparse autoencoders find composed features in small toy models

14 Mar 2024 18:00 UTC
28 points
10 comments · 15 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · 17 Nov 2023 13:54 UTC
15 points
6 comments · 2 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné · 23 Sep 2023 16:21 UTC
29 points
8 comments · 5 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
11 points
0 comments · 5 min read · LW link

Explaining “Taking features out of superposition with sparse autoencoders”

Robert_AIZI · 16 Jun 2023 13:59 UTC
9 points
0 comments · 8 min read · LW link
(aizi.substack.com)

Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin Farrell · 15 Apr 2024 18:21 UTC
28 points
7 comments · 12 min read · LW link

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · 7 Oct 2023 23:30 UTC
136 points
8 comments · 4 min read · LW link

A small update to the Sparse Coding interim research report

30 Apr 2023 19:54 UTC
61 points
5 comments · 1 min read · LW link

[Replication] Conjecture’s Sparse Coding in Toy Models

2 Jun 2023 17:34 UTC
23 points
0 comments · 1 min read · LW link

[Replication] Conjecture’s Sparse Coding in Small Transformers

16 Jun 2023 18:02 UTC
52 points
0 comments · 5 min read · LW link

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
68 points
5 comments · 10 min read · LW link

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell · 23 Sep 2023 19:16 UTC
42 points
7 comments · 34 min read · LW link

Explainer—AutoInterpretation Finds Sparse Coding Beats Alternatives

Gauraventh · 1 Aug 2023 17:29 UTC
8 points
0 comments · 3 min read · LW link

Transformer Debugger

Henk Tillman · 12 Mar 2024 19:08 UTC
25 points
0 comments · 1 min read · LW link
(github.com)

Some additional SAE thoughts

Hoagy · 13 Jan 2024 19:31 UTC
28 points
4 comments · 13 min read · LW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

14 Jan 2024 2:06 UTC
22 points
0 comments · 42 min read · LW link

Normalizing Sparse Autoencoders

Fengyuan Hu · 8 Apr 2024 6:17 UTC
12 points
17 comments · 13 min read · LW link

[Research Update] Sparse Autoencoder features are bimodal

Robert_AIZI · 22 Jun 2023 13:15 UTC
23 points
1 comment · 5 min read · LW link
(aizi.substack.com)

Past Tense Features

Can · 20 Apr 2024 14:34 UTC
11 points
0 comments · 4 min read · LW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

27 Feb 2024 2:43 UTC
39 points
16 comments · 15 min read · LW link

Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

15 Mar 2024 16:30 UTC
12 points
5 comments · 4 min read · LW link

Sparse Autoencoders: Future Work

21 Sep 2023 15:30 UTC
34 points
5 comments · 6 min read · LW link

Do sparse autoencoders find “true features”?

Demian Till · 22 Feb 2024 18:06 UTC
70 points
33 comments · 11 min read · LW link