Sparse Autoencoders (SAEs)

Last edit: 6 Apr 2024 9:14 UTC by Joseph Bloom

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sum of interpretable components (often referred to as features). SAEs may be useful for interpretability and related alignment agendas.
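
To make the decomposition concrete, here is a minimal sketch of a standard ReLU SAE in PyTorch. It is illustrative only: the names (SparseAutoencoder, d_model, d_sae, l1_coeff) and the plain L1 training objective are assumptions following the common setup described in the posts below, not the exact method of any single post.

```python
# Minimal sketch of a sparse autoencoder over model activations (assumed
# setup, not any specific post's implementation). Names are illustrative.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        # Encoder/decoder weights; d_sae is typically much larger than
        # d_model, so the learned feature dictionary is overcomplete.
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode: non-negative (and, after training, sparse) feature
        # activations for each input activation vector.
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the activation as a weighted sum of
        # feature directions (rows of W_dec).
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature
    # activations to zero -- the "sparse" in sparse autoencoder.
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

Training amounts to minimizing sae_loss on activations sampled from the model; the rows of W_dec are then the candidate interpretable feature directions.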

For more information on SAEs see:

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds · 5 Oct 2023 21:01 UTC
289 points
22 comments · 2 min read · LW link · 1 review
(transformer-circuits.pub)

[Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
154 points
23 comments · 22 min read · LW link · 2 reviews

Interpretability with Sparse Autoencoders (Colab exercises)

CallumMcDougall · 29 Nov 2023 12:56 UTC
76 points
9 comments · 4 min read · LW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom · 2 Feb 2024 6:54 UTC
103 points
37 comments · 15 min read · LW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

21 Sep 2023 15:30 UTC
159 points
8 comments · 5 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
85 points
9 comments · 18 min read · LW link

Attention SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
78 points
4 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
73 points
0 comments · 3 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · 18 Apr 2024 16:17 UTC
113 points
10 comments · 12 min read · LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
63 points
0 comments · 12 min read · LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
39 points
12 comments · 12 min read · LW link

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

25 Sep 2024 9:31 UTC
73 points
16 comments · 3 min read · LW link
(arxiv.org)

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · 3 Apr 2024 12:34 UTC
96 points
23 comments · 22 min read · LW link

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

29 Sep 2024 19:37 UTC
26 points
8 comments · 25 min read · LW link

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
69 points
2 comments · 14 min read · LW link

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide · 22 Jul 2024 18:45 UTC
118 points
20 comments · 12 min read · LW link

My best guess at the important tricks for training 1L SAEs

Arthur Conmy · 21 Dec 2023 1:59 UTC
37 points
4 comments · 3 min read · LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
80 points
10 comments · 8 min read · LW link

Comments on Anthropic’s Scaling Monosemanticity

Robert_AIZI · 3 Jun 2024 12:15 UTC
98 points
8 comments · 7 min read · LW link

SAE regularization produces more interpretable models

28 Jan 2025 20:02 UTC
21 points
7 comments · 4 min read · LW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
87 points
4 comments · 10 min read · LW link

SAE reconstruction errors are (empirically) pathological

wesg · 29 Mar 2024 16:37 UTC
106 points
16 comments · 8 min read · LW link

Scaling and evaluating sparse autoencoders

leogao · 6 Jun 2024 22:50 UTC
106 points
6 comments · 1 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
95 points
7 comments · 7 min read · LW link

SAEs (usually) Transfer Between Base and Chat Models

18 Jul 2024 10:29 UTC
67 points
0 comments · 10 min read · LW link

Cross-Layer Feature Alignment and Steering in Large Language Model

dlaptev · 8 Feb 2025 20:18 UTC
6 points
0 comments · 6 min read · LW link

SAE-VIS: Announcement Post

31 Mar 2024 15:30 UTC
74 points
8 comments · 1 min read · LW link

An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation

7 Oct 2024 8:53 UTC
40 points
1 comment · 5 min read · LW link
(arxiv.org)

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda · 23 Oct 2023 22:38 UTC
93 points
12 comments · 9 min read · LW link

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
109 points
2 comments · 4 min read · LW link

Showing SAE Latents Are Not Atomic Using Meta-SAEs

24 Aug 2024 0:56 UTC
72 points
10 comments · 20 min read · LW link

Investigating task-specific prompts and sparse autoencoders for activation monitoring

Henk Tillman · 30 Apr 2025 17:09 UTC
23 points
0 comments · 1 min read · LW link
(arxiv.org)

Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models

30 Dec 2024 22:50 UTC
24 points
3 comments · 15 min read · LW link

Excursions into Sparse Autoencoders: What is monosemanticity?

Jakub Smékal · 5 Aug 2024 19:22 UTC
2 points
0 comments · 10 min read · LW link

Interpretable Fine Tuning Research Update and Working Prototype

Matthew Khoriaty · 16 May 2025 3:44 UTC
9 points
0 comments · 4 min read · LW link

SAEs Discover Meaningful Features in the IOI Task

5 Jun 2024 23:48 UTC
15 points
2 comments · 10 min read · LW link

SAEs you can See: Applying Sparse Autoencoders to Clustering

Robert_AIZI · 28 Oct 2024 14:48 UTC
27 points
0 comments · 10 min read · LW link

A gentle introduction to sparse autoencoders

Nick Jiang · 2 Sep 2024 18:11 UTC
21 points
2 comments · 6 min read · LW link

An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs

Adam Karvonen · 25 Jun 2024 15:57 UTC
29 points
0 comments · 9 min read · LW link
(adamkarvonen.github.io)

SAE on activation differences

30 Jun 2025 17:50 UTC
44 points
3 comments · 5 min read · LW link

[Replication] Conjecture’s Sparse Coding in Small Transformers

16 Jun 2023 18:02 UTC
52 points
0 comments · 5 min read · LW link

The ‘strong’ feature hypothesis could be wrong

lewis smith · 2 Aug 2024 14:33 UTC
235 points
20 comments · 17 min read · LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
55 points
10 comments · 1 min read · LW link
(storage.googleapis.com)

Proof-of-Concept Debugger for a Small LLM

17 Mar 2025 22:27 UTC
27 points
0 comments · 11 min read · LW link

Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts

Ana Kapros · 12 Feb 2025 19:12 UTC
7 points
0 comments · 5 min read · LW link

HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix

11 Oct 2024 23:06 UTC
8 points
2 comments · 10 min read · LW link

[Linkpost] Play with SAEs on Llama 3

25 Sep 2024 22:35 UTC
40 points
2 comments · 1 min read · LW link

Causal Graphs of GPT-2-Small’s Residual Stream

David Udell · 9 Jul 2024 22:06 UTC
53 points
7 comments · 7 min read · LW link

How to Better Report Sparse Autoencoder Performance

J Bostock · 2 Jun 2024 19:34 UTC
20 points
4 comments · 3 min read · LW link

Analysis of Variational Sparse Autoencoders

Zach Baker · 23 Aug 2025 23:58 UTC
11 points
0 comments · 10 min read · LW link

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders

Gytis Daujotas · 1 Aug 2024 21:08 UTC
45 points
7 comments · 7 min read · LW link

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart · 23 Apr 2024 14:09 UTC
44 points
4 comments · 9 min read · LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

19 Jul 2024 20:32 UTC
59 points
6 comments · 16 min read · LW link

Tokenized SAEs: Infusing per-token biases.

4 Aug 2024 9:17 UTC
20 points
20 comments · 15 min read · LW link

Self-explaining SAE features

5 Aug 2024 22:20 UTC
62 points
13 comments · 10 min read · LW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

23 Aug 2024 18:52 UTC
42 points
8 comments · 16 min read · LW link

Interpreting and Steering Features in Images

Gytis Daujotas · 20 Jun 2024 18:33 UTC
67 points
6 comments · 5 min read · LW link

Attention Output SAEs Improve Circuit Analysis

21 Jun 2024 12:56 UTC
33 points
3 comments · 19 min read · LW link

Exploring SAE features in LLMs with definition trees and token lists

mwatkins · 4 Oct 2024 22:15 UTC
46 points
5 comments · 6 min read · LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda · 7 Jul 2024 17:39 UTC
139 points
16 comments · 25 min read · LW link

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill · 24 Mar 2024 20:05 UTC
30 points
4 comments · 24 min read · LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments · 1 min read · LW link
(arxiv.org)

Explaining GPT-2-Small Forward Passes with Edge-Level Autoencoder Circuits

22 Jul 2025 20:36 UTC
23 points
0 comments · 6 min read · LW link

On the Practical Applications of Interpretability

Nick Jiang · 15 Oct 2024 17:18 UTC
4 points
1 comment · 7 min read · LW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry · 29 Apr 2024 20:57 UTC
94 points
9 comments · 11 min read · LW link

Interpreting Preference Models w/ Sparse Autoencoders

1 Jul 2024 21:35 UTC
75 points
12 comments · 9 min read · LW link

Deceptive agents can collude to hide dangerous features in SAEs

15 Jul 2024 17:07 UTC
33 points
2 comments · 7 min read · LW link

Matryoshka Sparse Autoencoders

Noa Nabeshima · 14 Dec 2024 2:52 UTC
98 points
15 comments · 11 min read · LW link

Scaling Sparse Feature Circuit Finding to Gemma 9B

10 Jan 2025 11:08 UTC
86 points
11 comments · 17 min read · LW link

Sparse Autoencoder Features for Classifications and Transferability

Shan23Chen · 18 Feb 2025 22:14 UTC
5 points
0 comments · 1 min read · LW link
(arxiv.org)

SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features

Seonglae Cho · 26 Feb 2025 17:05 UTC
4 points
3 comments · 17 min read · LW link

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]

6 Jan 2025 4:22 UTC
19 points
0 comments · 12 min read · LW link

[Interim research report] Activation plateaus & sensitive directions in GPT2

5 Jul 2024 17:05 UTC
65 points
2 comments · 5 min read · LW link

Massive Activations and why <bos> is important in Tokenized SAE Unigrams

Louka Ewington-Pitsos · 5 Sep 2024 2:19 UTC
1 point
0 comments · 3 min read · LW link

A suite of Vision Sparse Autoencoders

27 Oct 2024 4:05 UTC
25 points
0 comments · 1 min read · LW link

[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

Fernando Avalos · 9 Sep 2024 3:33 UTC
6 points
1 comment · 1 min read · LW link
(forum.effectivealtruism.org)

Past Tense Features

Can · 20 Apr 2024 14:34 UTC
12 points
0 comments · 4 min read · LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy · 17 Jul 2023 1:41 UTC
57 points
1 comment · 7 min read · LW link

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

25 Sep 2024 20:37 UTC
29 points
0 comments · 3 min read · LW link
(arxiv.org)

Finding Features Causally Upstream of Refusal

14 Jan 2025 2:30 UTC
54 points
5 comments · 12 min read · LW link

[Question] Are Sparse Autoencoders a good idea for AI control?

Gerard Boxo · 26 Dec 2024 17:34 UTC
3 points
4 comments · 1 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
74 points
14 comments · 17 min read · LW link

Sparse Features Through Time

Rogan Inglis · 24 Jun 2024 18:06 UTC
12 points
1 comment · 1 min read · LW link
(roganinglis.io)

[Question] SAE sparse feature graph using only residual layers

Jaehyuk Lim · 23 May 2024 13:32 UTC
0 points
3 comments · 1 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
18 points
0 comments · 5 min read · LW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

20 Dec 2024 15:16 UTC
34 points
0 comments · 37 min read · LW link

Domain-specific SAEs

jacob_drori · 7 Oct 2024 20:15 UTC
28 points
2 comments · 5 min read · LW link

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

6 Sep 2024 2:28 UTC
28 points
0 comments · 12 min read · LW link

Understanding Positional Features in Layer 0 SAEs

29 Jul 2024 9:36 UTC
43 points
0 comments · 5 min read · LW link

The “Sparsity vs Reconstruction Tradeoff” Illusion

26 Aug 2025 4:39 UTC
13 points
0 comments · 4 min read · LW link

Some open-source dictionaries and dictionary learning infrastructure

Sam Marks · 5 Dec 2023 6:05 UTC
46 points
7 comments · 5 min read · LW link

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Lucy Farnik · 26 Feb 2025 12:50 UTC
79 points
8 comments · 7 min read · LW link

[Replication] Conjecture’s Sparse Coding in Toy Models

2 Jun 2023 17:34 UTC
24 points
0 comments · 1 min read · LW link

Negative Results on Group SAEs

Josh Engels · 6 May 2025 21:49 UTC
70 points
3 comments · 8 min read · LW link

Some additional SAE thoughts

Hoagy · 13 Jan 2024 19:31 UTC
31 points
4 comments · 13 min read · LW link

SAEs are highly dataset dependent: a case study on the refusal direction

7 Nov 2024 5:22 UTC
67 points
4 comments · 14 min read · LW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI · 5 Mar 2024 13:55 UTC
61 points
24 comments · 10 min read · LW link
(aizi.substack.com)

Toy Models of Feature Absorption in SAEs

7 Oct 2024 9:56 UTC
49 points
8 comments · 10 min read · LW link

Sparsely-connected Cross-layer Transcoders

jacob_drori · 18 Jun 2025 17:13 UTC
45 points
3 comments · 12 min read · LW link

[Replication] Crosscoder-based Stage-Wise Model Diffing

22 Mar 2025 18:35 UTC
19 points
0 comments · 7 min read · LW link

LLMs are likely not conscious

research_prime_space · 29 Sep 2024 20:57 UTC
6 points
9 comments · 1 min read · LW link

Initial Experiments Using SAEs to Help Detect AI Generated Text

Aaron_Scher · 22 Jul 2024 5:16 UTC
18 points
1 comment · 14 min read · LW link

Explaining “Taking features out of superposition with sparse autoencoders”

Robert_AIZI · 16 Jun 2023 13:59 UTC
10 points
0 comments · 8 min read · LW link
(aizi.substack.com)

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

26 Mar 2025 19:07 UTC
113 points
15 comments · 29 min read · LW link
(deepmindsafetyresearch.medium.com)

Empirical Insights into Feature Geometry in Sparse Autoencoders

Jason Boxi Zhang · 24 Jan 2025 19:02 UTC
7 points
0 comments · 11 min read · LW link

Analyzing how SAE features evolve across a forward pass

7 Nov 2024 22:07 UTC
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Toy Models of Superposition: Simplified by Hand

Axel Sorensen · 29 Sep 2024 21:19 UTC
9 points
3 comments · 8 min read · LW link

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

17 Aug 2024 1:16 UTC
53 points
0 comments · 5 min read · LW link

Learning Multi-Level Features with Matryoshka SAEs

19 Dec 2024 15:59 UTC
42 points
6 comments · 11 min read · LW link

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham · 23 Feb 2025 5:46 UTC
30 points
8 comments · 8 min read · LW link

Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning

Tom Angsten · 30 Jul 2024 16:36 UTC
6 points
0 comments · 9 min read · LW link

Sparse Autoencoders: Future Work

21 Sep 2023 15:30 UTC
35 points
5 comments · 6 min read · LW link

Normalizing Sparse Autoencoders

Fengyuan Hu · 8 Apr 2024 6:17 UTC
22 points
18 comments · 13 min read · LW link

BatchTopK: A Simple Improvement for TopK-SAEs

20 Jul 2024 2:20 UTC
61 points
0 comments · 4 min read · LW link

Feature Hedging: Another way correlated features break SAEs

25 Mar 2025 14:33 UTC
22 points
0 comments · 18 min read · LW link

An Introduction to SAEs and their Variants for Mech Interp

Adam Newgas · 19 Apr 2025 14:09 UTC
17 points
0 comments · 10 min read · LW link

Evolutionary prompt optimization for SAE feature visualization

14 Nov 2024 13:06 UTC
22 points
0 comments · 9 min read · LW link

Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning

Jeremias Ferrao · 18 Apr 2025 19:34 UTC
10 points
0 comments · 10 min read · LW link

It’s important to know when to stop: Mechanistic Exploration of Gemma 2 List Generation

Gerard Boxo · 14 Oct 2024 17:04 UTC
9 points
0 comments · 6 min read · LW link
(gboxo.github.io)

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders

Logan Riggs · 5 Jul 2023 16:49 UTC
60 points
1 comment · 7 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
123 points
18 comments · 18 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
82 points
6 comments · 2 min read · LW link
(www.neuronpedia.org)

Can SAE steering reveal sandbagging?

15 Apr 2025 12:33 UTC
35 points
3 comments · 4 min read · LW link

A small update to the Sparse Coding interim research report

30 Apr 2023 19:54 UTC
61 points
5 comments · 1 min read · LW link

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell · 23 Sep 2023 19:16 UTC
42 points
7 comments · 34 min read · LW link

What is the functional role of SAE errors?

20 Jun 2025 18:11 UTC
12 points
5 comments · 38 min read · LW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

14 Jan 2024 2:06 UTC
24 points
0 comments · 42 min read · LW link

A Bunch of Matryoshka SAEs

4 Apr 2025 14:53 UTC
29 points
0 comments · 8 min read · LW link

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache

Louka Ewington-Pitsos · 24 Aug 2024 7:39 UTC
17 points
0 comments · 5 min read · LW link

Weird Features in Protein LLMs: The Gram Lens

Jude Stiel · 14 Jul 2025 17:32 UTC
8 points
0 comments · 9 min read · LW link

Topological Data Analysis and Mechanistic Interpretability

Gunnar Carlsson · 24 Feb 2025 19:56 UTC
16 points
4 comments · 7 min read · LW link

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola Ayonrinde · 30 Oct 2024 22:50 UTC
27 points
0 comments · 12 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · 17 Nov 2023 13:54 UTC
15 points
6 comments · 2 min read · LW link

Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

15 Mar 2024 16:30 UTC
26 points
5 comments · 4 min read · LW link

Evaluating Sparse Autoencoders with Board Game Models

2 Aug 2024 19:50 UTC
38 points
1 comment · 9 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné · 23 Sep 2023 16:21 UTC
30 points
8 comments · 5 min read · LW link

Open Source Automated Interpretability for Sparse Autoencoder Features

30 Jul 2024 21:11 UTC
67 points
1 comment · 13 min read · LW link
(blog.eleuther.ai)

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen · 18 Feb 2025 22:16 UTC
8 points
2 comments · 10 min read · LW link
(www.lesswrong.com)

Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders

PaulPauls · 24 Nov 2024 5:45 UTC
19 points
3 comments · 1 min read · LW link
(github.com)

Towards data-centric interpretability with sparse autoencoders

15 Aug 2025 20:10 UTC
48 points
2 comments · 18 min read · LW link

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen · 5 Dec 2024 19:24 UTC
5 points
2 comments · 10 min read · LW link

Transformer Debugger

Henk Tillman · 12 Mar 2024 19:08 UTC
26 points
0 comments · 1 min read · LW link
(github.com)

L0 is not a neutral hyperparameter

19 Jul 2025 13:51 UTC
24 points
3 comments · 5 min read · LW link

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

eitan sprejer · 9 May 2025 21:29 UTC
4 points
1 comment · 6 min read · LW link

Interpretability of SAE Features Representing Check in ChessGPT

Jonathan Kutasov · 5 Oct 2024 20:43 UTC
27 points
2 comments · 8 min read · LW link

Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

Andrew Quaisley · 14 Jun 2024 0:57 UTC
17 points
5 comments · 12 min read · LW link

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · 7 Oct 2023 23:30 UTC
137 points
8 comments · 4 min read · LW link

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails

Louka Ewington-Pitsos · 17 Sep 2024 3:52 UTC
6 points
2 comments · 7 min read · LW link

SAE features for refusal and sycophancy steering vectors

12 Oct 2024 14:54 UTC
29 points
4 comments · 7 min read · LW link

Sparse autoencoders find composed features in small toy models

14 Mar 2024 18:00 UTC
33 points
12 comments · 15 min read · LW link

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
70 points
5 comments · 10 min read · LW link

Faithful vs Interpretable Sparse Autoencoder Evals

Louka Ewington-Pitsos · 12 Jul 2024 5:37 UTC
2 points
0 comments · 12 min read · LW link

Do sparse autoencoders find “true features”?

Demian Till · 22 Feb 2024 18:06 UTC
75 points
33 comments · 11 min read · LW link

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

30 Jun 2025 17:17 UTC
105 points
2 comments · 7 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024 16:25 UTC
57 points
20 comments · 4 min read · LW link
(arxiv.org)

Extracting SAE task features for in-context learning

12 Aug 2024 20:34 UTC
31 points
1 comment · 9 min read · LW link

Quick Thoughts on Scaling Monosemanticity

Joel Burget · 23 May 2024 16:22 UTC
28 points
1 comment · 4 min read · LW link
(transformer-circuits.pub)

SAE Probing: What is it good for?

1 Nov 2024 19:23 UTC
34 points
0 comments · 11 min read · LW link

Activation Pattern SVD: A proposal for SAE Interpretability

Daniel Tan · 28 Jun 2024 22:12 UTC
15 points
2 comments · 2 min read · LW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

27 Feb 2024 2:43 UTC
43 points
16 comments · 15 min read · LW link

Takeaways From Our Recent Work on SAE Probing

3 Mar 2025 19:50 UTC
30 points
4 comments · 5 min read · LW link