Sparse Autoencoders (SAEs)

Last edit: 6 Apr 2024 9:14 UTC by Joseph Bloom

Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sum of interpretable components (often referred to as features). Sparse Autoencoders may be useful for interpretability and related alignment agendas.
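
As a rough sketch of the standard recipe (assuming PyTorch; the class, function, and hyperparameter names here are illustrative, not taken from any particular post below): a single hidden layer wider than the activation space, a ReLU to keep feature coefficients non-negative, and an L1 penalty on those coefficients to encourage sparsity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Decomposes activations into a sparse sum of learned feature directions."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # d_features is typically several times d_model (an overcomplete basis).
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        # Non-negative feature coefficients via ReLU.
        features = F.relu(self.encoder(acts))
        # Reconstruction: a weighted sum of decoder directions, one per feature.
        recon = self.decoder(features)
        return recon, features

def sae_loss(acts, recon, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that induces sparsity.
    mse = (recon - acts).pow(2).mean()
    l1 = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1
```

Real training setups (see e.g. Towards Monosemanticity below) add further details, such as normalizing decoder columns and resampling dead features.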

For more information on SAEs, see:

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds · 5 Oct 2023 21:01 UTC
286 points
21 comments · 2 min read · LW link
(transformer-circuits.pub)

Intro to Superposition & Sparse Autoencoders (Colab exercises)

CallumMcDougall · 29 Nov 2023 12:56 UTC
65 points
8 comments · 3 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
82 points
5 comments · 19 min read · LW link

Attention SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
76 points
4 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
68 points
0 comments · 3 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · 18 Apr 2024 16:17 UTC
93 points
5 comments · 12 min read · LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
56 points
0 comments · 12 min read · LW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

21 Sep 2023 15:30 UTC
156 points
7 comments · 5 min read · LW link

[Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
137 points
22 comments · 22 min read · LW link · 2 reviews

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
53 points
0 comments · 14 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · 3 Apr 2024 12:34 UTC
85 points
20 comments · 22 min read · LW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom · 2 Feb 2024 6:54 UTC
92 points
37 comments · 15 min read · LW link

My best guess at the important tricks for training 1L SAEs

Arthur Conmy · 21 Dec 2023 1:59 UTC
35 points
4 comments · 3 min read · LW link

SAE reconstruction errors are (empirically) pathological

wesg · 29 Mar 2024 16:37 UTC
86 points
15 comments · 8 min read · LW link

SAE-VIS: Announcement Post

31 Mar 2024 15:30 UTC
73 points
8 comments · 1 min read · LW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
80 points
3 comments · 10 min read · LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
70 points
8 comments · 8 min read · LW link

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda · 23 Oct 2023 22:38 UTC
91 points
12 comments · 9 min read · LW link

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
106 points
2 comments · 4 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
86 points
5 comments · 7 min read · LW link

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart · 23 Apr 2024 14:09 UTC
29 points
2 comments · 8 min read · LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
41 points
19 comments · 1 min read · LW link
(arxiv.org)

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill · 24 Mar 2024 20:05 UTC
28 points
3 comments · 24 min read · LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy · 17 Jul 2023 1:41 UTC
54 points
1 comment · 7 min read · LW link

Some open-source dictionaries and dictionary learning infrastructure

Sam Marks · 5 Dec 2023 6:05 UTC
45 points
7 comments · 5 min read · LW link

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders

Logan Riggs · 5 Jul 2023 16:49 UTC
58 points
1 comment · 7 min read · LW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI · 5 Mar 2024 13:55 UTC
52 points
24 comments · 10 min read · LW link
(aizi.substack.com)

Sparse autoencoders find composed features in small toy models

14 Mar 2024 18:00 UTC
28 points
10 comments · 15 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · 17 Nov 2023 13:54 UTC
15 points
6 comments · 2 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné · 23 Sep 2023 16:21 UTC
29 points
8 comments · 5 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
11 points
0 comments · 5 min read · LW link

Explaining “Taking features out of superposition with sparse autoencoders”

Robert_AIZI · 16 Jun 2023 13:59 UTC
9 points
0 comments · 8 min read · LW link
(aizi.substack.com)

Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin Farrell · 15 Apr 2024 18:21 UTC
28 points
7 comments · 12 min read · LW link

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · 7 Oct 2023 23:30 UTC
136 points
8 comments · 4 min read · LW link

A small update to the Sparse Coding interim research report

30 Apr 2023 19:54 UTC
61 points
5 comments · 1 min read · LW link

[Replication] Conjecture’s Sparse Coding in Toy Models

2 Jun 2023 17:34 UTC
23 points
0 comments · 1 min read · LW link

[Replication] Conjecture’s Sparse Coding in Small Transformers

16 Jun 2023 18:02 UTC
52 points
0 comments · 5 min read · LW link

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
68 points
5 comments · 10 min read · LW link

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell · 23 Sep 2023 19:16 UTC
42 points
7 comments · 34 min read · LW link

Explainer—AutoInterpretation Finds Sparse Coding Beats Alternatives

Gauraventh · 1 Aug 2023 17:29 UTC
8 points
0 comments · 3 min read · LW link

Transformer Debugger

Henk Tillman · 12 Mar 2024 19:08 UTC
25 points
0 comments · 1 min read · LW link
(github.com)

Some additional SAE thoughts

Hoagy · 13 Jan 2024 19:31 UTC
28 points
4 comments · 13 min read · LW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

14 Jan 2024 2:06 UTC
22 points
0 comments · 42 min read · LW link

Normalizing Sparse Autoencoders

Fengyuan Hu · 8 Apr 2024 6:17 UTC
12 points
17 comments · 13 min read · LW link

[Research Update] Sparse Autoencoder features are bimodal

Robert_AIZI · 22 Jun 2023 13:15 UTC
23 points
1 comment · 5 min read · LW link
(aizi.substack.com)

Past Tense Features

Can · 20 Apr 2024 14:34 UTC
11 points
0 comments · 4 min read · LW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

27 Feb 2024 2:43 UTC
39 points
16 comments · 15 min read · LW link

Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

15 Mar 2024 16:30 UTC
12 points
5 comments · 4 min read · LW link

Sparse Autoencoders: Future Work

21 Sep 2023 15:30 UTC
34 points
5 comments · 6 min read · LW link

Do sparse autoencoders find “true features”?

Demian Till · 22 Feb 2024 18:06 UTC
70 points
33 comments · 11 min read · LW link