Logan Riggs

Karma: 2,283

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

21 Sep 2023 15:30 UTC
156 points
7 comments · 5 min read · LW link

Convincing All Capability Researchers

Logan Riggs · 8 Apr 2022 17:40 UTC
120 points
70 comments · 3 min read · LW link

Trauma, Meditation, and a Cool Scar

Logan Riggs · 6 Aug 2019 16:17 UTC
101 points
17 comments · 5 min read · LW link · 1 review

Make a Movie Showing Alignment Failures

Logan Riggs · 13 Apr 2022 21:54 UTC
75 points
11 comments · 2 min read · LW link

Really Strong Features Found in Residual Stream

Logan Riggs · 8 Jul 2023 19:40 UTC
68 points
6 comments · 2 min read · LW link

Wanting to Succeed on Every Metric Presented

Logan Riggs · 12 Apr 2021 20:43 UTC
68 points
25 comments · 3 min read · LW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

3 Sep 2020 18:27 UTC
67 points
11 comments · 2 min read · LW link

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
66 points
5 comments · 10 min read · LW link

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders

Logan Riggs · 5 Jul 2023 16:49 UTC
58 points
1 comment · 7 min read · LW link

Convincing People of Alignment with Street Epistemology

Logan Riggs · 12 Apr 2022 23:43 UTC
54 points
4 comments · 3 min read · LW link

Today a Tragedy

Logan Riggs · 13 Jun 2018 1:58 UTC
54 points
17 comments · 1 min read · LW link

Saving the world in 80 days: Epilogue

Logan Riggs · 28 Jul 2018 17:04 UTC
51 points
14 comments · 2 min read · LW link

A survey of tool use and workflows in alignment research

23 Mar 2022 23:44 UTC
45 points
4 comments · 1 min read · LW link

Mapping Out Alignment

15 Aug 2020 1:02 UTC
43 points
0 comments · 5 min read · LW link

Was Releasing Claude-3 Net-Negative?

Logan Riggs · 27 Mar 2024 17:41 UTC
42 points
4 comments · 4 min read · LW link

Solve Corrigibility Week

Logan Riggs · 28 Nov 2021 17:00 UTC
39 points
21 comments · 1 min read · LW link

Kissing Scars

Logan Riggs · 9 May 2019 16:00 UTC
39 points
1 comment · 1 min read · LW link

Sparse Autoencoders: Future Work

21 Sep 2023 15:30 UTC
34 points
5 comments · 6 min read · LW link

Language Model Tools for Alignment Research

Logan Riggs · 8 Apr 2022 17:32 UTC
28 points
0 comments · 2 min read · LW link