Logan Riggs

Karma: 2,283

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Logan Riggs, Hoagy, Aidan Ewart and Robert_AIZI

21 Sep 2023 15:30 UTC

156 points

7 comments, 5 min read, LW link

Convincing All Capability Researchers

Logan Riggs

8 Apr 2022 17:40 UTC

120 points

70 comments, 3 min read, LW link

Trauma, Meditation, and a Cool Scar

Logan Riggs

6 Aug 2019 16:17 UTC

101 points

17 comments, 5 min read, LW link, 1 review

Make a Movie Showing Alignment Failures

Logan Riggs

13 Apr 2022 21:54 UTC

75 points

11 comments, 2 min read, LW link

Really Strong Features Found in Residual Stream

Logan Riggs

8 Jul 2023 19:40 UTC

68 points

6 comments, 2 min read, LW link

Wanting to Succeed on Every Metric Presented

Logan Riggs

12 Apr 2021 20:43 UTC

68 points

25 comments, 3 min read, LW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

Logan Riggs and Gurkenglas

3 Sep 2020 18:27 UTC

67 points

11 comments, 2 min read, LW link

Finding Sparse Linear Connections between Features in LLMs

Logan Riggs, Sam Mitchell and Eccentricity

9 Dec 2023 2:27 UTC

66 points

5 comments, 10 min read, LW link

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders

Logan Riggs

5 Jul 2023 16:49 UTC

58 points

1 comment, 7 min read, LW link

Convincing People of Alignment with Street Epistemology

Logan Riggs

12 Apr 2022 23:43 UTC

54 points

4 comments, 3 min read, LW link

Today a Tragedy

Logan Riggs

13 Jun 2018 1:58 UTC

54 points

17 comments, 1 min read, LW link

Saving the world in 80 days: Epilogue

Logan Riggs

28 Jul 2018 17:04 UTC

51 points

14 comments, 2 min read, LW link

A survey of tool use and workflows in alignment research

Logan Riggs, Jan, janus and jacquesthibs

23 Mar 2022 23:44 UTC

45 points

4 comments, 1 min read, LW link

Logan Riggs 25 Oct 2021 18:36 UTC
44 points
on: Self-Integrity and the Drowning Child
At a pond, my niece was in a child floaty, reached too far, and flipped over into the water. I slammed my half-eaten sandwich on my brother’s chest, hoping he would grab it, then ran into the water and saved her.
She was fine and I got to finish my sandwich.

Mapping Out Alignment

Logan Riggs, adamShimi, Gurkenglas, AlexMennen and Gyrodiot

15 Aug 2020 1:02 UTC

43 points

0 comments, 5 min read, LW link

Was Releasing Claude-3 Net-Negative?

Logan Riggs

27 Mar 2024 17:41 UTC

42 points

4 comments, 4 min read, LW link

Solve Corrigibility Week

Logan Riggs

28 Nov 2021 17:00 UTC

39 points

21 comments, 1 min read, LW link

Kissing Scars

Logan Riggs

9 May 2019 16:00 UTC

39 points

1 comment, 1 min read, LW link

Sparse Autoencoders: Future Work

Logan Riggs and Aidan Ewart

21 Sep 2023 15:30 UTC

34 points

5 comments, 6 min read, LW link

Language Model Tools for Alignment Research

Logan Riggs

8 Apr 2022 17:32 UTC

28 points

0 comments, 2 min read, LW link