
Interpretability (ML & AI)

Last edit: 10 Nov 2023 13:11 UTC by niplav

Transparency and interpretability refer to the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable. You can use a model’s output, but the model cannot tell you why it produced that output. This makes it hard to determine the causes of biases in ML models.

A prominent subfield of neural network interpretability is mechanistic interpretability, which attempts to understand the internal algorithms by which neural networks perform their tasks, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that seek to attribute an output to parts of a specific input, such as identifying which pixels in an input image caused a computer vision model to output the classification “horse”.
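To make the contrast concrete, here is a minimal sketch of the second, attribution-style approach: a gradient saliency map that scores how much each input pixel influenced a classifier’s chosen output. It assumes a recent PyTorch and torchvision are available; the untrained ResNet, random image, and class index are illustrative placeholders rather than anything drawn from the posts below.

```python
# Minimal gradient-saliency sketch (input attribution), assuming PyTorch/torchvision.
# The model and image are placeholders; in practice you would load pretrained weights
# and a real photo (e.g. one the model classifies as "horse").
import torch
import torchvision.models as models

model = models.resnet18(weights=None)  # placeholder; use pretrained weights in practice
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image

logits = model(image)
target_class = logits.argmax(dim=1).item()  # the class whose output we want to explain

# Backpropagate that single logit to the input pixels.
logits[0, target_class].backward()

# Pixels with large gradient magnitude had the most influence on this output.
saliency = image.grad.abs().max(dim=1).values  # shape: (1, 224, 224)
print(saliency.shape)
```

Mechanistic interpretability, by contrast, looks inside the network itself (at weights, neurons, and circuits) rather than only at input-output sensitivities like this.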

See Also

Research

Mechanis­tic Trans­parency for Ma­chine Learning

DanielFilan11 Jul 2018 0:34 UTC
54 points
9 comments4 min readLW link

Ver­ifi­ca­tion and Transparency

DanielFilan8 Aug 2019 1:50 UTC
35 points
6 comments2 min readLW link
(danielfilan.com)

Un­der­stand­ing understanding

mthq23 Aug 2019 18:10 UTC
24 points
1 comment2 min readLW link

One Way to Think About ML Transparency

Matthew Barnett2 Sep 2019 23:27 UTC
26 points
28 comments5 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub10 Sep 2019 23:03 UTC
69 points
27 comments27 min readLW link

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
104 points
39 comments3 min readLW link2 reviews

Ro­hin Shah on rea­sons for AI optimism

abergal31 Oct 2019 12:10 UTC
40 points
58 comments1 min readLW link
(aiimpacts.org)

Chris Olah’s views on AGI safety

evhub1 Nov 2019 20:13 UTC
206 points
38 comments12 min readLW link2 reviews

Will trans­parency help catch de­cep­tion? Per­haps not

Matthew Barnett4 Nov 2019 20:52 UTC
43 points
5 comments7 min readLW link

An An­a­lytic Per­spec­tive on AI Alignment

DanielFilan1 Mar 2020 4:10 UTC
54 points
45 comments8 min readLW link
(danielfilan.com)

What is In­ter­pretabil­ity?

17 Mar 2020 20:23 UTC
35 points
0 comments11 min readLW link

How can In­ter­pretabil­ity help Align­ment?

23 May 2020 16:16 UTC
37 points
3 comments9 min readLW link

Spar­sity and in­ter­pretabil­ity?

1 Jun 2020 13:25 UTC
41 points
3 comments7 min readLW link

In­ter­pretabil­ity in ML: A Broad Overview

lifelonglearner4 Aug 2020 19:03 UTC
53 points
5 comments15 min readLW link

Search ver­sus design

Alex Flint16 Aug 2020 16:53 UTC
100 points
40 comments36 min readLW link1 review

in­ter­pret­ing GPT: the logit lens

nostalgebraist31 Aug 2020 2:47 UTC
202 points
34 comments11 min readLW link

Us­ing GPT-N to Solve In­ter­pretabil­ity of Neu­ral Net­works: A Re­search Agenda

3 Sep 2020 18:27 UTC
67 points
11 comments2 min readLW link

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
137 points
39 comments11 min readLW link2 reviews

Multi-di­men­sional re­wards for AGI in­ter­pretabil­ity and control

Steven Byrnes4 Jan 2021 3:08 UTC
19 points
8 comments10 min readLW link

Trans­parency and AGI safety

jylin0411 Jan 2021 18:51 UTC
54 points
12 comments30 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob Bensinger5 Mar 2021 23:43 UTC
142 points
13 comments26 min readLW link

Trans­parency Trichotomy

Mark Xu28 Mar 2021 20:26 UTC
25 points
2 comments7 min readLW link

Solv­ing the whole AGI con­trol prob­lem, ver­sion 0.0001

Steven Byrnes8 Apr 2021 15:14 UTC
63 points
7 comments26 min readLW link

Opinions on In­ter­pretable Ma­chine Learn­ing and 70 Sum­maries of Re­cent Papers

9 Apr 2021 19:19 UTC
141 points
17 comments102 min readLW link

Challenge: know ev­ery­thing that the best go bot knows about go

DanielFilan11 May 2021 5:10 UTC
48 points
113 comments2 min readLW link
(danielfilan.com)

Knowl­edge Neu­rons in Pre­trained Transformers

evhub17 May 2021 22:54 UTC
100 points
7 comments2 min readLW link
(arxiv.org)

Spec­u­la­tions against GPT-n writ­ing al­ign­ment papers

Donald Hobson7 Jun 2021 21:13 UTC
31 points
6 comments2 min readLW link

Try­ing to ap­prox­i­mate Statis­ti­cal Models as Scor­ing Tables

Jsevillamol29 Jun 2021 17:20 UTC
18 points
2 comments9 min readLW link

Garrabrant and Shah on hu­man mod­el­ing in AGI

Rob Bensinger4 Aug 2021 4:35 UTC
60 points
10 comments47 min readLW link

Neu­ral net /​ de­ci­sion tree hy­brids: a po­ten­tial path to­ward bridg­ing the in­ter­pretabil­ity gap

Nathan Helm-Burger23 Sep 2021 0:38 UTC
21 points
2 comments12 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

Interpretability

29 Oct 2021 7:28 UTC
59 points
13 comments12 min readLW link

Com­ments on OpenPhil’s In­ter­pretabil­ity RFP

paulfchristiano5 Nov 2021 22:36 UTC
91 points
5 comments7 min readLW link

Pos­si­ble re­search di­rec­tions to im­prove the mechanis­tic ex­pla­na­tion of neu­ral networks

delton1379 Nov 2021 2:36 UTC
30 points
8 comments9 min readLW link

[linkpost] Ac­qui­si­tion of Chess Knowl­edge in AlphaZero

Quintin Pope23 Nov 2021 7:55 UTC
8 points
1 comment1 min readLW link

Let’s buy out Cyc, for use in AGI in­ter­pretabil­ity sys­tems?

Steven Byrnes7 Dec 2021 20:46 UTC
49 points
10 comments2 min readLW link

In­tro­duc­tion to in­ac­cessible information

Ryan Kidd9 Dec 2021 1:28 UTC
27 points
6 comments8 min readLW link

Teaser: Hard-cod­ing Trans­former Models

MadHatter12 Dec 2021 22:04 UTC
74 points
19 comments1 min readLW link

Hard-Cod­ing Neu­ral Computation

MadHatter13 Dec 2021 4:35 UTC
34 points
8 comments27 min readLW link

Solv­ing In­ter­pretabil­ity Week

Logan Riggs13 Dec 2021 17:09 UTC
11 points
5 comments1 min readLW link

The Nat­u­ral Ab­strac­tion Hy­poth­e­sis: Im­pli­ca­tions and Evidence

CallumMcDougall14 Dec 2021 23:14 UTC
37 points
8 comments19 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
127 points
9 comments15 min readLW link

The Case for Rad­i­cal Op­ti­mism about Interpretability

Quintin Pope16 Dec 2021 23:38 UTC
66 points
16 comments8 min readLW link1 review

Trans­former Circuits

evhub22 Dec 2021 21:09 UTC
144 points
4 comments3 min readLW link
(transformer-circuits.pub)

Mechanis­tic In­ter­pretabil­ity for the MLP Lay­ers (rough early thoughts)

MadHatter24 Dec 2021 7:24 UTC
11 points
2 comments1 min readLW link
(www.youtube.com)

An Open Philan­thropy grant pro­posal: Causal rep­re­sen­ta­tion learn­ing of hu­man preferences

PabloAMC11 Jan 2022 11:28 UTC
19 points
6 comments8 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric Drexler3 Feb 2022 15:20 UTC
85 points
12 comments11 min readLW link1 review

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron Berg12 Feb 2022 19:13 UTC
5 points
1 comment7 min readLW link

A Longlist of The­o­ries of Im­pact for Interpretability

Neel Nanda11 Mar 2022 14:55 UTC
127 points
35 comments5 min readLW link2 reviews

Progress Re­port 1: in­ter­pretabil­ity ex­per­i­ments & learn­ing, test­ing com­pres­sion hypotheses

Nathan Helm-Burger22 Mar 2022 20:12 UTC
11 points
0 comments2 min readLW link

[In­tro to brain-like-AGI safety] 9. Take­aways from neuro 2/​2: On AGI motivation

Steven Byrnes23 Mar 2022 12:48 UTC
44 points
11 comments21 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

KevinRoWang29 Mar 2022 20:09 UTC
70 points
4 comments6 min readLW link

Progress Re­port 2

Nathan Helm-Burger30 Mar 2022 2:29 UTC
4 points
1 comment1 min readLW link

Progress re­port 3: clus­ter­ing trans­former neurons

Nathan Helm-Burger5 Apr 2022 23:13 UTC
5 points
0 comments2 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_c7 Apr 2022 13:46 UTC
11 points
0 comments7 min readLW link

Take­aways From 3 Years Work­ing In Ma­chine Learning

George3d68 Apr 2022 17:14 UTC
34 points
10 comments11 min readLW link
(www.epistem.ink)

Progress Re­port 4: logit lens redux

Nathan Helm-Burger8 Apr 2022 18:35 UTC
3 points
0 comments2 min readLW link

Another list of the­o­ries of im­pact for interpretability

Beth Barnes13 Apr 2022 13:29 UTC
33 points
1 comment5 min readLW link

The case for be­com­ing a black-box in­ves­ti­ga­tor of lan­guage models

Buck6 May 2022 14:35 UTC
125 points
20 comments3 min readLW link

In­tro­duc­tion to the se­quence: In­ter­pretabil­ity Re­search for the Most Im­por­tant Century

Evan R. Murphy12 May 2022 19:59 UTC
16 points
0 comments8 min readLW link

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
53 points
0 comments59 min readLW link

CNN fea­ture vi­su­al­iza­tion in 50 lines of code

StefanHex26 May 2022 11:02 UTC
17 points
4 comments5 min readLW link

Deep Learn­ing Sys­tems Are Not Less In­ter­pretable Than Logic/​Prob­a­bil­ity/​Etc

johnswentworth4 Jun 2022 5:41 UTC
142 points
53 comments2 min readLW link1 review

Thoughts on For­mal­iz­ing Composition

Tom Lieberum7 Jun 2022 7:51 UTC
13 points
0 comments7 min readLW link

Re­search Ques­tions from Stained Glass Windows

StefanHex8 Jun 2022 12:38 UTC
4 points
0 comments2 min readLW link

How Do Selec­tion The­o­rems Re­late To In­ter­pretabil­ity?

johnswentworth9 Jun 2022 19:39 UTC
60 points
14 comments3 min readLW link

Progress Re­port 6: get the tool working

Nathan Helm-Burger10 Jun 2022 11:18 UTC
4 points
0 comments2 min readLW link

[Question] Can you MRI a deep learn­ing model?

Yair Halberstadt13 Jun 2022 13:43 UTC
3 points
3 comments1 min readLW link

A trans­parency and in­ter­pretabil­ity tech tree

evhub16 Jun 2022 23:44 UTC
163 points
11 comments18 min readLW link1 review

An­thropic’s SoLU (Soft­max Lin­ear Unit)

Joel Burget4 Jul 2022 18:38 UTC
21 points
1 comment4 min readLW link
(transformer-circuits.pub)

Deep neu­ral net­works are not opaque.

jem-mosig6 Jul 2022 18:03 UTC
22 points
14 comments3 min readLW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
50 points
15 comments8 min readLW link

Vi­su­al­iz­ing Neu­ral net­works, how to blame the bias

Donald Hobson9 Jul 2022 15:52 UTC
7 points
1 comment6 min readLW link

Cir­cum­vent­ing in­ter­pretabil­ity: How to defeat mind-readers

Lee Sharkey14 Jul 2022 16:59 UTC
112 points
12 comments33 min readLW link

How In­ter­pretabil­ity can be Impactful

Connall Garrod18 Jul 2022 0:06 UTC
18 points
0 comments37 min readLW link

Find­ing Skele­tons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments7 min readLW link

[Question] How op­ti­mistic should we be about AI figur­ing out how to in­ter­pret it­self?

oh5432125 Jul 2022 22:09 UTC
3 points
1 comment1 min readLW link

Pre­cur­sor check­ing for de­cep­tive alignment

evhub3 Aug 2022 22:56 UTC
24 points
0 comments14 min readLW link

In­ter­pretabil­ity isn’t Free

Joel Burget4 Aug 2022 15:02 UTC
10 points
1 comment2 min readLW link

In­ter­pretabil­ity/​Tool-ness/​Align­ment/​Cor­rigi­bil­ity are not Composable

johnswentworth8 Aug 2022 18:05 UTC
129 points
13 comments3 min readLW link

How To Go From In­ter­pretabil­ity To Align­ment: Just Re­tar­get The Search

johnswentworth10 Aug 2022 16:08 UTC
174 points
33 comments3 min readLW link1 review

Dis­sected boxed AI

Nathan112312 Aug 2022 2:37 UTC
−8 points
2 comments1 min readLW link

AI Trans­parency: Why it’s crit­i­cal and how to ob­tain it.

Zohar Jackson14 Aug 2022 10:31 UTC
6 points
1 comment5 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of Grokking

15 Aug 2022 2:41 UTC
368 points
47 comments36 min readLW link1 review
(colab.research.google.com)

What Makes an Idea Un­der­stand­able? On Ar­chi­tec­turally and Cul­turally Nat­u­ral Ideas.

16 Aug 2022 2:09 UTC
21 points
2 comments16 min readLW link

In­ter­pretabil­ity Tools Are an At­tack Channel

Thane Ruthenis17 Aug 2022 18:47 UTC
42 points
14 comments1 min readLW link

What Makes A Good Mea­sure­ment De­vice?

johnswentworth24 Aug 2022 22:45 UTC
37 points
7 comments2 min readLW link

The Shard The­ory Align­ment Scheme

David Udell25 Aug 2022 4:52 UTC
47 points
32 comments2 min readLW link

Tak­ing the pa­ram­e­ters which seem to mat­ter and ro­tat­ing them un­til they don’t

Garrett Baker26 Aug 2022 18:26 UTC
120 points
48 comments1 min readLW link

In­for­mal se­man­tics and Orders

Q Home27 Aug 2022 4:17 UTC
14 points
10 comments26 min readLW link

Search­ing for Mo­du­lar­ity in Large Lan­guage Models

8 Sep 2022 2:25 UTC
44 points
3 comments14 min readLW link

A rough idea for solv­ing ELK: An ap­proach for train­ing gen­er­al­ist agents like GATO to make plans and de­scribe them to hu­mans clearly and hon­estly.

Michael Soareverix8 Sep 2022 15:20 UTC
2 points
2 comments2 min readLW link

Swap and Scale

Stephen Fowler9 Sep 2022 22:41 UTC
17 points
3 comments1 min readLW link

[Linkpost] A sur­vey on over 300 works about in­ter­pretabil­ity in deep networks

scasper12 Sep 2022 19:07 UTC
97 points
7 comments2 min readLW link
(arxiv.org)

Try­ing to find the un­der­ly­ing struc­ture of com­pu­ta­tional systems

Matthias G. Mayer13 Sep 2022 21:16 UTC
17 points
9 comments4 min readLW link

Co­or­di­nate-Free In­ter­pretabil­ity Theory

johnswentworth14 Sep 2022 23:33 UTC
50 points
16 comments5 min readLW link

A Bite Sized In­tro­duc­tion to ELK

Luk2718217 Sep 2022 0:28 UTC
5 points
0 comments6 min readLW link

Sparse tri­nary weighted RNNs as a path to bet­ter lan­guage model interpretability

Am8ryllis17 Sep 2022 19:48 UTC
19 points
13 comments3 min readLW link

Toy Models of Superposition

evhub21 Sep 2022 23:48 UTC
68 points
4 comments5 min readLW link1 review
(transformer-circuits.pub)

Math­e­mat­i­cal Cir­cuits in Neu­ral Networks

Sean Osier22 Sep 2022 3:48 UTC
34 points
4 comments1 min readLW link
(www.youtube.com)

In­ter­pret­ing Neu­ral Net­works through the Poly­tope Lens

23 Sep 2022 17:58 UTC
136 points
29 comments33 min readLW link

[Question] Papers to start get­ting into NLP-fo­cused al­ign­ment research

Feraidoon24 Sep 2022 23:53 UTC
6 points
0 comments1 min readLW link

QAPR 3: in­ter­pretabil­ity-guided train­ing of neu­ral nets

Quintin Pope28 Sep 2022 16:02 UTC
58 points
2 comments10 min readLW link

Vi­su­al­iz­ing Learned Rep­re­sen­ta­tions of Rice Disease

muhia_bee3 Oct 2022 9:09 UTC
7 points
0 comments4 min readLW link
(indecisive-sand-24a.notion.site)

Re­call and Re­gur­gi­ta­tion in GPT2

Megan Kinniment3 Oct 2022 19:35 UTC
43 points
1 comment26 min readLW link

More Re­cent Progress in the The­ory of Neu­ral Networks

jylin046 Oct 2022 16:57 UTC
81 points
6 comments4 min readLW link

Poly­se­man­tic­ity and Ca­pac­ity in Neu­ral Networks

7 Oct 2022 17:51 UTC
87 points
14 comments3 min readLW link

My ten­ta­tive in­ter­pretabil­ity re­search agenda—topol­ogy match­ing.

Maxwell Clarke8 Oct 2022 22:14 UTC
10 points
2 comments4 min readLW link

Nat­u­ral Cat­e­gories Update

Logan Zoellner10 Oct 2022 15:19 UTC
33 points
6 comments2 min readLW link

Ar­ti­cle Re­view: Google’s AlphaTensor

Robert_AIZI12 Oct 2022 18:04 UTC
8 points
4 comments10 min readLW link

[Question] Pre­vi­ous Work on Re­cre­at­ing Neu­ral Net­work In­put from In­ter­me­di­ate Layer Activations

bglass12 Oct 2022 19:28 UTC
1 point
3 comments1 min readLW link

Help out Red­wood Re­search’s in­ter­pretabil­ity team by find­ing heuris­tics im­ple­mented by GPT-2 small

12 Oct 2022 21:25 UTC
50 points
11 comments4 min readLW link

An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers

Neel Nanda18 Oct 2022 21:08 UTC
70 points
5 comments12 min readLW link
(www.neelnanda.io)

A Bare­bones Guide to Mechanis­tic In­ter­pretabil­ity Prerequisites

Neel Nanda24 Oct 2022 20:45 UTC
63 points
12 comments3 min readLW link
(neelnanda.io)

A Walk­through of A Math­e­mat­i­cal Frame­work for Trans­former Circuits

Neel Nanda25 Oct 2022 20:24 UTC
51 points
7 comments1 min readLW link
(www.youtube.com)

Some Les­sons Learned from Study­ing Indi­rect Ob­ject Iden­ti­fi­ca­tion in GPT-2 small

28 Oct 2022 23:55 UTC
99 points
9 comments9 min readLW link2 reviews
(arxiv.org)

[Book] In­ter­pretable Ma­chine Learn­ing: A Guide for Mak­ing Black Box Models Explainable

Esben Kran31 Oct 2022 11:38 UTC
20 points
1 comment1 min readLW link
(christophm.github.io)

“Cars and Elephants”: a hand­wavy ar­gu­ment/​anal­ogy against mechanis­tic interpretability

David Scott Krueger (formerly: capybaralet)31 Oct 2022 21:26 UTC
48 points
25 comments2 min readLW link

Au­dit­ing games for high-level interpretability

Paul Colognese1 Nov 2022 10:44 UTC
34 points
1 comment7 min readLW link

Real-Time Re­search Record­ing: Can a Trans­former Re-Derive Po­si­tional Info?

Neel Nanda1 Nov 2022 23:56 UTC
69 points
16 comments1 min readLW link
(youtu.be)

A Mys­tery About High Di­men­sional Con­cept Encoding

Fabien Roger3 Nov 2022 17:05 UTC
46 points
13 comments7 min readLW link

Mechanis­tic In­ter­pretabil­ity as Re­v­erse Eng­ineer­ing (fol­low-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet)3 Nov 2022 23:19 UTC
28 points
3 comments1 min readLW link

Toy Models and Tegum Products

Adam Jermyn4 Nov 2022 18:51 UTC
28 points
7 comments5 min readLW link

A Walk­through of In­ter­pretabil­ity in the Wild (w/​ au­thors Kevin Wang, Arthur Conmy & Alexan­dre Variengien)

Neel Nanda7 Nov 2022 22:39 UTC
30 points
15 comments3 min readLW link
(youtu.be)

Why I’m Work­ing On Model Ag­nos­tic Interpretability

Jessica Rumbelow11 Nov 2022 9:24 UTC
26 points
9 comments2 min readLW link

The limited up­side of interpretability

Peter S. Park15 Nov 2022 18:46 UTC
13 points
11 comments1 min readLW link

Cur­rent themes in mechanis­tic in­ter­pretabil­ity research

16 Nov 2022 14:14 UTC
89 points
2 comments12 min readLW link

The Ground Truth Prob­lem (Or, Why Eval­u­at­ing In­ter­pretabil­ity Meth­ods Is Hard)

Jessica Rumbelow17 Nov 2022 11:06 UTC
27 points
2 comments2 min readLW link

Eng­ineer­ing Monose­man­tic­ity in Toy Models

18 Nov 2022 1:43 UTC
75 points
7 comments3 min readLW link
(arxiv.org)

By De­fault, GPTs Think In Plain Sight

Fabien Roger19 Nov 2022 19:15 UTC
85 points
33 comments9 min readLW link

A Walk­through of In-Con­text Learn­ing and In­duc­tion Heads (w/​ Charles Frye) Part 1 of 2

Neel Nanda22 Nov 2022 17:12 UTC
20 points
0 comments1 min readLW link
(www.youtube.com)

The Sin­gu­lar Value De­com­po­si­tions of Trans­former Weight Ma­tri­ces are Highly Interpretable

28 Nov 2022 12:54 UTC
194 points
32 comments31 min readLW link

Search­ing for Search

28 Nov 2022 15:31 UTC
86 points
7 comments14 min readLW link1 review

Us­ing mechanis­tic in­ter­pretabil­ity to find in-dis­tri­bu­tion failure in toy transformers

Charlie George28 Nov 2022 19:39 UTC
6 points
0 comments4 min readLW link

Multi-Com­po­nent Learn­ing and S-Curves

30 Nov 2022 1:37 UTC
61 points
24 comments7 min readLW link

The­o­ries of im­pact for Science of Deep Learning

Marius Hobbhahn1 Dec 2022 14:39 UTC
21 points
0 comments11 min readLW link

Find­ing gliders in the game of life

paulfchristiano1 Dec 2022 20:40 UTC
101 points
7 comments16 min readLW link
(ai-alignment.com)

The Plan − 2022 Update

johnswentworth1 Dec 2022 20:43 UTC
239 points
37 comments8 min readLW link1 review

Re-Ex­am­in­ing LayerNorm

Eric Winsor1 Dec 2022 22:20 UTC
124 points
12 comments5 min readLW link

Sub­sets and quo­tients in interpretability

Erik Jenner2 Dec 2022 23:13 UTC
26 points
1 comment7 min readLW link

Causal Scrub­bing: a method for rigor­ously test­ing in­ter­pretabil­ity hy­pothe­ses [Red­wood Re­search]

3 Dec 2022 0:58 UTC
195 points
35 comments20 min readLW link1 review

Causal scrub­bing: Appendix

3 Dec 2022 0:58 UTC
17 points
4 comments20 min readLW link

Causal scrub­bing: re­sults on a paren bal­ance checker

3 Dec 2022 0:59 UTC
34 points
2 comments30 min readLW link

Causal scrub­bing: re­sults on in­duc­tion heads

3 Dec 2022 0:59 UTC
34 points
1 comment17 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibs5 Dec 2022 13:36 UTC
19 points
11 comments2 min readLW link

[ASoT] Nat­u­ral ab­strac­tions and AlphaZero

Ulisse Mini10 Dec 2022 17:53 UTC
33 points
1 comment1 min readLW link
(arxiv.org)

An ex­plo­ra­tion of GPT-2′s em­bed­ding weights

Adam Scherlis13 Dec 2022 0:46 UTC
41 points
4 comments10 min readLW link

[In­terim re­search re­port] Tak­ing fea­tures out of su­per­po­si­tion with sparse autoencoders

13 Dec 2022 15:41 UTC
136 points
22 comments22 min readLW link2 reviews

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

14 Dec 2022 14:33 UTC
29 points
5 comments11 min readLW link

How “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion” Fits Into a Broader Align­ment Scheme

Collin15 Dec 2022 18:22 UTC
243 points
39 comments16 min readLW link1 review

Paper: Trans­form­ers learn in-con­text by gra­di­ent descent

LawrenceC16 Dec 2022 11:10 UTC
28 points
11 comments2 min readLW link
(arxiv.org)

Can we effi­ciently ex­plain model be­hav­iors?

paulfchristiano16 Dec 2022 19:40 UTC
64 points
3 comments9 min readLW link
(ai-alignment.com)

Why mechanis­tic in­ter­pretabil­ity does not and can­not con­tribute to long-term AGI safety (from mes­sages with a friend)

Remmelt19 Dec 2022 12:02 UTC
−3 points
9 comments31 min readLW link

A Com­pre­hen­sive Mechanis­tic In­ter­pretabil­ity Ex­plainer & Glossary

Neel Nanda21 Dec 2022 12:35 UTC
82 points
6 comments2 min readLW link
(neelnanda.io)

Some Notes on the math­e­mat­ics of Toy Au­toen­cod­ing Problems

Spencer Becker-Kahn22 Dec 2022 17:21 UTC
14 points
0 comments12 min readLW link

Durkon, an open-source tool for In­her­ently In­ter­pretable Modelling

abstractapplic24 Dec 2022 1:49 UTC
29 points
0 comments4 min readLW link

Con­crete Steps to Get Started in Trans­former Mechanis­tic Interpretability

Neel Nanda25 Dec 2022 22:21 UTC
54 points
7 comments12 min readLW link
(www.neelnanda.io)

Analo­gies be­tween Soft­ware Re­v­erse Eng­ineer­ing and Mechanis­tic Interpretability

26 Dec 2022 12:26 UTC
34 points
6 comments11 min readLW link
(www.neelnanda.io)

Can we effi­ciently dis­t­in­guish differ­ent mechanisms?

paulfchristiano27 Dec 2022 0:20 UTC
88 points
30 comments16 min readLW link
(ai-alignment.com)

200 Con­crete Open Prob­lems in Mechanis­tic In­ter­pretabil­ity: Introduction

Neel Nanda28 Dec 2022 21:06 UTC
101 points
0 comments10 min readLW link

200 COP in MI: The Case for Analysing Toy Lan­guage Models

Neel Nanda28 Dec 2022 21:07 UTC
39 points
3 comments7 min readLW link

In­ter­nal In­ter­faces Are a High-Pri­or­ity In­ter­pretabil­ity Target

Thane Ruthenis29 Dec 2022 17:49 UTC
26 points
6 comments7 min readLW link

200 COP in MI: Look­ing for Cir­cuits in the Wild

Neel Nanda29 Dec 2022 20:59 UTC
16 points
5 comments13 min readLW link

But is it re­ally in Rome? An in­ves­ti­ga­tion of the ROME model edit­ing technique

jacquesthibs30 Dec 2022 2:40 UTC
102 points
1 comment18 min readLW link

[Question] Are Mix­ture-of-Ex­perts Trans­form­ers More In­ter­pretable Than Dense Trans­form­ers?

simeon_c31 Dec 2022 11:34 UTC
7 points
5 comments1 min readLW link

200 COP in MI: In­ter­pret­ing Al­gorith­mic Problems

Neel Nanda31 Dec 2022 19:55 UTC
33 points
2 comments10 min readLW link

In­duc­tion heads—illustrated

CallumMcDougall2 Jan 2023 15:35 UTC
90 points
8 comments3 min readLW link

On the Im­por­tance of Open Sourc­ing Re­ward Models

elandgre2 Jan 2023 19:01 UTC
17 points
5 comments6 min readLW link

200 COP in MI: Ex­plor­ing Poly­se­man­tic­ity and Superposition

Neel Nanda3 Jan 2023 1:52 UTC
33 points
6 comments16 min readLW link

Ba­sic Facts about Lan­guage Model Internals

4 Jan 2023 13:01 UTC
130 points
18 comments9 min readLW link

200 COP in MI: Analysing Train­ing Dynamics

Neel Nanda4 Jan 2023 16:08 UTC
16 points
0 comments14 min readLW link

Paper: Su­per­po­si­tion, Me­moriza­tion, and Dou­ble Des­cent (An­thropic)

LawrenceC5 Jan 2023 17:54 UTC
53 points
11 comments1 min readLW link
(transformer-circuits.pub)

200 COP in MI: Tech­niques, Tool­ing and Automation

Neel Nanda6 Jan 2023 15:08 UTC
13 points
0 comments15 min readLW link

AI psy­chol­ogy should ground the the­o­ries of AI con­scious­ness and in­form hu­man-AI eth­i­cal in­ter­ac­tion design

Roman Leventov8 Jan 2023 6:37 UTC
19 points
8 comments2 min readLW link

200 COP in MI: Image Model Interpretability

Neel Nanda8 Jan 2023 14:53 UTC
18 points
3 comments6 min readLW link

Try­ing to iso­late ob­jec­tives: ap­proaches to­ward high-level interpretability

Jozdien9 Jan 2023 18:33 UTC
48 points
14 comments8 min readLW link

200 COP in MI: In­ter­pret­ing Re­in­force­ment Learning

Neel Nanda10 Jan 2023 17:37 UTC
23 points
1 comment10 min readLW link

The AI Con­trol Prob­lem in a wider in­tel­lec­tual context

philosophybear13 Jan 2023 0:28 UTC
11 points
3 comments12 min readLW link

How does GPT-3 spend its 175B pa­ram­e­ters?

Robert_AIZI13 Jan 2023 19:21 UTC
40 points
13 comments6 min readLW link
(aizi.substack.com)

World-Model In­ter­pretabil­ity Is All We Need

Thane Ruthenis14 Jan 2023 19:37 UTC
29 points
22 comments21 min readLW link

Reflec­tions on Trust­ing Trust & AI

Itay Yona16 Jan 2023 6:36 UTC
10 points
1 comment3 min readLW link
(mentaleap.ai)

Neu­ral net­works gen­er­al­ize be­cause of this one weird trick

Jesse Hoogland18 Jan 2023 0:10 UTC
157 points
26 comments53 min readLW link
(www.jessehoogland.com)

[Question] Trans­former Mech In­terp: Any vi­su­al­iza­tions?

Joyee Chen18 Jan 2023 4:32 UTC
3 points
0 comments1 min readLW link

200 COP in MI: Study­ing Learned Fea­tures in Lan­guage Models

Neel Nanda19 Jan 2023 3:48 UTC
24 points
2 comments30 min readLW link

Large lan­guage mod­els learn to rep­re­sent the world

gjm22 Jan 2023 13:10 UTC
102 points
19 comments3 min readLW link

De­con­fus­ing “Ca­pa­bil­ities vs. Align­ment”

RobertM23 Jan 2023 4:46 UTC
27 points
7 comments2 min readLW link

List of links: For­mal Meth­ods, Embed­ded Agency, 3d world mod­els, and some tools

the gears to ascension23 Jan 2023 18:21 UTC
16 points
1 comment17 min readLW link

How-to Trans­former Mechanis­tic In­ter­pretabil­ity—in 50 lines of code or less!

StefanHex24 Jan 2023 18:45 UTC
47 points
5 comments13 min readLW link

[RFC] Pos­si­ble ways to ex­pand on “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion”.

25 Jan 2023 19:03 UTC
47 points
6 comments12 min readLW link

Spooky ac­tion at a dis­tance in the loss landscape

28 Jan 2023 0:22 UTC
61 points
4 comments7 min readLW link
(www.jessehoogland.com)

Mechanis­tic In­ter­pretabil­ity Quick­start Guide

Neel Nanda31 Jan 2023 16:35 UTC
42 points
3 comments6 min readLW link
(www.neelnanda.io)

No Really, At­ten­tion is ALL You Need—At­ten­tion can do feed­for­ward networks

Robert_AIZI31 Jan 2023 18:48 UTC
29 points
7 comments6 min readLW link
(aizi.substack.com)

More find­ings on Me­moriza­tion and dou­ble descent

Marius Hobbhahn1 Feb 2023 18:26 UTC
53 points
2 comments19 min readLW link

More find­ings on max­i­mal data dimension

Marius Hobbhahn2 Feb 2023 18:33 UTC
27 points
1 comment11 min readLW link

ChatGPT: Tan­tal­iz­ing af­terthoughts in search of story tra­jec­to­ries [in­duc­tion heads]

Bill Benzon3 Feb 2023 10:35 UTC
4 points
0 comments20 min readLW link

AXRP Epi­sode 19 - Mechanis­tic In­ter­pretabil­ity with Neel Nanda

DanielFilan4 Feb 2023 3:00 UTC
44 points
0 comments117 min readLW link

Mech In­terp Pro­ject Ad­vis­ing Call: Me­mori­sa­tion in GPT-2 Small

Neel Nanda4 Feb 2023 14:17 UTC
7 points
0 comments1 min readLW link

Some mis­cel­la­neous thoughts on ChatGPT, sto­ries, and me­chan­i­cal interpretability

Bill Benzon4 Feb 2023 19:35 UTC
2 points
0 comments3 min readLW link

SolidGoldMag­ikarp (plus, prompt gen­er­a­tion)

5 Feb 2023 22:02 UTC
661 points
204 comments12 min readLW link

Gra­di­ent sur­fing: the hid­den role of regularization

Jesse Hoogland6 Feb 2023 3:50 UTC
33 points
6 comments14 min readLW link
(www.jessehoogland.com)

De­ci­sion Trans­former Interpretability

6 Feb 2023 7:29 UTC
83 points
13 comments24 min readLW link

Ad­den­dum: More Effi­cient FFNs via Attention

Robert_AIZI6 Feb 2023 18:55 UTC
10 points
2 comments5 min readLW link
(aizi.substack.com)

[ASoT] Policy Tra­jec­tory Visualization

Ulisse Mini7 Feb 2023 0:13 UTC
9 points
2 comments1 min readLW link

Re­view of AI Align­ment Progress

PeterMcCluskey7 Feb 2023 18:57 UTC
72 points
32 comments7 min readLW link
(bayesianinvestor.com)

A multi-dis­ci­plinary view on AI safety research

Roman Leventov8 Feb 2023 16:50 UTC
43 points
4 comments26 min readLW link

On Devel­op­ing a Math­e­mat­i­cal The­ory of In­ter­pretabil­ity

Spencer Becker-Kahn9 Feb 2023 1:45 UTC
63 points
8 comments6 min readLW link

The Eng­ineer’s In­ter­pretabil­ity Se­quence (EIS) I: Intro

scasper9 Feb 2023 16:28 UTC
45 points
24 comments3 min readLW link

EIS II: What is “In­ter­pretabil­ity”?

scasper9 Feb 2023 16:48 UTC
27 points
6 comments4 min readLW link

We Found An Neu­ron in GPT-2

11 Feb 2023 18:27 UTC
141 points
22 comments7 min readLW link
(clementneo.com)

The con­cep­tual Dop­pelgänger problem

TsviBT12 Feb 2023 17:23 UTC
12 points
5 comments4 min readLW link

Idea: Net­work mod­u­lar­ity and in­ter­pretabil­ity by sex­ual reproduction

qbolec12 Feb 2023 23:06 UTC
3 points
3 comments1 min readLW link

LLM Ba­sics: Embed­ding Spaces—Trans­former To­ken Vec­tors Are Not Points in Space

NickyP13 Feb 2023 18:52 UTC
70 points
11 comments15 min readLW link

Ex­plain­ing SolidGoldMag­ikarp by look­ing at it from ran­dom directions

Robert_AIZI14 Feb 2023 14:54 UTC
8 points
0 comments8 min readLW link
(aizi.substack.com)

EIS III: Broad Cri­tiques of In­ter­pretabil­ity Research

scasper14 Feb 2023 18:24 UTC
18 points
2 comments11 min readLW link

EIS IV: A Spotlight on Fea­ture At­tri­bu­tion/​Saliency

scasper15 Feb 2023 18:46 UTC
18 points
1 comment4 min readLW link

EIS V: Blind Spots In AI Safety In­ter­pretabil­ity Research

scasper16 Feb 2023 19:09 UTC
54 points
23 comments13 min readLW link

EIS VI: Cri­tiques of Mechanis­tic In­ter­pretabil­ity Work in AI Safety

scasper17 Feb 2023 20:48 UTC
48 points
9 comments12 min readLW link

EIS VII: A Challenge for Mechanists

scasper18 Feb 2023 18:27 UTC
34 points
4 comments3 min readLW link

EIS VIII: An Eng­ineer’s Un­der­stand­ing of De­cep­tive Alignment

scasper19 Feb 2023 15:25 UTC
20 points
5 comments4 min readLW link

EIS IX: In­ter­pretabil­ity and Adversaries

scasper20 Feb 2023 18:25 UTC
30 points
7 comments8 min readLW link

A cir­cuit for Python doc­strings in a 4-layer at­ten­tion-only transformer

20 Feb 2023 19:35 UTC
91 points
6 comments21 min readLW link

Ba­sic facts about lan­guage mod­els dur­ing training

beren21 Feb 2023 11:46 UTC
96 points
14 comments18 min readLW link

EIS X: Con­tinual Learn­ing, Mo­du­lar­ity, Com­pres­sion, and Biolog­i­cal Brains

scasper21 Feb 2023 16:59 UTC
14 points
4 comments3 min readLW link

The shal­low re­al­ity of ‘deep learn­ing the­ory’

Jesse Hoogland22 Feb 2023 4:16 UTC
34 points
11 comments3 min readLW link
(www.jessehoogland.com)

In­ter­ven­ing in the Resi­d­ual Stream

MadHatter22 Feb 2023 6:29 UTC
30 points
1 comment9 min readLW link

EIS XI: Mov­ing Forward

scasper22 Feb 2023 19:05 UTC
19 points
2 comments9 min readLW link

Video/​an­i­ma­tion: Neel Nanda ex­plains what mechanis­tic in­ter­pretabil­ity is

DanielFilan22 Feb 2023 22:42 UTC
24 points
7 comments1 min readLW link
(youtu.be)

EIS XII: Sum­mary

scasper23 Feb 2023 17:45 UTC
17 points
0 comments6 min readLW link

Search­ing for a model’s con­cepts by their shape – a the­o­ret­i­cal framework

23 Feb 2023 20:14 UTC
50 points
0 comments19 min readLW link

In­ter­pret­ing Embed­ding Spaces by Conceptualization

Adi Simhi28 Feb 2023 18:38 UTC
3 points
0 comments1 min readLW link
(arxiv.org)

In­side the mind of a su­per­hu­man Go model: How does Leela Zero read lad­ders?

Haoxing Du1 Mar 2023 1:47 UTC
146 points
8 comments30 min readLW link

My cur­rent think­ing about ChatGPT @3QD [Gär­den­fors, Wolfram, and the value of spec­u­la­tion]

Bill Benzon1 Mar 2023 10:50 UTC
2 points
0 comments5 min readLW link

ChatGPT tells sto­ries, and a note about re­verse en­g­ineer­ing: A Work­ing Paper

Bill Benzon3 Mar 2023 15:12 UTC
3 points
0 comments3 min readLW link

In­tro­duc­ing Leap Labs, an AI in­ter­pretabil­ity startup

Jessica Rumbelow6 Mar 2023 16:16 UTC
99 points
11 comments1 min readLW link

Ad­den­dum: ba­sic facts about lan­guage mod­els dur­ing training

beren6 Mar 2023 19:24 UTC
22 points
2 comments5 min readLW link

Against LLM Reductionism

Erich_Grunewald8 Mar 2023 15:52 UTC
137 points
16 comments18 min readLW link
(www.erichgrunewald.com)

The Translu­cent Thoughts Hy­pothe­ses and Their Implications

Fabien Roger9 Mar 2023 16:30 UTC
126 points
6 comments19 min readLW link

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

11 Mar 2023 18:59 UTC
312 points
22 comments23 min readLW link

Paper Repli­ca­tion Walk­through: Re­v­erse-Eng­ineer­ing Mo­du­lar Addition

Neel Nanda12 Mar 2023 13:25 UTC
18 points
0 comments1 min readLW link
(neelnanda.io)

Creat­ing a Dis­cord server for Mechanis­tic In­ter­pretabil­ity Projects

Victor Levoso12 Mar 2023 18:00 UTC
30 points
6 comments2 min readLW link

At­tri­bu­tion Patch­ing: Ac­ti­va­tion Patch­ing At In­dus­trial Scale

Neel Nanda16 Mar 2023 21:44 UTC
43 points
10 comments58 min readLW link
(www.neelnanda.io)

Em­piri­cal risk min­i­miza­tion is fun­da­men­tally confused

Jesse Hoogland22 Mar 2023 16:58 UTC
32 points
5 comments1 min readLW link

How Do In­duc­tion Heads Ac­tu­ally Work in Trans­form­ers With Finite Ca­pac­ity?

Fabien Roger23 Mar 2023 9:09 UTC
27 points
0 comments5 min readLW link

Wittgen­stein and ML — pa­ram­e­ters vs architecture

Cleo Nardo24 Mar 2023 4:54 UTC
37 points
8 comments5 min readLW link

Sen­tience in Machines—How Do We Test for This Ob­jec­tively?

Mayowa Osibodu26 Mar 2023 18:56 UTC
−2 points
0 comments2 min readLW link
(www.researchgate.net)

LLM Mo­du­lar­ity: The Separa­bil­ity of Ca­pa­bil­ities in Large Lan­guage Models

NickyP26 Mar 2023 21:57 UTC
97 points
3 comments41 min readLW link

Prac­ti­cal Pit­falls of Causal Scrubbing

27 Mar 2023 7:47 UTC
87 points
17 comments13 min readLW link

Some com­mon con­fu­sion about in­duc­tion heads

Alexandre Variengien28 Mar 2023 21:51 UTC
42 points
4 comments5 min readLW link

Spread­sheet for 200 Con­crete Prob­lems In Interpretability

Jay Bailey29 Mar 2023 6:51 UTC
12 points
0 comments1 min readLW link

Ac­tu­ally, Othello-GPT Has A Lin­ear Emer­gent World Representation

Neel Nanda29 Mar 2023 22:13 UTC
210 points
24 comments19 min readLW link
(neelnanda.io)

Othello-GPT: Fu­ture Work I Am Ex­cited About

Neel Nanda29 Mar 2023 22:13 UTC
48 points
2 comments33 min readLW link
(neelnanda.io)

Othello-GPT: Reflec­tions on the Re­search Process

Neel Nanda29 Mar 2023 22:13 UTC
35 points
0 comments15 min readLW link
(neelnanda.io)

The Quan­ti­za­tion Model of Neu­ral Scaling

nz31 Mar 2023 16:02 UTC
17 points
0 comments1 min readLW link
(arxiv.org)

Maze-solv­ing agents: Add a top-right vec­tor, make the agent go to the top-right

31 Mar 2023 19:20 UTC
101 points
17 comments11 min readLW link

AISC 2023, Progress Re­port for March: Team In­ter­pretable Architectures

2 Apr 2023 16:19 UTC
14 points
0 comments14 min readLW link

Ex­plo­ra­tory Anal­y­sis of RLHF Trans­form­ers with TransformerLens

Curt Tigges3 Apr 2023 16:09 UTC
21 points
2 comments11 min readLW link
(blog.eleuther.ai)

If in­ter­pretabil­ity re­search goes well, it may get dangerous

So8res3 Apr 2023 21:48 UTC
197 points
10 comments2 min readLW link

Gi­ant (In)scrutable Ma­tri­ces: (Maybe) the Best of All Pos­si­ble Worlds

1a3orn4 Apr 2023 17:39 UTC
191 points
35 comments5 min readLW link

Univer­sal­ity and Hid­den In­for­ma­tion in Con­cept Bot­tle­neck Models

Hoagy5 Apr 2023 14:00 UTC
23 points
0 comments11 min readLW link

Bing AI Gen­er­at­ing Voyn­ich Manuscript Con­tinu­a­tions—It does not know how it knows

Matthew_Opitz10 Apr 2023 20:22 UTC
15 points
6 comments13 min readLW link

No con­vinc­ing ev­i­dence for gra­di­ent de­scent in ac­ti­va­tion space

Blaine12 Apr 2023 4:48 UTC
76 points
8 comments20 min readLW link

Towards a solu­tion to the al­ign­ment prob­lem via ob­jec­tive de­tec­tion and eval­u­a­tion

Paul Colognese12 Apr 2023 15:39 UTC
9 points
7 comments12 min readLW link

Iden­ti­fy­ing se­man­tic neu­rons, mechanis­tic cir­cuits & in­ter­pretabil­ity web apps

13 Apr 2023 11:59 UTC
18 points
0 comments8 min readLW link

Shap­ley Value At­tri­bu­tion in Chain of Thought

leogao14 Apr 2023 5:56 UTC
101 points
5 comments4 min readLW link

Re­search Re­port: In­cor­rect­ness Cascades

Robert_AIZI14 Apr 2023 12:49 UTC
19 points
0 comments10 min readLW link
(aizi.substack.com)

Smar­tyHead­erCode: anoma­lous to­kens for GPT3.5 and GPT-4

AdamYedidia15 Apr 2023 22:35 UTC
71 points
18 comments6 min readLW link

Mechanis­ti­cally in­ter­pret­ing time in GPT-2 small

16 Apr 2023 17:57 UTC
68 points
6 comments21 min readLW link

Ap­prox­i­ma­tion is ex­pen­sive, but the lunch is cheap

19 Apr 2023 14:19 UTC
68 points
3 comments16 min readLW link

Lan­guage Models are a Po­ten­tially Safe Path to Hu­man-Level AGI

Nadav Brandes20 Apr 2023 0:40 UTC
28 points
6 comments8 min readLW link

An in­tro­duc­tion to lan­guage model interpretability

Alexandre Variengien20 Apr 2023 22:22 UTC
14 points
0 comments9 min readLW link

Be­havi­oural statis­tics for a maze-solv­ing agent

20 Apr 2023 22:26 UTC
44 points
11 comments10 min readLW link

Should we pub­lish mechanis­tic in­ter­pretabil­ity re­search?

21 Apr 2023 16:19 UTC
105 points
40 comments13 min readLW link

Neu­ral net­work poly­topes (Co­lab note­book)

Zach Furman21 Apr 2023 22:42 UTC
11 points
0 comments1 min readLW link
(colab.research.google.com)

Deep learn­ing mod­els might be se­cretly (al­most) linear

beren24 Apr 2023 18:43 UTC
110 points
28 comments4 min readLW link

Ex­plain­ing the Trans­former Cir­cuits Frame­work by Example

Felix Hofstätter25 Apr 2023 13:45 UTC
8 points
0 comments15 min readLW link

I was Wrong, Si­mu­la­tor The­ory is Real

Robert_AIZI26 Apr 2023 17:45 UTC
75 points
7 comments3 min readLW link
(aizi.substack.com)

Dropout can cre­ate a priv­ileged ba­sis in the ReLU out­put model.

lsgos28 Apr 2023 1:59 UTC
24 points
3 comments5 min readLW link

PhD Po­si­tion: AI In­ter­pretabil­ity in Ber­lin, Germany

Tiberius28 Apr 2023 13:44 UTC
3 points
0 comments1 min readLW link
(stephanw.net)

A small up­date to the Sparse Cod­ing in­terim re­search report

30 Apr 2023 19:54 UTC
61 points
5 comments1 min readLW link

AXRP Epi­sode 21 - In­ter­pretabil­ity for Eng­ineers with Stephen Casper

DanielFilan2 May 2023 0:50 UTC
12 points
1 comment66 min readLW link

Find­ing Neu­rons in a Haystack: Case Stud­ies with Sparse Probing

3 May 2023 13:30 UTC
30 points
5 comments2 min readLW link
(arxiv.org)

[Linkpost]Trans­former-Based LM Sur­prisal Pre­dicts Hu­man Read­ing Times Best with About Two Billion Train­ing Tokens

Curtis Huebner4 May 2023 17:16 UTC
10 points
1 comment1 min readLW link
(arxiv.org)

Resi­d­ual stream norms grow ex­po­nen­tially over the for­ward pass

7 May 2023 0:46 UTC
72 points
24 comments11 min readLW link

Un­der­stand­ing mesa-op­ti­miza­tion us­ing toy models

7 May 2023 17:00 UTC
42 points
2 comments10 min readLW link

A tech­ni­cal note on bil­in­ear lay­ers for interpretability

Lee Sharkey8 May 2023 6:06 UTC
50 points
0 comments1 min readLW link
(arxiv.org)

A Search for More ChatGPT /​ GPT-3.5 /​ GPT-4 “Un­speak­able” Glitch Tokens

Martin Fell9 May 2023 14:36 UTC
22 points
9 comments6 min readLW link

Ex­cit­ing New In­ter­pretabil­ity Paper!

research_prime_space9 May 2023 16:39 UTC
12 points
1 comment1 min readLW link

Lan­guage mod­els can ex­plain neu­rons in lan­guage models

nz9 May 2023 17:29 UTC
23 points
0 comments1 min readLW link
(openai.com)

Solv­ing the Mechanis­tic In­ter­pretabil­ity challenges: EIS VII Challenge 1

9 May 2023 19:41 UTC
119 points
1 comment10 min readLW link

[Question] Have you heard about MIT’s “liquid neu­ral net­works”? What do you think about them?

Ppau9 May 2023 20:16 UTC
35 points
14 comments1 min readLW link

New OpenAI Paper—Lan­guage mod­els can ex­plain neu­rons in lan­guage models

ViktorThink10 May 2023 7:46 UTC
47 points
14 comments1 min readLW link

AGI-Au­to­mated In­ter­pretabil­ity is Suicide

__RicG__10 May 2023 14:20 UTC
23 points
33 comments7 min readLW link

[Question] AI in­ter­pretabil­ity could be harm­ful?

Roman Leventov10 May 2023 20:43 UTC
13 points
2 comments1 min readLW link

In­put Swap Graphs: Dis­cov­er­ing the role of neu­ral net­work com­po­nents at scale

Alexandre Variengien12 May 2023 9:41 UTC
90 points
0 comments33 min readLW link

Steer­ing GPT-2-XL by adding an ac­ti­va­tion vector

13 May 2023 18:42 UTC
416 points
97 comments50 min readLW link

My cur­rent work­flow to study the in­ter­nal mechanisms of LLM

Yulu Pi16 May 2023 15:27 UTC
3 points
0 comments1 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of a GridWorld Agent-Si­mu­la­tor (Part 1 of N)

Joseph Bloom16 May 2023 22:59 UTC
36 points
2 comments16 min readLW link

Gen­der Vec­tors in ROME’s La­tent Space

Xodarap21 May 2023 18:46 UTC
14 points
2 comments3 min readLW link

Ac­ti­va­tion ad­di­tions in a small resi­d­ual network

Garrett Baker22 May 2023 20:28 UTC
22 points
4 comments3 min readLW link

‘Fun­da­men­tal’ vs ‘ap­plied’ mechanis­tic in­ter­pretabil­ity research

Lee Sharkey23 May 2023 18:26 UTC
62 points
6 comments3 min readLW link

[Linkpost] In­ter­pretabil­ity Dreams

DanielFilan24 May 2023 21:08 UTC
39 points
2 comments2 min readLW link
(transformer-circuits.pub)

Solv­ing the Mechanis­tic In­ter­pretabil­ity challenges: EIS VII Challenge 2

25 May 2023 15:37 UTC
71 points
1 comment13 min readLW link

Why and When In­ter­pretabil­ity Work is Dangerous

NicholasKross28 May 2023 0:27 UTC
20 points
7 comments8 min readLW link
(www.thinkingmuchbetter.com)

The king token

p.b.28 May 2023 19:18 UTC
17 points
0 comments4 min readLW link

Align­ing an H-JEPA agent via train­ing on the out­puts of an LLM-based “ex­em­plary ac­tor”

Roman Leventov29 May 2023 11:08 UTC
12 points
10 comments30 min readLW link

An­nounc­ing Apollo Research

30 May 2023 16:17 UTC
215 points
11 comments8 min readLW link

Con­trast Pairs Drive the Em­piri­cal Perfor­mance of Con­trast Con­sis­tent Search (CCS)

Scott Emmons31 May 2023 17:09 UTC
97 points
0 comments6 min readLW link

Short Re­mark on the (sub­jec­tive) math­e­mat­i­cal ‘nat­u­ral­ness’ of the Nanda—Lie­berum ad­di­tion mod­ulo 113 algorithm

Spencer Becker-Kahn1 Jun 2023 11:31 UTC
104 points
12 comments2 min readLW link

A com­par­i­son of causal scrub­bing, causal ab­strac­tions, and re­lated methods

8 Jun 2023 23:40 UTC
72 points
3 comments22 min readLW link

Ex­plor­ing Con­cept-Spe­cific Slices in Weight Ma­tri­ces for Net­work Interpretability

DuncanFowler9 Jun 2023 16:39 UTC
1 point
0 comments6 min readLW link

In­fer­ence-Time In­ter­ven­tion: Elic­it­ing Truth­ful An­swers from a Lan­guage Model

likenneth11 Jun 2023 5:38 UTC
195 points
4 comments1 min readLW link
(arxiv.org)

[Linkpost] Rosetta Neu­rons: Min­ing the Com­mon Units in a Model Zoo

Bogdan Ionut Cirstea17 Jun 2023 16:38 UTC
12 points
0 comments1 min readLW link

[Re­search Up­date] Sparse Au­toen­coder fea­tures are bimodal

Robert_AIZI22 Jun 2023 13:15 UTC
23 points
1 comment5 min readLW link
(aizi.substack.com)

(ten­ta­tively) Found 600+ Monose­man­tic Fea­tures in a Small LM Us­ing Sparse Autoencoders

Logan Riggs5 Jul 2023 16:49 UTC
57 points
1 comment7 min readLW link

The risk-re­ward trade­off of in­ter­pretabil­ity research

5 Jul 2023 17:05 UTC
15 points
1 comment6 min readLW link

Lo­cal­iz­ing goal mis­gen­er­al­iza­tion in a maze-solv­ing policy network

jan betley6 Jul 2023 16:21 UTC
37 points
2 comments7 min readLW link

In­ter­pret­ing Mo­du­lar Ad­di­tion in MLPs

Bart Bussmann7 Jul 2023 9:22 UTC
15 points
0 comments6 min readLW link

LLM mis­al­ign­ment can prob­a­bly be found with­out man­ual prompt engineering

ProgramCrafter8 Jul 2023 14:35 UTC
1 point
0 comments1 min readLW link

Really Strong Fea­tures Found in Resi­d­ual Stream

Logan Riggs8 Jul 2023 19:40 UTC
67 points
6 comments2 min readLW link

fMRI LIKE APPROACH TO AI ALIGNMENT /​ DECEPTIVE BEHAVIOUR

Escaque 6611 Jul 2023 17:17 UTC
−1 points
3 comments2 min readLW link

Towards Devel­op­men­tal Interpretability

12 Jul 2023 19:33 UTC
171 points
8 comments9 min readLW link

Ac­ti­va­tion adding ex­per­i­ments with llama-7b

Nina Rimsky16 Jul 2023 4:17 UTC
49 points
1 comment3 min readLW link

Mech In­terp Puz­zle 1: Sus­pi­ciously Similar Embed­dings in GPT-Neo

Neel Nanda16 Jul 2023 22:02 UTC
65 points
15 comments1 min readLW link

Au­toIn­ter­pre­ta­tion Finds Sparse Cod­ing Beats Alternatives

Hoagy17 Jul 2023 1:41 UTC
53 points
1 comment7 min readLW link

Still no Lie De­tec­tor for LLMs

18 Jul 2023 19:56 UTC
47 points
2 comments21 min readLW link

Tiny Mech In­terp Pro­jects: Emer­gent Po­si­tional Embed­dings of Words

Neel Nanda18 Jul 2023 21:24 UTC
48 points
1 comment9 min readLW link

He­donic Loops and Tam­ing RL

beren19 Jul 2023 15:12 UTC
20 points
14 comments9 min readLW link

Desider­ata for an AI

Nathan Helm-Burger19 Jul 2023 16:18 UTC
8 points
0 comments4 min readLW link

Does Cir­cuit Anal­y­sis In­ter­pretabil­ity Scale? Ev­i­dence from Mul­ti­ple Choice Ca­pa­bil­ities in Chinchilla

20 Jul 2023 10:50 UTC
43 points
3 comments2 min readLW link
(arxiv.org)

GPT-2′s po­si­tional em­bed­ding ma­trix is a helix

AdamYedidia21 Jul 2023 4:16 UTC
42 points
18 comments4 min readLW link

[Linkpost] In­ter­pret­ing Mul­ti­modal Video Trans­form­ers Us­ing Brain Recordings

Bogdan Ionut Cirstea21 Jul 2023 11:26 UTC
5 points
0 comments1 min readLW link

Train­ing Pro­cess Trans­parency through Gra­di­ent In­ter­pretabil­ity: Early ex­per­i­ments on toy lan­guage models

21 Jul 2023 14:52 UTC
56 points
1 comment1 min readLW link

Open prob­lems in ac­ti­va­tion engineering

24 Jul 2023 19:46 UTC
43 points
2 comments1 min readLW link
(coda.io)

Neu­ron­pe­dia—AI Safety Game

Johnny Lin26 Jul 2023 16:29 UTC
135 points
51 comments2 min readLW link
(neuronpedia.org)

AXRP Epi­sode 23 - Mechanis­tic Ano­maly De­tec­tion with Mark Xu

DanielFilan27 Jul 2023 1:50 UTC
22 points
0 comments72 min readLW link

A Short Memo on AI In­ter­pretabil­ity Rain­bows

scasper27 Jul 2023 23:05 UTC
18 points
0 comments2 min readLW link

Mech In­terp Puz­zle 2: Word2Vec Style Embeddings

Neel Nanda28 Jul 2023 0:50 UTC
40 points
4 comments2 min readLW link

Visi­ble loss land­scape bas­ins don’t cor­re­spond to dis­tinct algorithms

Mikhail Samin28 Jul 2023 16:19 UTC
65 points
13 comments4 min readLW link

AI Safety 101 : In­tro­duc­tion to Vi­sion Interpretability

28 Jul 2023 17:32 UTC
41 points
0 comments1 min readLW link
(github.com)

Be­cause of Lay­erNorm, Direc­tions in GPT-2 MLP Lay­ers are Monosemantic

ojorgensen28 Jul 2023 19:43 UTC
12 points
3 comments13 min readLW link

Thoughts about the Mechanis­tic In­ter­pretabil­ity Challenge #2 (EIS VII #2)

RGRGRG28 Jul 2023 20:44 UTC
23 points
5 comments20 min readLW link

Apollo Re­search is hiring evals and in­ter­pretabil­ity en­g­ineers & scientists

Marius Hobbhahn4 Aug 2023 10:54 UTC
25 points
0 comments2 min readLW link

[Linkpost] Mul­ti­modal Neu­rons in Pre­trained Text-Only Transformers

Bogdan Ionut Cirstea4 Aug 2023 15:29 UTC
11 points
0 comments1 min readLW link

Ground-Truth La­bel Im­bal­ance Im­pairs the Perfor­mance of Con­trast-Con­sis­tent Search (and Other Con­trast-Pair-Based Un­su­per­vised Meth­ods)

5 Aug 2023 17:55 UTC
6 points
2 comments7 min readLW link
(drive.google.com)

Grow­ing Bon­sai Net­works with RNNs

ameo7 Aug 2023 17:34 UTC
21 points
5 comments1 min readLW link
(cprimozic.net)

An in­ter­ac­tive in­tro­duc­tion to grokking and mechanis­tic interpretability

7 Aug 2023 19:09 UTC
23 points
3 comments1 min readLW link
(pair.withgoogle.com)

Mech In­terp Challenge: Au­gust—De­ci­pher­ing the First Unique Char­ac­ter Model

CallumMcDougall9 Aug 2023 19:14 UTC
34 points
1 comment3 min readLW link

The po­si­tional em­bed­ding ma­trix and pre­vi­ous-to­ken heads: how do they ac­tu­ally work?

AdamYedidia10 Aug 2023 1:58 UTC
26 points
4 comments13 min readLW link

De­com­pos­ing in­de­pen­dent gen­er­al­iza­tions in neu­ral net­works via Hes­sian analysis

14 Aug 2023 17:04 UTC
82 points
3 comments1 min readLW link

Un­der­stand­ing the In­for­ma­tion Flow in­side Large Lan­guage Models

15 Aug 2023 21:13 UTC
19 points
0 comments17 min readLW link

Un­der­stand­ing Coun­ter­bal­anced Sub­trac­tions for Bet­ter Ac­ti­va­tion Additions

ojorgensen17 Aug 2023 13:53 UTC
21 points
0 comments14 min readLW link

Against Al­most Every The­ory of Im­pact of Interpretability

Charbel-Raphaël17 Aug 2023 18:44 UTC
313 points
82 comments26 min readLW link

Memetic Judo #3: The In­tel­li­gence of Stochas­tic Par­rots v.2

Max TK20 Aug 2023 15:18 UTC
8 points
33 comments6 min readLW link

Causal­ity and a Cost Se­man­tics for Neu­ral Networks

scottviteri21 Aug 2023 21:02 UTC
22 points
1 comment9 min readLW link

En­hanc­ing Cor­rigi­bil­ity in AI Sys­tems through Ro­bust Feed­back Loops

Justausername24 Aug 2023 3:53 UTC
1 point
0 comments6 min readLW link

[Question] Would it be use­ful to col­lect the con­texts, where var­i­ous LLMs think the same?

Martin Vlach24 Aug 2023 22:01 UTC
6 points
1 comment1 min readLW link

Ap­ply for the 2023 Devel­op­men­tal In­ter­pretabil­ity Con­fer­ence!

25 Aug 2023 7:12 UTC
33 points
0 comments2 min readLW link

An OV-Co­her­ent Toy Model of At­ten­tion Head Su­per­po­si­tion

25 Aug 2023 18:26 UTC
9 points
0 comments6 min readLW link

An In­ter­pretabil­ity Illu­sion for Ac­ti­va­tion Patch­ing of Ar­bi­trary Subspaces

29 Aug 2023 1:04 UTC
74 points
3 comments1 min readLW link

Bar­ri­ers to Mechanis­tic In­ter­pretabil­ity for AGI Safety

Connor Leahy29 Aug 2023 10:56 UTC
68 points
13 comments1 min readLW link
(www.youtube.com)

An OV-Co­her­ent Toy Model of At­ten­tion Head Superposition

29 Aug 2023 19:44 UTC
14 points
0 comments6 min readLW link

Paper Walk­through: Au­to­mated Cir­cuit Dis­cov­ery with Arthur Conmy

Neel Nanda29 Aug 2023 22:07 UTC
36 points
1 comment1 min readLW link
(www.youtube.com)

Open Call for Re­search As­sis­tants in Devel­op­men­tal Interpretability

30 Aug 2023 9:02 UTC
54 points
11 comments4 min readLW link

An ad­ver­sar­ial ex­am­ple for Direct Logit At­tri­bu­tion: mem­ory man­age­ment in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments8 min readLW link
(arxiv.org)

In­ter­pret­ing a ma­trix-val­ued word em­bed­ding with a math­e­mat­i­cally proven char­ac­ter­i­za­tion of all optima

Joseph Van Name4 Sep 2023 16:19 UTC
3 points
4 comments12 min readLW link

Ex­plain­ing grokking through cir­cuit efficiency

8 Sep 2023 14:39 UTC
98 points
8 comments3 min readLW link
(arxiv.org)

Au­to­mat­i­cally find­ing fea­ture vec­tors in the OV cir­cuits of Trans­form­ers with­out us­ing probing

Jacob Dunefsky12 Sep 2023 17:38 UTC
13 points
0 comments29 min readLW link

Seek­ing Feed­back on My Mechanis­tic In­ter­pretabil­ity Re­search Agenda

RGRGRG12 Sep 2023 18:45 UTC
3 points
1 comment3 min readLW link

Ex­pand­ing the Scope of Superposition

Derek Larson13 Sep 2023 17:38 UTC
10 points
0 comments4 min readLW link

Mech In­terp Challenge: Septem­ber—De­ci­pher­ing the Ad­di­tion Model

CallumMcDougall13 Sep 2023 22:23 UTC
35 points
0 comments4 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

14 Sep 2023 1:40 UTC
32 points
7 comments8 min readLW link
(far.ai)

High­lights: Went­worth, Shah, and Mur­phy on “Re­tar­get­ing the Search”

RobertM14 Sep 2023 2:18 UTC
85 points
4 comments8 min readLW link

Three ways in­ter­pretabil­ity could be impactful

Arthur Conmy18 Sep 2023 1:02 UTC
47 points
8 comments4 min readLW link

In­ter­pretabil­ity Ex­ter­nal­ities Case Study—Hun­gry Hun­gry Hippos

Magdalena Wache20 Sep 2023 14:42 UTC
64 points
22 comments2 min readLW link

Sparse Au­toen­coders Find Highly In­ter­pretable Direc­tions in Lan­guage Models

21 Sep 2023 15:30 UTC
154 points
7 comments5 min readLW link

Neel Nanda on the Mechanis­tic In­ter­pretabil­ity Re­searcher Mindset

Michaël Trazzi21 Sep 2023 19:47 UTC
36 points
1 comment3 min readLW link
(theinsideview.ai)

Tak­ing fea­tures out of su­per­po­si­tion with sparse au­toen­coders more quickly with in­formed initialization

Pierre Peigné23 Sep 2023 16:21 UTC
29 points
8 comments5 min readLW link

Sparse Cod­ing, for Mechanis­tic In­ter­pretabil­ity and Ac­ti­va­tion Engineering

David Udell23 Sep 2023 19:16 UTC
42 points
7 comments34 min readLW link

In­ter­pret­ing OpenAI’s Whisper

EllenaR24 Sep 2023 17:53 UTC
112 points
10 comments7 min readLW link

Im­pact sto­ries for model in­ter­nals: an ex­er­cise for in­ter­pretabil­ity researchers

jenny25 Sep 2023 23:15 UTC
29 points
3 comments7 min readLW link

An­nounc­ing the CNN In­ter­pretabil­ity Competition

scasper26 Sep 2023 16:21 UTC
22 points
0 comments4 min readLW link

Mechanis­tic In­ter­pretabil­ity Read­ing group

26 Sep 2023 16:26 UTC
8 points
0 comments1 min readLW link

High-level in­ter­pretabil­ity: de­tect­ing an AI’s objectives

28 Sep 2023 19:30 UTC
69 points
4 comments21 min readLW link

New Tool: the Resi­d­ual Stream Viewer

AdamYedidia1 Oct 2023 0:49 UTC
32 points
7 comments4 min readLW link
(tinyurl.com)

Early Ex­per­i­ments in Re­ward Model In­ter­pre­ta­tion Us­ing Sparse Autoencoders

3 Oct 2023 7:45 UTC
11 points
0 comments5 min readLW link

Mech In­terp Challenge: Oc­to­ber—De­ci­pher­ing the Sorted List Model

CallumMcDougall3 Oct 2023 10:57 UTC
23 points
0 comments3 min readLW link

What would it mean to un­der­stand how a large lan­guage model (LLM) works? Some quick notes.

Bill Benzon3 Oct 2023 15:11 UTC
20 points
4 comments8 min readLW link

Graph­i­cal ten­sor no­ta­tion for interpretability

Jordan Taylor4 Oct 2023 8:04 UTC
132 points
8 comments19 min readLW link

En­tan­gle­ment and in­tu­ition about words and mean­ing

Bill Benzon4 Oct 2023 14:16 UTC
4 points
0 comments2 min readLW link

Ideation and Tra­jec­tory Model­ling in Lan­guage Models

NickyP5 Oct 2023 19:21 UTC
15 points
2 comments10 min readLW link

Towards Monose­man­tic­ity: De­com­pos­ing Lan­guage Models With Dic­tionary Learning

Zac Hatfield-Dodds5 Oct 2023 21:01 UTC
281 points
18 comments2 min readLW link
(transformer-circuits.pub)

A per­sonal ex­pla­na­tion of ELK con­cept and task.

Zeyu Qin6 Oct 2023 3:55 UTC
1 point
0 comments1 min readLW link

Com­par­ing An­thropic’s Dic­tionary Learn­ing to Ours

Robert_AIZI7 Oct 2023 23:30 UTC
136 points
8 comments4 min readLW link

Bird-eye view vi­su­al­iza­tion of LLM activations

Sergii8 Oct 2023 12:12 UTC
11 points
2 comments1 min readLW link
(grgv.xyz)

You’re Mea­sur­ing Model Com­plex­ity Wrong

11 Oct 2023 11:46 UTC
82 points
15 comments13 min readLW link

At­tribut­ing to in­ter­ac­tions with GCPD and GWPD

jenny11 Oct 2023 15:06 UTC
20 points
0 comments6 min readLW link

Un­der­stand­ing LLMs: Some ba­sic ob­ser­va­tions about words, syn­tax, and dis­course [w/​ a con­jec­ture about grokking]

Bill Benzon11 Oct 2023 19:13 UTC
5 points
0 comments5 min readLW link

Paper: Un­der­stand­ing and Con­trol­ling a Maze-Solv­ing Policy Network

13 Oct 2023 1:38 UTC
69 points
0 comments1 min readLW link
(arxiv.org)

[Paper] All’s Fair In Love And Love: Copy Sup­pres­sion in GPT-2 Small

13 Oct 2023 18:32 UTC
82 points
4 comments8 min readLW link

ChatGPT tells 20 ver­sions of its pro­to­typ­i­cal story, with a short note on method

Bill Benzon14 Oct 2023 15:27 UTC
6 points
0 comments5 min readLW link

Map­ping ChatGPT’s on­tolog­i­cal land­scape, gra­di­ents and choices [in­ter­pretabil­ity]

Bill Benzon15 Oct 2023 20:12 UTC
1 point
0 comments18 min readLW link

[Question] Does a broad overview of Mechanis­tic In­ter­pretabil­ity ex­ist?

kourabi16 Oct 2023 1:16 UTC
1 point
0 comments1 min readLW link

In­ves­ti­gat­ing the learn­ing co­effi­cient of mod­u­lar ad­di­tion: hackathon project

17 Oct 2023 19:51 UTC
85 points
4 comments12 min readLW link

Fea­tures and Ad­ver­saries in MemoryDT

20 Oct 2023 7:32 UTC
30 points
6 comments25 min readLW link

Re­veal­ing In­ten­tion­al­ity In Lan­guage Models Through AdaVAE Guided Sampling

jdp20 Oct 2023 7:32 UTC
117 points
14 comments22 min readLW link

Mechanistic interpretability of LLM analogy-making

Sergii20 Oct 2023 12:53 UTC
2 points
0 comments4 min readLW link
(grgv.xyz)

Internal Target Information for AI Oversight

Paul Colognese20 Oct 2023 14:53 UTC
15 points
0 comments5 min readLW link

[Question] Can we isolate neurons that recognize features vs. those which have some other role?

Joshua Clancy21 Oct 2023 0:30 UTC
4 points
2 comments3 min readLW link

Thoughts On (Solving) Deep Deception

Jozdien21 Oct 2023 22:40 UTC
66 points
2 comments6 min readLW link

Announcing Timaeus

22 Oct 2023 11:59 UTC
186 points
15 comments4 min readLW link

Machine Unlearning Evaluations as Interpretability Benchmarks

23 Oct 2023 16:33 UTC
33 points
2 comments11 min readLW link

z is not the cause of x

hrbigelow23 Oct 2023 17:43 UTC
6 points
2 comments9 min readLW link

Charbel-Raphaël and Lucius discuss Interpretability

30 Oct 2023 5:50 UTC
104 points
7 comments21 min readLW link

Grokking Beyond Neural Networks

Jack Miller30 Oct 2023 17:28 UTC
9 points
0 comments2 min readLW link
(arxiv.org)

Robustness of Contrast-Consistent Search to Adversarial Prompting

1 Nov 2023 12:46 UTC
15 points
1 comment7 min readLW link

Estimating effective dimensionality of MNIST models

Arjun Panickssery2 Nov 2023 14:13 UTC
40 points
3 comments1 min readLW link

Mech Interp Challenge: November—Deciphering the Cumulative Sum Model

CallumMcDougall2 Nov 2023 17:10 UTC
18 points
2 comments2 min readLW link

Growth and Form in a Toy Model of Superposition

8 Nov 2023 11:08 UTC
87 points
5 comments14 min readLW link

What’s going on? LLMs and IS-A sentences

Bill Benzon8 Nov 2023 16:58 UTC
6 points
15 comments4 min readLW link

Polysemantic Attention Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
46 points
0 comments6 min readLW link

AISC Project: Modelling Trajectories of Language Models

NickyP13 Nov 2023 14:33 UTC
25 points
0 comments12 min readLW link

Is Interpretability All We Need?

RogerDearnaley14 Nov 2023 5:31 UTC
1 point
1 comment1 min readLW link

A framing for interpretability

Nina Rimsky14 Nov 2023 16:14 UTC
69 points
5 comments4 min readLW link
(ninarimsky.substack.com)

Incidental polysemanticity

15 Nov 2023 4:00 UTC
41 points
7 comments11 min readLW link

Eliciting Latent Knowledge in Comprehensive AI Services Models

acabodi17 Nov 2023 2:36 UTC
6 points
0 comments5 min readLW link

AISC project: TinyEvals

Jett22 Nov 2023 20:47 UTC
17 points
0 comments4 min readLW link

A day in the life of a mechanistic interpretability researcher

Bill Benzon28 Nov 2023 14:45 UTC
3 points
3 comments1 min readLW link

Intro to Superposition & Sparse Autoencoders (Colab exercises)

CallumMcDougall29 Nov 2023 12:56 UTC
65 points
8 comments3 min readLW link

How useful is mechanistic interpretability?

1 Dec 2023 2:54 UTC
148 points
53 comments25 min readLW link

Mechanistic interpretability through clustering

Alistair Fraser4 Dec 2023 18:49 UTC
1 point
0 comments1 min readLW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper5 Dec 2023 16:48 UTC
106 points
29 comments13 min readLW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley7 Dec 2023 6:14 UTC
3 points
0 comments11 min readLW link

Results from the Turing Seminar hackathon

7 Dec 2023 14:50 UTC
29 points
1 comment6 min readLW link

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

Karolis Ramanauskas8 Dec 2023 13:18 UTC
13 points
1 comment4 min readLW link
(arxiv.org)

Refusal mechanisms: initial experiments with Llama-2-7b-chat

8 Dec 2023 17:08 UTC
78 points
7 comments7 min readLW link

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
66 points
5 comments10 min readLW link

Has anyone experimented with Dodrio, a tool for exploring transformer models through interactive visualization?

Bill Benzon11 Dec 2023 20:34 UTC
4 points
0 comments1 min readLW link

Towards an Ethics Calculator for Use by an AGI

sweenesm12 Dec 2023 18:37 UTC
2 points
2 comments11 min readLW link

Categorical Organization in Memory: ChatGPT Organizes the 665 Topic Tags from My New Savanna Blog

Bill Benzon14 Dec 2023 13:02 UTC
0 points
6 comments2 min readLW link

Mapping the semantic void: Strange goings-on in GPT embedding spaces

mwatkins14 Dec 2023 13:10 UTC
114 points
30 comments14 min readLW link

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”

15 Dec 2023 11:05 UTC
33 points
8 comments10 min readLW link

Interpreting the Learning of Deceit

RogerDearnaley18 Dec 2023 8:12 UTC
30 points
8 comments9 min readLW link

Assessment of AI safety agendas: think about the downside risk

Roman Leventov19 Dec 2023 9:00 UTC
13 points
1 comment1 min readLW link

A Universal Emergent Decomposition of Retrieval Tasks in Language Models

19 Dec 2023 11:52 UTC
81 points
3 comments10 min readLW link
(arxiv.org)

How does a toy 2 digit subtraction transformer predict the sign of the output?

Evan Anders19 Dec 2023 18:56 UTC
14 points
0 comments8 min readLW link
(evanhanders.blog)

How does a toy 2 digit subtraction transformer predict the difference?

Evan Anders22 Dec 2023 21:17 UTC
12 points
0 comments10 min readLW link
(evanhanders.blog)

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

23 Dec 2023 2:44 UTC
106 points
4 comments22 min readLW link

Fact Finding: Simplifying the Circuit (Post 2)

23 Dec 2023 2:45 UTC
18 points
3 comments14 min readLW link

Fact Finding: How to Think About Interpreting Memorisation (Post 4)

23 Dec 2023 2:46 UTC
22 points
0 comments9 min readLW link

Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)

23 Dec 2023 2:46 UTC
18 points
0 comments4 min readLW link

Exploring the Residual Stream of Transformers for Mechanistic Interpretability — Explained

Zeping Yu26 Dec 2023 0:36 UTC
7 points
1 comment11 min readLW link

Mech Interp Challenge: January—Deciphering the Caesar Cipher Model

CallumMcDougall1 Jan 2024 18:03 UTC
17 points
0 comments3 min readLW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley5 Jan 2024 8:46 UTC
35 points
4 comments2 min readLW link

Task vectors & analogy making in LLMs

Sergii8 Jan 2024 15:17 UTC
8 points
1 comment4 min readLW link
(grgv.xyz)

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

14 Jan 2024 2:06 UTC
22 points
0 comments42 min readLW link

Sparse MLP Distillation

slavachalnev15 Jan 2024 19:39 UTC
30 points
3 comments6 min readLW link

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
81 points
5 comments19 min readLW link

How polysemantic can one neuron be? Investigating features in TinyStories.

Evan Anders16 Jan 2024 19:10 UTC
12 points
0 comments8 min readLW link
(evanhanders.blog)

Toward A Mathematical Framework for Computation in Superposition

18 Jan 2024 21:06 UTC
182 points
16 comments73 min readLW link

Questions I’d Want to Ask an AGI+ to Test Its Understanding of Ethics

sweenesm26 Jan 2024 23:40 UTC
14 points
6 comments4 min readLW link

Exploring OpenAI’s Latent Directions: Tests, Observations, and Poking Around

Johnny Lin31 Jan 2024 6:01 UTC
26 points
4 comments14 min readLW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom2 Feb 2024 6:54 UTC
88 points
37 comments15 min readLW link

Attention SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
75 points
4 comments8 min readLW link

AI alignment as a translation problem

Roman Leventov5 Feb 2024 14:14 UTC
21 points
2 comments3 min readLW link

Fluent dreaming for language models (AI interpretability method)

6 Feb 2024 6:02 UTC
32 points
2 comments1 min readLW link
(arxiv.org)

A Chess-GPT Linear Emergent World Representation

karvonenadam8 Feb 2024 4:25 UTC
102 points
11 comments7 min readLW link
(adamkarvonen.github.io)

Useful starting code for interpretability

eggsyntax13 Feb 2024 23:13 UTC
19 points
2 comments1 min readLW link

Addressing Feature Suppression in SAEs

16 Feb 2024 18:32 UTC
72 points
2 comments10 min readLW link

Auto-matching hidden layers in Pytorch LLMs

chanind19 Feb 2024 12:40 UTC
2 points
0 comments3 min readLW link

Difficulty classes for alignment properties

Jozdien20 Feb 2024 9:08 UTC
32 points
5 comments2 min readLW link

Notes on Internal Objectives in Toy Models of Agents

Paul Colognese22 Feb 2024 8:02 UTC
16 points
0 comments8 min readLW link

Do sparse autoencoders find “true features”?

Demian Till22 Feb 2024 18:06 UTC
69 points
32 comments11 min readLW link

The role of philosophical thinking in understanding large language models: Calibrating and closing the gap between first-person experience and underlying mechanisms

Bill Benzon23 Feb 2024 12:19 UTC
4 points
0 comments10 min readLW link

Hidden Cognition Detection Methods and Benchmarks

Paul Colognese26 Feb 2024 5:31 UTC
21 points
11 comments4 min readLW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

27 Feb 2024 2:43 UTC
39 points
16 comments15 min readLW link

Timaeus’s First Four Months

28 Feb 2024 17:01 UTC
166 points
5 comments6 min readLW link

What’s in the box?! – Towards interpretability by distinguishing niches of value within neural networks.

Joshua Clancy29 Feb 2024 18:33 UTC
3 points
4 comments128 min readLW link

Anomalous Concept Detection for Detecting Hidden Cognition

Paul Colognese4 Mar 2024 16:52 UTC
22 points
3 comments10 min readLW link

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI5 Mar 2024 13:55 UTC
52 points
24 comments10 min readLW link
(aizi.substack.com)

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
56 points
0 comments12 min readLW link

Exploring the Evolution and Migration of Different Layer Embedding in LLMs

Ruixuan Huang8 Mar 2024 15:01 UTC
4 points
0 comments8 min readLW link

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
53 points
0 comments14 min readLW link

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems

13 Mar 2024 17:09 UTC
38 points
12 comments14 min readLW link

Sparse autoencoders find composed features in small toy models

14 Mar 2024 18:00 UTC
35 points
7 comments15 min readLW link

Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

15 Mar 2024 16:30 UTC
11 points
5 comments4 min readLW link

AtP*: An efficient and scalable method for localizing LLM behaviour to components

18 Mar 2024 17:28 UTC
19 points
0 comments1 min readLW link
(arxiv.org)

Mechanism for feature learning in neural networks and backpropagation-free machine learning models

Matt Goldenberg19 Mar 2024 14:55 UTC
8 points
1 comment1 min readLW link
(www.science.org)

Stagewise Development in Neural Networks

20 Mar 2024 19:54 UTC
71 points
1 comment11 min readLW link

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill24 Mar 2024 20:05 UTC
31 points
1 comment24 min readLW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

25 Mar 2024 21:17 UTC
73 points
5 comments7 min readLW link

Some findings from training SAEs on Othello-GPT

vlad k27 Mar 2024 9:23 UTC
4 points
0 comments15 min readLW link

Decompiling Tracr Transformers—An interpretability experiment

Hannes Thurnherr27 Mar 2024 9:49 UTC
0 points
0 comments14 min readLW link

Towards White Box Deep Learning

Maciej Satkiewicz27 Mar 2024 18:20 UTC
15 points
3 comments1 min readLW link
(arxiv.org)