
Interpretability (ML & AI)

Last edit: 20 Oct 2022 23:51 UTC by Ruby

Transparency and interpretability refer to the extent to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable. You can use a model's output, but the model cannot tell you why it produced that output. This makes it hard, for example, to determine the cause of biases in ML models.
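A minimal sketch of this opacity, assuming PyTorch and an untrained toy classifier (both are illustrative assumptions, not from any post below): the forward pass returns a prediction with no accompanying rationale, and a crude gradient-saliency map is one common but very limited attempt to interpret it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small, untrained classifier standing in for a real model.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

x = torch.randn(1, 10, requires_grad=True)
logits = model(x)
pred = logits.argmax(dim=-1)
print(pred.item())  # The model gives an answer, but no reason for it.

# Gradient saliency: how sensitive is the chosen logit to each input feature?
logits[0, pred.item()].backward()
saliency = x.grad.abs().squeeze()
print(saliency)  # A crude attribution, far from an explanation of the computation.
```

Much of the work listed on this page is about going beyond such surface-level attributions toward mechanistic accounts of what the model is actually computing.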

Re-Ex­am­in­ing LayerNorm

Eric Winsor1 Dec 2022 22:20 UTC
105 points
8 comments5 min readLW link

In­ter­pretabil­ity in ML: A Broad Overview

lifelonglearner4 Aug 2020 19:03 UTC
52 points
5 comments15 min readLW link

Chris Olah’s views on AGI safety

evhub1 Nov 2019 20:13 UTC
198 points
38 comments12 min readLW link2 reviews

A Longlist of The­o­ries of Im­pact for Interpretability

Neel Nanda11 Mar 2022 14:55 UTC
113 points
29 comments5 min readLW link

How In­ter­pretabil­ity can be Impactful

Connall Garrod18 Jul 2022 0:06 UTC
19 points
0 comments37 min readLW link

[Question] Papers to start get­ting into NLP-fo­cused al­ign­ment research

Feraidoon24 Sep 2022 23:53 UTC
6 points
0 comments1 min readLW link

Search­ing for Search

28 Nov 2022 15:31 UTC
77 points
6 comments14 min readLW link

Trans­parency and AGI safety

jylin0411 Jan 2021 18:51 UTC
52 points
12 comments30 min readLW link

A trans­parency and in­ter­pretabil­ity tech tree

evhub16 Jun 2022 23:44 UTC
140 points
10 comments18 min readLW link

What is In­ter­pretabil­ity?

17 Mar 2020 20:23 UTC
34 points
0 comments11 min readLW link

Opinions on In­ter­pretable Ma­chine Learn­ing and 70 Sum­maries of Re­cent Papers

9 Apr 2021 19:19 UTC
139 points
16 comments102 min readLW link

The Case for Rad­i­cal Op­ti­mism about Interpretability

Quintin Pope16 Dec 2021 23:38 UTC
64 points
16 comments8 min readLW link1 review

Take­aways From 3 Years Work­ing In Ma­chine Learning

George3d68 Apr 2022 17:14 UTC
34 points
10 comments11 min readLW link
(www.epistem.ink)

In­ter­pretabil­ity’s Align­ment-Solv­ing Po­ten­tial: Anal­y­sis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC
51 points
0 comments59 min readLW link

A Mechanis­tic In­ter­pretabil­ity Anal­y­sis of Grokking

15 Aug 2022 2:41 UTC
350 points
44 comments36 min readLW link
(colab.research.google.com)

My ten­ta­tive in­ter­pretabil­ity re­search agenda—topol­ogy match­ing.

Maxwell Clarke8 Oct 2022 22:14 UTC
10 points
2 comments4 min readLW link

The Sin­gu­lar Value De­com­po­si­tions of Trans­former Weight Ma­tri­ces are Highly Interpretable

28 Nov 2022 12:54 UTC
173 points
30 comments31 min readLW link

Us­ing GPT-N to Solve In­ter­pretabil­ity of Neu­ral Net­works: A Re­search Agenda

3 Sep 2020 18:27 UTC
67 points
12 comments2 min readLW link

[Pro­posal] Method of lo­cat­ing use­ful sub­nets in large models

Quintin Pope13 Oct 2021 20:52 UTC
9 points
0 comments2 min readLW link

In­tro­duc­tion to in­ac­cessible information

Ryan Kidd9 Dec 2021 1:28 UTC
27 points
6 comments8 min readLW link

Trans­former Circuits

evhub22 Dec 2021 21:09 UTC
143 points
4 comments3 min readLW link
(transformer-circuits.pub)

Cir­cum­vent­ing in­ter­pretabil­ity: How to defeat mind-readers

Lee Sharkey14 Jul 2022 16:59 UTC
100 points
8 comments33 min readLW link

In­ter­pret­ing Neu­ral Net­works through the Poly­tope Lens

23 Sep 2022 17:58 UTC
128 points
27 comments33 min readLW link

The­o­ries of im­pact for Science of Deep Learning

Marius Hobbhahn1 Dec 2022 14:39 UTC
18 points
0 comments11 min readLW link

Ex­tract­ing and Eval­u­at­ing Causal Direc­tion in LLMs’ Activations

14 Dec 2022 14:33 UTC
28 points
5 comments11 min readLW link

A Com­pre­hen­sive Mechanis­tic In­ter­pretabil­ity Ex­plainer & Glossary

Neel Nanda21 Dec 2022 12:35 UTC
78 points
5 comments2 min readLW link
(neelnanda.io)

200 COP in MI: In­ter­pret­ing Al­gorith­mic Problems

Neel Nanda31 Dec 2022 19:55 UTC
31 points
1 comment8 min readLW link

SolidGoldMag­ikarp (plus, prompt gen­er­a­tion)

5 Feb 2023 22:02 UTC
646 points
194 comments12 min readLW link

An An­a­lytic Per­spec­tive on AI Alignment

DanielFilan1 Mar 2020 4:10 UTC
54 points
45 comments8 min readLW link
(danielfilan.com)

Ver­ifi­ca­tion and Transparency

DanielFilan8 Aug 2019 1:50 UTC
35 points
6 comments2 min readLW link
(danielfilan.com)

Mechanis­tic Trans­parency for Ma­chine Learning

DanielFilan11 Jul 2018 0:34 UTC
54 points
9 comments4 min readLW link

How can In­ter­pretabil­ity help Align­ment?

23 May 2020 16:16 UTC
37 points
3 comments9 min readLW link

One Way to Think About ML Transparency

Matthew Barnett2 Sep 2019 23:27 UTC
26 points
28 comments5 min readLW link

Re­laxed ad­ver­sar­ial train­ing for in­ner alignment

evhub10 Sep 2019 23:03 UTC
63 points
28 comments1 min readLW link

Spar­sity and in­ter­pretabil­ity?

1 Jun 2020 13:25 UTC
41 points
3 comments7 min readLW link

Search ver­sus design

Alex Flint16 Aug 2020 16:53 UTC
97 points
41 comments36 min readLW link1 review

In­ner Align­ment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC
137 points
39 comments11 min readLW link2 reviews

Multi-di­men­sional re­wards for AGI in­ter­pretabil­ity and control

Steven Byrnes4 Jan 2021 3:08 UTC
19 points
8 comments10 min readLW link

MIRI com­ments on Co­tra’s “Case for Align­ing Nar­rowly Su­per­hu­man Models”

Rob Bensinger5 Mar 2021 23:43 UTC
140 points
13 comments26 min readLW link

Trans­parency Trichotomy

Mark Xu28 Mar 2021 20:26 UTC
25 points
2 comments7 min readLW link

Solv­ing the whole AGI con­trol prob­lem, ver­sion 0.0001

Steven Byrnes8 Apr 2021 15:14 UTC
60 points
7 comments26 min readLW link

Knowl­edge Neu­rons in Pre­trained Transformers

evhub17 May 2021 22:54 UTC
100 points
7 comments2 min readLW link
(arxiv.org)

Garrabrant and Shah on hu­man mod­el­ing in AGI

Rob Bensinger4 Aug 2021 4:35 UTC
57 points
10 comments47 min readLW link

Neu­ral net /​ de­ci­sion tree hy­brids: a po­ten­tial path to­ward bridg­ing the in­ter­pretabil­ity gap

Nathan Helm-Burger23 Sep 2021 0:38 UTC
21 points
2 comments12 min readLW link

Let’s buy out Cyc, for use in AGI in­ter­pretabil­ity sys­tems?

Steven Byrnes7 Dec 2021 20:46 UTC
48 points
10 comments2 min readLW link

Solv­ing In­ter­pretabil­ity Week

Logan Riggs13 Dec 2021 17:09 UTC
11 points
5 comments1 min readLW link

My Overview of the AI Align­ment Land­scape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC
121 points
9 comments15 min readLW link

Ques­tion 3: Con­trol pro­pos­als for min­i­miz­ing bad outcomes

Cameron Berg12 Feb 2022 19:13 UTC
5 points
1 comment7 min readLW link

Progress Re­port 1: in­ter­pretabil­ity ex­per­i­ments & learn­ing, test­ing com­pres­sion hypotheses

Nathan Helm-Burger22 Mar 2022 20:12 UTC
11 points
0 comments2 min readLW link

[In­tro to brain-like-AGI safety] 9. Take­aways from neuro 2/​2: On AGI motivation

Steven Byrnes23 Mar 2022 12:48 UTC
38 points
6 comments21 min readLW link

The case for be­com­ing a black-box in­ves­ti­ga­tor of lan­guage models

Buck6 May 2022 14:35 UTC
122 points
19 comments3 min readLW link

Deep Learn­ing Sys­tems Are Not Less In­ter­pretable Than Logic/​Prob­a­bil­ity/​Etc

johnswentworth4 Jun 2022 5:41 UTC
124 points
52 comments2 min readLW link

How Do Selec­tion The­o­rems Re­late To In­ter­pretabil­ity?

johnswentworth9 Jun 2022 19:39 UTC
57 points
14 comments3 min readLW link

Progress Re­port 6: get the tool working

Nathan Helm-Burger10 Jun 2022 11:18 UTC
4 points
0 comments2 min readLW link

[Question] Can you MRI a deep learn­ing model?

Yair Halberstadt13 Jun 2022 13:43 UTC
3 points
3 comments1 min readLW link

Vi­su­al­iz­ing Neu­ral net­works, how to blame the bias

Donald Hobson9 Jul 2022 15:52 UTC
7 points
1 comment6 min readLW link

[Question] How op­ti­mistic should we be about AI figur­ing out how to in­ter­pret it­self?

oh5432125 Jul 2022 22:09 UTC
3 points
1 comment1 min readLW link

Pre­cur­sor check­ing for de­cep­tive alignment

evhub3 Aug 2022 22:56 UTC
22 points
0 comments14 min readLW link

In­ter­pretabil­ity/​Tool-ness/​Align­ment/​Cor­rigi­bil­ity are not Composable

johnswentworth8 Aug 2022 18:05 UTC
115 points
8 comments3 min readLW link

How To Go From In­ter­pretabil­ity To Align­ment: Just Re­tar­get The Search

johnswentworth10 Aug 2022 16:08 UTC
150 points
32 comments3 min readLW link

AI Trans­parency: Why it’s crit­i­cal and how to ob­tain it.

Zohar Jackson14 Aug 2022 10:31 UTC
6 points
1 comment5 min readLW link

What Makes an Idea Un­der­stand­able? On Ar­chi­tec­turally and Cul­turally Nat­u­ral Ideas.

16 Aug 2022 2:09 UTC
21 points
2 comments16 min readLW link

What Makes A Good Mea­sure­ment De­vice?

johnswentworth24 Aug 2022 22:45 UTC
35 points
7 comments2 min readLW link

Tak­ing the pa­ram­e­ters which seem to mat­ter and ro­tat­ing them un­til they don’t

Garrett Baker26 Aug 2022 18:26 UTC
119 points
48 comments1 min readLW link

A rough idea for solv­ing ELK: An ap­proach for train­ing gen­er­al­ist agents like GATO to make plans and de­scribe them to hu­mans clearly and hon­estly.

Michael Soareverix8 Sep 2022 15:20 UTC
2 points
2 comments2 min readLW link

Swap and Scale

Stephen Fowler9 Sep 2022 22:41 UTC
17 points
3 comments1 min readLW link

[Linkpost] A sur­vey on over 300 works about in­ter­pretabil­ity in deep networks

scasper12 Sep 2022 19:07 UTC
96 points
7 comments2 min readLW link
(arxiv.org)

Sparse tri­nary weighted RNNs as a path to bet­ter lan­guage model interpretability

Am8ryllis17 Sep 2022 19:48 UTC
19 points
13 comments3 min readLW link

Toy Models of Superposition

evhub21 Sep 2022 23:48 UTC
64 points
2 comments5 min readLW link
(transformer-circuits.pub)

QAPR 3: in­ter­pretabil­ity-guided train­ing of neu­ral nets

Quintin Pope28 Sep 2022 16:02 UTC
48 points
2 comments10 min readLW link

More Re­cent Progress in the The­ory of Neu­ral Networks

jylin046 Oct 2022 16:57 UTC
80 points
6 comments4 min readLW link

Poly­se­man­tic­ity and Ca­pac­ity in Neu­ral Networks

7 Oct 2022 17:51 UTC
78 points
10 comments3 min readLW link

Ar­ti­cle Re­view: Google’s AlphaTensor

Robert_AIZI12 Oct 2022 18:04 UTC
8 points
2 comments10 min readLW link

[Question] Pre­vi­ous Work on Re­cre­at­ing Neu­ral Net­work In­put from In­ter­me­di­ate Layer Activations

bglass12 Oct 2022 19:28 UTC
1 point
3 comments1 min readLW link

An Ex­tremely Opinionated An­no­tated List of My Favourite Mechanis­tic In­ter­pretabil­ity Papers

Neel Nanda18 Oct 2022 21:08 UTC
70 points
5 comments12 min readLW link
(www.neelnanda.io)

A Bare­bones Guide to Mechanis­tic In­ter­pretabil­ity Prerequisites

Neel Nanda24 Oct 2022 20:45 UTC
63 points
12 comments3 min readLW link
(neelnanda.io)

A Walk­through of A Math­e­mat­i­cal Frame­work for Trans­former Circuits

Neel Nanda25 Oct 2022 20:24 UTC
49 points
5 comments1 min readLW link
(www.youtube.com)

[Book] In­ter­pretable Ma­chine Learn­ing: A Guide for Mak­ing Black Box Models Explainable

Esben Kran31 Oct 2022 11:38 UTC
19 points
1 comment1 min readLW link
(christophm.github.io)

“Cars and Elephants”: a hand­wavy ar­gu­ment/​anal­ogy against mechanis­tic interpretability

David Scott Krueger (formerly: capybaralet)31 Oct 2022 21:26 UTC
47 points
25 comments2 min readLW link

Real-Time Re­search Record­ing: Can a Trans­former Re-Derive Po­si­tional Info?

Neel Nanda1 Nov 2022 23:56 UTC
69 points
14 comments1 min readLW link
(youtu.be)

A Mys­tery About High Di­men­sional Con­cept Encoding

Fabien Roger3 Nov 2022 17:05 UTC
46 points
13 comments7 min readLW link

A Walk­through of In­ter­pretabil­ity in the Wild (w/​ au­thors Kevin Wang, Arthur Conmy & Alexan­dre Variengien)

Neel Nanda7 Nov 2022 22:39 UTC
29 points
15 comments3 min readLW link
(youtu.be)

A Walk­through of In-Con­text Learn­ing and In­duc­tion Heads (w/​ Charles Frye) Part 1 of 2

Neel Nanda22 Nov 2022 17:12 UTC
20 points
0 comments1 min readLW link
(www.youtube.com)

Sub­sets and quo­tients in interpretability

Erik Jenner2 Dec 2022 23:13 UTC
24 points
1 comment7 min readLW link

The Plan − 2022 Update

johnswentworth1 Dec 2022 20:43 UTC
229 points
36 comments8 min readLW link

Find­ing gliders in the game of life

paulfchristiano1 Dec 2022 20:40 UTC
92 points
7 comments16 min readLW link
(ai-alignment.com)

[ASoT] Nat­u­ral ab­strac­tions and AlphaZero

Ulisse Mini10 Dec 2022 17:53 UTC
33 points
1 comment1 min readLW link
(arxiv.org)

Paper: Trans­form­ers learn in-con­text by gra­di­ent descent

LawrenceC16 Dec 2022 11:10 UTC
28 points
11 comments2 min readLW link
(arxiv.org)

Can we effi­ciently ex­plain model be­hav­iors?

paulfchristiano16 Dec 2022 19:40 UTC
64 points
3 comments9 min readLW link
(ai-alignment.com)

Durkon, an open-source tool for In­her­ently In­ter­pretable Modelling

abstractapplic24 Dec 2022 1:49 UTC
28 points
0 comments4 min readLW link

Con­crete Steps to Get Started in Trans­former Mechanis­tic Interpretability

Neel Nanda25 Dec 2022 22:21 UTC
47 points
7 comments12 min readLW link
(www.neelnanda.io)

Analo­gies be­tween Soft­ware Re­v­erse Eng­ineer­ing and Mechanis­tic Interpretability

26 Dec 2022 12:26 UTC
33 points
6 comments11 min readLW link
(www.neelnanda.io)

200 Con­crete Open Prob­lems in Mechanis­tic In­ter­pretabil­ity: Introduction

Neel Nanda28 Dec 2022 21:06 UTC
90 points
0 comments10 min readLW link

200 COP in MI: The Case for Analysing Toy Lan­guage Models

Neel Nanda28 Dec 2022 21:07 UTC
39 points
3 comments7 min readLW link

200 COP in MI: Look­ing for Cir­cuits in the Wild

Neel Nanda29 Dec 2022 20:59 UTC
15 points
5 comments12 min readLW link

200 COP in MI: Ex­plor­ing Poly­se­man­tic­ity and Superposition

Neel Nanda3 Jan 2023 1:52 UTC
17 points
0 comments15 min readLW link

Com­ments on OpenPhil’s In­ter­pretabil­ity RFP

paulfchristiano5 Nov 2021 22:36 UTC
91 points
5 comments7 min readLW link

200 COP in MI: Analysing Train­ing Dynamics

Neel Nanda4 Jan 2023 16:08 UTC
17 points
0 comments13 min readLW link

Paper: Su­per­po­si­tion, Me­moriza­tion, and Dou­ble Des­cent (An­thropic)

LawrenceC5 Jan 2023 17:54 UTC
53 points
11 comments1 min readLW link
(transformer-circuits.pub)

200 COP in MI: Tech­niques, Tool­ing and Automation

Neel Nanda6 Jan 2023 15:08 UTC
12 points
0 comments15 min readLW link

200 COP in MI: Image Model Interpretability

Neel Nanda8 Jan 2023 14:53 UTC
14 points
2 comments6 min readLW link

200 COP in MI: In­ter­pret­ing Re­in­force­ment Learning

Neel Nanda10 Jan 2023 17:37 UTC
21 points
1 comment10 min readLW link

World-Model In­ter­pretabil­ity Is All We Need

Thane Ruthenis14 Jan 2023 19:37 UTC
30 points
17 comments21 min readLW link

How does GPT-3 spend its 175B pa­ram­e­ters?

Robert_AIZI13 Jan 2023 19:21 UTC
34 points
7 comments6 min readLW link
(aizi.substack.com)

200 COP in MI: Study­ing Learned Fea­tures in Lan­guage Models

Neel Nanda19 Jan 2023 3:48 UTC
25 points
2 comments29 min readLW link

[Question] Trans­former Mech In­terp: Any vi­su­al­iza­tions?

Joyee Chen18 Jan 2023 4:32 UTC
3 points
0 comments1 min readLW link

List of links: For­mal Meth­ods, Embed­ded Agency, 3d world mod­els, and some tools

the gears to ascension23 Jan 2023 18:21 UTC
14 points
1 comment17 min readLW link

Mechanis­tic In­ter­pretabil­ity Quick­start Guide

Neel Nanda31 Jan 2023 16:35 UTC
38 points
3 comments6 min readLW link
(www.neelnanda.io)

More find­ings on Me­moriza­tion and dou­ble descent

Marius Hobbhahn1 Feb 2023 18:26 UTC
49 points
2 comments19 min readLW link

More find­ings on max­i­mal data dimension

Marius Hobbhahn2 Feb 2023 18:33 UTC
26 points
1 comment11 min readLW link

AXRP Epi­sode 19 - Mechanis­tic In­ter­pretabil­ity with Neel Nanda

DanielFilan4 Feb 2023 3:00 UTC
42 points
0 comments117 min readLW link

Mech In­terp Pro­ject Ad­vis­ing Call: Me­mori­sa­tion in GPT-2 Small

Neel Nanda4 Feb 2023 14:17 UTC
7 points
0 comments1 min readLW link

[ASoT] Policy Tra­jec­tory Visualization

Ulisse Mini7 Feb 2023 0:13 UTC
9 points
2 comments1 min readLW link

Re­view of AI Align­ment Progress

PeterMcCluskey7 Feb 2023 18:57 UTC
70 points
31 comments7 min readLW link
(bayesianinvestor.com)

On Devel­op­ing a Math­e­mat­i­cal The­ory of In­ter­pretabil­ity

Spencer Becker-Kahn9 Feb 2023 1:45 UTC
54 points
5 comments6 min readLW link

The con­cep­tual Dop­pelgänger problem

TsviBT12 Feb 2023 17:23 UTC
11 points
4 comments4 min readLW link

EIS V: Blind Spots In AI Safety In­ter­pretabil­ity Research

scasper16 Feb 2023 19:09 UTC
41 points
17 comments10 min readLW link

Ba­sic facts about lan­guage mod­els dur­ing training

beren21 Feb 2023 11:46 UTC
84 points
14 comments18 min readLW link

In­ter­ven­ing in the Resi­d­ual Stream

MadHatter22 Feb 2023 6:29 UTC
23 points
1 comment9 min readLW link

Video/​an­i­ma­tion: Neel Nanda ex­plains what mechanis­tic in­ter­pretabil­ity is

DanielFilan22 Feb 2023 22:42 UTC
24 points
7 comments1 min readLW link
(youtu.be)

Un­der­stand­ing and con­trol­ling a maze-solv­ing policy network

11 Mar 2023 18:59 UTC
286 points
15 comments22 min readLW link

Ad­den­dum: ba­sic facts about lan­guage mod­els dur­ing training

beren6 Mar 2023 19:24 UTC
20 points
2 comments5 min readLW link

The Translu­cent Thoughts Hy­pothe­ses and Their Implications

Fabien Roger9 Mar 2023 16:30 UTC
81 points
6 comments19 min readLW link

Paper Repli­ca­tion Walk­through: Re­v­erse-Eng­ineer­ing Mo­du­lar Addition

Neel Nanda12 Mar 2023 13:25 UTC
18 points
0 comments1 min readLW link
(neelnanda.io)

At­tri­bu­tion Patch­ing: Ac­ti­va­tion Patch­ing At In­dus­trial Scale

Neel Nanda16 Mar 2023 21:44 UTC
38 points
9 comments58 min readLW link
(www.neelnanda.io)

How Do In­duc­tion Heads Ac­tu­ally Work in Trans­form­ers With Finite Ca­pac­ity?

Fabien Roger23 Mar 2023 9:09 UTC
22 points
0 comments5 min readLW link

Wittgen­stein and ML — pa­ram­e­ters vs architecture

Cleo Nardo24 Mar 2023 4:54 UTC
34 points
5 comments5 min readLW link

Gra­di­ent hacking

evhub16 Oct 2019 0:53 UTC
103 points
39 comments3 min readLW link2 reviews

Will trans­parency help catch de­cep­tion? Per­haps not

Matthew Barnett4 Nov 2019 20:52 UTC
43 points
5 comments7 min readLW link

Ro­hin Shah on rea­sons for AI optimism

abergal31 Oct 2019 12:10 UTC
40 points
58 comments1 min readLW link
(aiimpacts.org)

Un­der­stand­ing understanding

mthq23 Aug 2019 18:10 UTC
24 points
1 comment2 min readLW link

in­ter­pret­ing GPT: the logit lens

nostalgebraist31 Aug 2020 2:47 UTC
176 points
32 comments11 min readLW link

Challenge: know ev­ery­thing that the best go bot knows about go

DanielFilan11 May 2021 5:10 UTC
48 points
93 comments2 min readLW link
(danielfilan.com)

Spec­u­la­tions against GPT-n writ­ing al­ign­ment papers

Donald Hobson7 Jun 2021 21:13 UTC
31 points
6 comments2 min readLW link

Try­ing to ap­prox­i­mate Statis­ti­cal Models as Scor­ing Tables

Jsevillamol29 Jun 2021 17:20 UTC
18 points
2 comments9 min readLW link

Pos­si­ble re­search di­rec­tions to im­prove the mechanis­tic ex­pla­na­tion of neu­ral networks

delton1379 Nov 2021 2:36 UTC
29 points
8 comments9 min readLW link

[linkpost] Ac­qui­si­tion of Chess Knowl­edge in AlphaZero

Quintin Pope23 Nov 2021 7:55 UTC
8 points
1 comment1 min readLW link

Teaser: Hard-cod­ing Trans­former Models

MadHatter12 Dec 2021 22:04 UTC
73 points
19 comments1 min readLW link

The Nat­u­ral Ab­strac­tion Hy­poth­e­sis: Im­pli­ca­tions and Evidence

TheMcDouglas14 Dec 2021 23:14 UTC
33 points
8 comments19 min readLW link

Mechanis­tic In­ter­pretabil­ity for the MLP Lay­ers (rough early thoughts)

MadHatter24 Dec 2021 7:24 UTC
11 points
2 comments1 min readLW link
(www.youtube.com)

An Open Philan­thropy grant pro­posal: Causal rep­re­sen­ta­tion learn­ing of hu­man preferences

PabloAMC11 Jan 2022 11:28 UTC
19 points
6 comments8 min readLW link

Gears-Level Men­tal Models of Trans­former Interpretability

KevinRoWang29 Mar 2022 20:09 UTC
60 points
4 comments6 min readLW link

Progress Re­port 2

Nathan Helm-Burger30 Mar 2022 2:29 UTC
4 points
1 comment1 min readLW link

Progress re­port 3: clus­ter­ing trans­former neurons

Nathan Helm-Burger5 Apr 2022 23:13 UTC
5 points
0 comments2 min readLW link

Is GPT3 a Good Ra­tion­al­ist? - In­struc­tGPT3 [2/​2]

simeon_c7 Apr 2022 13:46 UTC
11 points
0 comments7 min readLW link

Progress Re­port 4: logit lens redux

Nathan Helm-Burger8 Apr 2022 18:35 UTC
3 points
0 comments2 min readLW link

Another list of the­o­ries of im­pact for interpretability

Beth Barnes13 Apr 2022 13:29 UTC
33 points
1 comment5 min readLW link

In­tro­duc­tion to the se­quence: In­ter­pretabil­ity Re­search for the Most Im­por­tant Century

Evan R. Murphy12 May 2022 19:59 UTC
16 points
0 comments8 min readLW link

CNN fea­ture vi­su­al­iza­tion in 50 lines of code

StefanHex26 May 2022 11:02 UTC
17 points
4 comments5 min readLW link

QNR prospects are im­por­tant for AI al­ign­ment research

Eric Drexler3 Feb 2022 15:20 UTC
82 points
10 comments11 min readLW link

Thoughts on For­mal­iz­ing Composition

Tom Lieberum7 Jun 2022 7:51 UTC
13 points
0 comments7 min readLW link

Trans­former Re­search Ques­tions from Stained Glass Windows

StefanHex8 Jun 2022 12:38 UTC
4 points
0 comments2 min readLW link

An­thropic’s SoLU (Soft­max Lin­ear Unit)

Joel Burget4 Jul 2022 18:38 UTC
15 points
1 comment4 min readLW link
(transformer-circuits.pub)

Deep neu­ral net­works are not opaque.

jem-mosig6 Jul 2022 18:03 UTC
22 points
14 comments3 min readLW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
48 points
15 comments8 min readLW link

Find­ing Skele­tons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments7 min readLW link

In­ter­pretabil­ity isn’t Free

Joel Burget4 Aug 2022 15:02 UTC
10 points
1 comment2 min readLW link

Dis­sected boxed AI

Nathan112312 Aug 2022 2:37 UTC
−8 points
2 comments1 min readLW link

In­ter­pretabil­ity Tools Are an At­tack Channel

Thane Ruthenis17 Aug 2022 18:47 UTC
42 points
14 comments1 min readLW link

A Bite Sized In­tro­duc­tion to ELK

Luk2718217 Sep 2022 0:28 UTC
5 points
0 comments6 min readLW link

The Shard The­ory Align­ment Scheme

David Udell25 Aug 2022 4:52 UTC
47 points
33 comments2 min readLW link

Solv­ing Align­ment by “solv­ing” semantics

Q Home27 Aug 2022 4:17 UTC
14 points
10 comments26 min readLW link

Search­ing for Mo­du­lar­ity in Large Lan­guage Models

8 Sep 2022 2:25 UTC
44 points
3 comments14 min readLW link

Try­ing to find the un­der­ly­ing struc­ture of com­pu­ta­tional systems

Matthias G. Mayer13 Sep 2022 21:16 UTC
17 points
9 comments4 min readLW link

Co­or­di­nate-Free In­ter­pretabil­ity Theory

johnswentworth14 Sep 2022 23:33 UTC
44 points
16 comments5 min readLW link

Math­e­mat­i­cal Cir­cuits in Neu­ral Networks

Sean Osier22 Sep 2022 3:48 UTC
34 points
4 comments1 min readLW link
(www.youtube.com)

Re­call and Re­gur­gi­ta­tion in GPT2

Megan Kinniment3 Oct 2022 19:35 UTC
41 points
1 comment26 min readLW link

Hard-Cod­ing Neu­ral Computation

MadHatter13 Dec 2021 4:35 UTC
32 points
8 comments27 min readLW link

Vi­su­al­iz­ing Learned Rep­re­sen­ta­tions of Rice Disease

muhia_bee3 Oct 2022 9:09 UTC
7 points
0 comments4 min readLW link
(indecisive-sand-24a.notion.site)

Nat­u­ral Cat­e­gories Update

Logan Zoellner10 Oct 2022 15:19 UTC
32 points
6 comments2 min readLW link

Help out Red­wood Re­search’s in­ter­pretabil­ity team by find­ing heuris­tics im­ple­mented by GPT-2 small

12 Oct 2022 21:25 UTC
50 points
11 comments4 min readLW link

Interpretability

29 Oct 2021 7:28 UTC
59 points
13 comments12 min readLW link

Causal scrub­bing: Appendix

3 Dec 2022 0:58 UTC
17 points
4 comments20 min readLW link

Causal Scrub­bing: a method for rigor­ously test­ing in­ter­pretabil­ity hy­pothe­ses [Red­wood Re­search]

3 Dec 2022 0:58 UTC
181 points
23 comments20 min readLW link

Some Les­sons Learned from Study­ing Indi­rect Ob­ject Iden­ti­fi­ca­tion in GPT-2 small

28 Oct 2022 23:55 UTC
95 points
7 comments9 min readLW link
(arxiv.org)

Au­dit­ing games for high-level interpretability

Paul Colognese1 Nov 2022 10:44 UTC
29 points
1 comment7 min readLW link

Mechanis­tic In­ter­pretabil­ity as Re­v­erse Eng­ineer­ing (fol­low-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet)3 Nov 2022 23:19 UTC
28 points
3 comments1 min readLW link

Toy Models and Tegum Products

Adam Jermyn4 Nov 2022 18:51 UTC
28 points
7 comments5 min readLW link

Why I’m Work­ing On Model Ag­nos­tic Interpretability

Jessica Rumbelow11 Nov 2022 9:24 UTC
28 points
9 comments2 min readLW link

The limited up­side of interpretability

Peter S. Park15 Nov 2022 18:46 UTC
13 points
11 comments1 min readLW link

Cur­rent themes in mechanis­tic in­ter­pretabil­ity research

16 Nov 2022 14:14 UTC
85 points
3 comments12 min readLW link

Eng­ineer­ing Monose­man­tic­ity in Toy Models

18 Nov 2022 1:43 UTC
73 points
6 comments3 min readLW link
(arxiv.org)

The Ground Truth Prob­lem (Or, Why Eval­u­at­ing In­ter­pretabil­ity Meth­ods Is Hard)

Jessica Rumbelow17 Nov 2022 11:06 UTC
27 points
2 comments2 min readLW link

By De­fault, GPTs Think In Plain Sight

Fabien Roger19 Nov 2022 19:15 UTC
68 points
22 comments9 min readLW link

Us­ing mechanis­tic in­ter­pretabil­ity to find in-dis­tri­bu­tion failure in toy transformers

Charlie George28 Nov 2022 19:39 UTC
6 points
0 comments4 min readLW link

Multi-Com­po­nent Learn­ing and S-Curves

30 Nov 2022 1:37 UTC
57 points
24 comments7 min readLW link

Causal scrub­bing: re­sults on a paren bal­ance checker

3 Dec 2022 0:59 UTC
33 points
2 comments30 min readLW link

Causal scrub­bing: re­sults on in­duc­tion heads

3 Dec 2022 0:59 UTC
34 points
0 comments17 min readLW link

Is the “Valley of Con­fused Ab­strac­tions” real?

jacquesthibs5 Dec 2022 13:36 UTC
19 points
10 comments2 min readLW link

An ex­plo­ra­tion of GPT-2′s em­bed­ding weights

Adam Scherlis13 Dec 2022 0:46 UTC
38 points
2 comments10 min readLW link

[In­terim re­search re­port] Tak­ing fea­tures out of su­per­po­si­tion with sparse autoencoders

13 Dec 2022 15:41 UTC
85 points
14 comments22 min readLW link

How “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion” Fits Into a Broader Align­ment Scheme

Collin15 Dec 2022 18:22 UTC
222 points
31 comments16 min readLW link

Why mechanis­tic in­ter­pretabil­ity does not and can­not con­tribute to long-term AGI safety (from mes­sages with a friend)

Remmelt19 Dec 2022 12:02 UTC
0 points
8 comments31 min readLW link

Some Notes on the math­e­mat­ics of Toy Au­toen­cod­ing Problems

Spencer Becker-Kahn22 Dec 2022 17:21 UTC
13 points
0 comments12 min readLW link

In­ter­nal In­ter­faces Are a High-Pri­or­ity In­ter­pretabil­ity Target

Thane Ruthenis29 Dec 2022 17:49 UTC
29 points
6 comments7 min readLW link

But is it re­ally in Rome? An in­ves­ti­ga­tion of the ROME model edit­ing technique

jacquesthibs30 Dec 2022 2:40 UTC
86 points
1 comment18 min readLW link

[Question] Are Mix­ture-of-Ex­perts Trans­form­ers More In­ter­pretable Than Dense Trans­form­ers?

simeon_c31 Dec 2022 11:34 UTC
7 points
4 comments1 min readLW link

In­duc­tion heads—illustrated

TheMcDouglas2 Jan 2023 15:35 UTC
61 points
4 comments3 min readLW link

On the Im­por­tance of Open Sourc­ing Re­ward Models

elandgre2 Jan 2023 19:01 UTC
17 points
5 comments6 min readLW link

Ba­sic Facts about Lan­guage Model Internals

4 Jan 2023 13:01 UTC
115 points
17 comments9 min readLW link

AI psy­chol­ogy should ground the the­o­ries of AI con­scious­ness and in­form hu­man-AI eth­i­cal in­ter­ac­tion design

Roman Leventov8 Jan 2023 6:37 UTC
19 points
8 comments2 min readLW link

Try­ing to iso­late ob­jec­tives: ap­proaches to­ward high-level interpretability

Jozdien9 Jan 2023 18:33 UTC
44 points
14 comments8 min readLW link

The AI Con­trol Prob­lem in a wider in­tel­lec­tual context

philosophybear13 Jan 2023 0:28 UTC
11 points
3 comments12 min readLW link

Can we effi­ciently dis­t­in­guish differ­ent mechanisms?

paulfchristiano27 Dec 2022 0:20 UTC
85 points
25 comments16 min readLW link
(ai-alignment.com)

Neu­ral net­works gen­er­al­ize be­cause of this one weird trick

Jesse Hoogland18 Jan 2023 0:10 UTC
117 points
25 comments15 min readLW link
(www.jessehoogland.com)

Reflec­tions on Trust­ing Trust & AI

Itay Yona16 Jan 2023 6:36 UTC
10 points
1 comment3 min readLW link
(mentaleap.ai)

Large lan­guage mod­els learn to rep­re­sent the world

gjm22 Jan 2023 13:10 UTC
100 points
19 comments3 min readLW link

De­con­fus­ing “Ca­pa­bil­ities vs. Align­ment”

RobertM23 Jan 2023 4:46 UTC
27 points
7 comments2 min readLW link

How-to Trans­former Mechanis­tic In­ter­pretabil­ity—in 50 lines of code or less!

StefanHex24 Jan 2023 18:45 UTC
40 points
3 comments13 min readLW link

[RFC] Pos­si­ble ways to ex­pand on “Dis­cov­er­ing La­tent Knowl­edge in Lan­guage Models Without Su­per­vi­sion”.

25 Jan 2023 19:03 UTC
41 points
6 comments12 min readLW link

Spooky ac­tion at a dis­tance in the loss landscape

28 Jan 2023 0:22 UTC
53 points
4 comments3 min readLW link
(www.jessehoogland.com)

No Really, At­ten­tion is ALL You Need—At­ten­tion can do feed­for­ward networks

Robert_AIZI31 Jan 2023 18:48 UTC
22 points
2 comments6 min readLW link
(aizi.substack.com)

ChatGPT: Tan­tal­iz­ing af­terthoughts in search of story tra­jec­to­ries [in­duc­tion heads]

Bill Benzon3 Feb 2023 10:35 UTC
4 points
0 comments20 min readLW link

Some mis­cel­la­neous thoughts on ChatGPT, sto­ries, and me­chan­i­cal interpretability

Bill Benzon4 Feb 2023 19:35 UTC
2 points
0 comments3 min readLW link

A cir­cuit for Python doc­strings in a 4-layer at­ten­tion-only transformer

20 Feb 2023 19:35 UTC
74 points
4 comments21 min readLW link

Gra­di­ent sur­fing: the hid­den role of regularization

Jesse Hoogland6 Feb 2023 3:50 UTC
30 points
6 comments5 min readLW link
(www.jessehoogland.com)

De­ci­sion Trans­former Interpretability

6 Feb 2023 7:29 UTC
78 points
12 comments24 min readLW link

Ad­den­dum: More Effi­cient FFNs via Attention

Robert_AIZI6 Feb 2023 18:55 UTC
8 points
0 comments5 min readLW link
(aizi.substack.com)

LLM Ba­sics: Embed­ding Spaces—Trans­former To­ken Vec­tors Are Not Points in Space

NickyP13 Feb 2023 18:52 UTC
42 points
9 comments15 min readLW link

A multi-dis­ci­plinary view on AI safety research

Roman Leventov8 Feb 2023 16:50 UTC
36 points
4 comments26 min readLW link

The Eng­ineer’s In­ter­pretabil­ity Se­quence (EIS) I: Intro

scasper9 Feb 2023 16:28 UTC
41 points
23 comments3 min readLW link

EIS II: What is “In­ter­pretabil­ity”?

scasper9 Feb 2023 16:48 UTC
13 points
3 comments4 min readLW link

We Found An Neu­ron in GPT-2

11 Feb 2023 18:27 UTC
136 points
21 comments7 min readLW link
(clementneo.com)

Idea: Net­work mod­u­lar­ity and in­ter­pretabil­ity by sex­ual reproduction

qbolec12 Feb 2023 23:06 UTC
3 points
3 comments1 min readLW link

Ex­plain­ing SolidGoldMag­ikarp by look­ing at it from ran­dom directions

Robert_AIZI14 Feb 2023 14:54 UTC
8 points
0 comments8 min readLW link
(aizi.substack.com)

EIS III: Broad Cri­tiques of In­ter­pretabil­ity Research

scasper14 Feb 2023 18:24 UTC
11 points
2 comments11 min readLW link

EIS IV: A Spotlight on Fea­ture At­tri­bu­tion/​Saliency

scasper15 Feb 2023 18:46 UTC
11 points
1 comment4 min readLW link

EIS VI: Cri­tiques of Mechanis­tic In­ter­pretabil­ity Work in AI Safety

scasper17 Feb 2023 20:48 UTC
32 points
6 comments12 min readLW link

EIS VII: A Challenge for Mechanists

scasper18 Feb 2023 18:27 UTC
22 points
4 comments3 min readLW link

The shal­low re­al­ity of ‘deep learn­ing the­ory’

Jesse Hoogland22 Feb 2023 4:16 UTC
23 points
11 comments3 min readLW link
(www.jessehoogland.com)

EIS VIII: An Eng­ineer’s Un­der­stand­ing of De­cep­tive Alignment

scasper19 Feb 2023 15:25 UTC
14 points
4 comments4 min readLW link

EIS IX: In­ter­pretabil­ity and Adversaries

scasper20 Feb 2023 18:25 UTC
18 points
1 comment8 min readLW link

EIS X: Con­tinual Learn­ing, Mo­du­lar­ity, Com­pres­sion, and Biolog­i­cal Brains

scasper21 Feb 2023 16:59 UTC
12 points
3 comments3 min readLW link

EIS XI: Mov­ing Forward

scasper22 Feb 2023 19:05 UTC
13 points
2 comments9 min readLW link

Search­ing for a model’s con­cepts by their shape – a the­o­ret­i­cal framework

23 Feb 2023 20:14 UTC
36 points
0 comments19 min readLW link

EIS XII: Sum­mary

scasper23 Feb 2023 17:45 UTC
12 points
0 comments6 min readLW link

In­ter­pret­ing Embed­ding Spaces by Conceptualization

Adi Simhi28 Feb 2023 18:38 UTC
3 points
0 comments1 min readLW link
(arxiv.org)

In­side the mind of a su­per­hu­man Go model: How does Leela Zero read lad­ders?

Haoxing Du1 Mar 2023 1:47 UTC
137 points
8 comments30 min readLW link

My cur­rent think­ing about ChatGPT @3QD [Gär­den­fors, Wolfram, and the value of spec­u­la­tion]

Bill Benzon1 Mar 2023 10:50 UTC
2 points
0 comments5 min readLW link

ChatGPT tells sto­ries, and a note about re­verse en­g­ineer­ing: A Work­ing Paper

Bill Benzon3 Mar 2023 15:12 UTC
3 points
0 comments3 min readLW link

In­tro­duc­ing Leap Labs, an AI in­ter­pretabil­ity startup

Jessica Rumbelow6 Mar 2023 16:16 UTC
90 points
11 comments1 min readLW link

Against LLM Reductionism

Erich_Grunewald8 Mar 2023 15:52 UTC
126 points
16 comments18 min readLW link
(www.erichgrunewald.com)

Creat­ing a Dis­cord server for Mechanis­tic In­ter­pretabil­ity Projects

Victor Levoso12 Mar 2023 18:00 UTC
25 points
6 comments2 min readLW link

Em­piri­cal risk min­i­miza­tion is fun­da­men­tally confused

Jesse Hoogland22 Mar 2023 16:58 UTC
23 points
5 comments10 min readLW link