
Transparency / Interpretability (ML & AI)

Last edit: 8 May 2022 5:16 UTC by Arjun Yadav

Transparency and interpretability refer to the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model's output, but the model can't tell you why it produced that output. This makes it hard, for example, to determine the cause of biases in ML models.
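
As a rough illustration (a minimal sketch using scikit-learn; the models and dataset are illustrative choices, not drawn from any of the posts below): a shallow decision tree can print the rules behind its predictions, whereas a small neural network trained on the same data only exposes raw weight matrices.

# Minimal sketch: an interpretable model vs. an opaque one on the same toy data.
# Assumes scikit-learn is installed; model and dataset choices are illustrative.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# The decision tree can report the "why" behind its predictions as if/else rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))

# The neural net gives a usable answer, but its internals are just weight matrices.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X, y)
print(mlp.predict(X[:1]))               # the output you can use
print([w.shape for w in mlp.coefs_])    # the "explanation" you get: raw parameters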

Interpretability in ML: A Broad Overview

lifelonglearner · 4 Aug 2020 19:03 UTC
49 points
5 comments · 15 min read · LW link

Transparency and AGI safety

jylin04 · 11 Jan 2021 18:51 UTC
51 points
12 comments · 30 min read · LW link

What is Interpretability?

17 Mar 2020 20:23 UTC
34 points
0 comments · 11 min read · LW link

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

9 Apr 2021 19:19 UTC
127 points
15 comments · 102 min read · LW link

The Case for Radical Optimism about Interpretability

Quintin Pope · 16 Dec 2021 23:38 UTC
51 points
15 comments · 8 min read · LW link

Takeaways From 3 Years Working In Machine Learning

George · 8 Apr 2022 17:14 UTC
34 points
10 comments · 11 min read · LW link
(www.epistem.ink)

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy · 12 May 2022 20:01 UTC
41 points
0 comments · 59 min read · LW link

Chris Olah’s views on AGI safety

evhub · 1 Nov 2019 20:13 UTC
191 points
38 comments · 12 min read · LW link · 2 reviews

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

3 Sep 2020 18:27 UTC
64 points
12 comments · 2 min read · LW link

[Proposal] Method of locating useful subnets in large models

Quintin Pope · 13 Oct 2021 20:52 UTC
9 points
0 comments · 2 min read · LW link

Introduction to inaccessible information

Ryan Kidd · 9 Dec 2021 1:28 UTC
27 points
6 comments · 8 min read · LW link

Transformer Circuits

evhub · 22 Dec 2021 21:09 UTC
128 points
4 comments · 3 min read · LW link
(transformer-circuits.pub)

A Longlist of Theories of Impact for Interpretability

Neel Nanda · 11 Mar 2022 14:55 UTC
102 points
28 comments · 5 min read · LW link

An Analytic Perspective on AI Alignment

DanielFilan · 1 Mar 2020 4:10 UTC
53 points
45 comments · 8 min read · LW link
(danielfilan.com)

Verification and Transparency

DanielFilan · 8 Aug 2019 1:50 UTC
34 points
6 comments · 2 min read · LW link
(danielfilan.com)

Mechanistic Transparency for Machine Learning

DanielFilan · 11 Jul 2018 0:34 UTC
54 points
9 comments · 4 min read · LW link

How can Interpretability help Alignment?

23 May 2020 16:16 UTC
36 points
3 comments · 9 min read · LW link

One Way to Think About ML Transparency

Matthew Barnett · 2 Sep 2019 23:27 UTC
26 points
28 comments · 5 min read · LW link

Relaxed adversarial training for inner alignment

evhub · 10 Sep 2019 23:03 UTC
57 points
22 comments · 27 min read · LW link

Sparsity and interpretability?

1 Jun 2020 13:25 UTC
41 points
3 comments · 7 min read · LW link

Search versus design

Alex Flint · 16 Aug 2020 16:53 UTC
86 points
40 comments · 36 min read · LW link · 1 review

Inner Alignment in Salt-Starved Rats

Steven Byrnes · 19 Nov 2020 2:40 UTC
124 points
38 comments · 11 min read · LW link · 2 reviews

Multi-dimensional rewards for AGI interpretability and control

Steven Byrnes · 4 Jan 2021 3:08 UTC
11 points
7 comments · 10 min read · LW link

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob Bensinger · 5 Mar 2021 23:43 UTC
134 points
13 comments · 26 min read · LW link

Transparency Trichotomy

Mark Xu · 28 Mar 2021 20:26 UTC
22 points
2 comments · 7 min read · LW link

Solving the whole AGI control problem, version 0.0001

Steven Byrnes · 8 Apr 2021 15:14 UTC
58 points
7 comments · 26 min read · LW link

Knowledge Neurons in Pretrained Transformers

evhub · 17 May 2021 22:54 UTC
98 points
7 comments · 2 min read · LW link
(arxiv.org)

Garrabrant and Shah on human modeling in AGI

Rob Bensinger · 4 Aug 2021 4:35 UTC
57 points
10 comments · 47 min read · LW link

Neural net / decision tree hybrids: a potential path toward bridging the interpretability gap

Nathan Helm-Burger · 23 Sep 2021 0:38 UTC
19 points
2 comments · 12 min read · LW link

Let’s buy out Cyc, for use in AGI interpretability systems?

Steven Byrnes · 7 Dec 2021 20:46 UTC
47 points
10 comments · 2 min read · LW link

Solving Interpretability Week

Logan Riggs · 13 Dec 2021 17:09 UTC
11 points
5 comments · 1 min read · LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda · 15 Dec 2021 23:44 UTC
104 points
9 comments · 16 min read · LW link

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg · 12 Feb 2022 19:13 UTC
5 points
1 comment · 7 min read · LW link

Progress Report 1: interpretability experiments & learning, testing compression hypotheses

Nathan Helm-Burger · 22 Mar 2022 20:12 UTC
10 points
0 comments · 2 min read · LW link

[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation

Steven Byrnes · 23 Mar 2022 12:48 UTC
23 points
5 comments · 23 min read · LW link

The case for becoming a black-box investigator of language models

Buck · 6 May 2022 14:35 UTC
116 points
18 comments · 3 min read · LW link

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

johnswentworth · 4 Jun 2022 5:41 UTC
107 points
47 comments · 2 min read · LW link

How Do Selection Theorems Relate To Interpretability?

johnswentworth · 9 Jun 2022 19:39 UTC
55 points
11 comments · 3 min read · LW link

Progress Report 6: get the tool working

Nathan Helm-Burger · 10 Jun 2022 11:18 UTC
4 points
0 comments · 2 min read · LW link

[Question] Can you MRI a deep learning model?

Yair Halberstadt · 13 Jun 2022 13:43 UTC
3 points
3 comments · 1 min read · LW link

A transparency and interpretability tech tree

evhub · 16 Jun 2022 23:44 UTC
110 points
9 comments · 19 min read · LW link

Gradient hacking

evhub · 16 Oct 2019 0:53 UTC
94 points
39 comments · 3 min read · LW link · 2 reviews

Will transparency help catch deception? Perhaps not

Matthew Barnett · 4 Nov 2019 20:52 UTC
43 points
5 comments · 7 min read · LW link

Rohin Shah on reasons for AI optimism

abergal · 31 Oct 2019 12:10 UTC
40 points
58 comments · 1 min read · LW link
(aiimpacts.org)

Understanding understanding

mthq · 23 Aug 2019 18:10 UTC
24 points
1 comment · 2 min read · LW link

interpreting GPT: the logit lens

nostalgebraist · 31 Aug 2020 2:47 UTC
137 points
32 comments · 11 min read · LW link

Challenge: know everything that the best go bot knows about go

DanielFilan · 11 May 2021 5:10 UTC
48 points
93 comments · 2 min read · LW link
(danielfilan.com)

Speculations against GPT-n writing alignment papers

Donald Hobson · 7 Jun 2021 21:13 UTC
31 points
6 comments · 2 min read · LW link

Trying to approximate Statistical Models as Scoring Tables

Jsevillamol · 29 Jun 2021 17:20 UTC
18 points
2 comments · 9 min read · LW link

Possible research directions to improve the mechanistic explanation of neural networks

delton137 · 9 Nov 2021 2:36 UTC
29 points
8 comments · 9 min read · LW link

[linkpost] Acquisition of Chess Knowledge in AlphaZero

Quintin Pope · 23 Nov 2021 7:55 UTC
8 points
1 comment · 1 min read · LW link

Visible Thoughts Project and Bounty Announcement

So8res · 30 Nov 2021 0:19 UTC
239 points
104 comments · 13 min read · LW link

Teaser: Hard-coding Transformer Models

MadHatter · 12 Dec 2021 22:04 UTC
71 points
19 comments · 1 min read · LW link

The Natural Abstraction Hypothesis: Implications and Evidence

TheMcDouglas · 14 Dec 2021 23:14 UTC
25 points
8 comments · 19 min read · LW link

Mechanistic Interpretability for the MLP Layers (rough early thoughts)

MadHatter · 24 Dec 2021 7:24 UTC
9 points
2 comments · 1 min read · LW link
(www.youtube.com)

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMC · 11 Jan 2022 11:28 UTC
18 points
6 comments · 8 min read · LW link

Gears-Level Mental Models of Transformer Interpretability

KevinRoWang · 29 Mar 2022 20:09 UTC
45 points
3 comments · 6 min read · LW link

Progress Report 2

Nathan Helm-Burger · 30 Mar 2022 2:29 UTC
4 points
1 comment · 1 min read · LW link

Progress report 3: clustering transformer neurons

Nathan Helm-Burger · 5 Apr 2022 23:13 UTC
5 points
0 comments · 2 min read · LW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

WayZ · 7 Apr 2022 13:46 UTC
10 points
0 comments · 7 min read · LW link

Progress Report 4: logit lens redux

Nathan Helm-Burger · 8 Apr 2022 18:35 UTC
3 points
0 comments · 2 min read · LW link

Another list of theories of impact for interpretability

Beth Barnes · 13 Apr 2022 13:29 UTC
30 points
1 comment · 5 min read · LW link

Introduction to the sequence: Interpretability Research for the Most Important Century

Evan R. Murphy · 12 May 2022 19:59 UTC
16 points
0 comments · 8 min read · LW link

CNN feature visualization in 50 lines of code

StefanHex · 26 May 2022 11:02 UTC
17 points
4 comments · 5 min read · LW link

QNR prospects are important for AI alignment research

Eric Drexler · 3 Feb 2022 15:20 UTC
81 points
10 comments · 11 min read · LW link

Thoughts on Formalizing Composition

Tom Lieberum · 7 Jun 2022 7:51 UTC
10 points
0 comments · 7 min read · LW link

Transformer Research Questions from Stained Glass Windows

StefanHex · 8 Jun 2022 12:38 UTC
4 points
0 comments · 2 min read · LW link

Alignment research for “meta” purposes

acylhalide · 16 Jun 2022 14:03 UTC
12 points
0 comments · 1 min read · LW link