
Interpretability (ML & AI)

Last edit: 20 Oct 2022 23:51 UTC by Ruby

Transparency and interpretability refer to the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model’s output, but the model cannot tell you why it produced that output. This makes it hard to determine the causes of biases in ML models.
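
As a concrete illustration of why understanding a model takes extra tooling, here is a minimal sketch (not drawn from this tag’s text) of one common attribution technique, gradient-times-input saliency, applied to a toy PyTorch classifier; the model, features, and numbers are purely illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy classifier standing in for an opaque model: 4 input features, 2 classes.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

x = torch.randn(1, 4, requires_grad=True)   # one illustrative input example
logits = model(x)
pred = logits.argmax(dim=-1).item()

# Backpropagate the predicted-class logit to the input features.
logits[0, pred].backward()

# Gradient x input gives a rough per-feature attribution for this prediction.
attribution = (x.grad * x).detach().squeeze()
print("predicted class:", pred)
print("per-feature attribution:", attribution.tolist())
```

Techniques like this give only coarse, local hints about a model’s reasoning; much of the work listed below aims at deeper, more mechanistic explanations.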

Interpretability in ML: A Broad Overview

lifelonglearner · 4 Aug 2020 19:03 UTC
52 points
5 comments · 15 min read · LW link

Re-Examining LayerNorm

Eric Winsor · 1 Dec 2022 22:20 UTC
90 points
8 comments · 5 min read · LW link

Chris Olah’s views on AGI safety

evhub · 1 Nov 2019 20:13 UTC
197 points
38 comments · 12 min read · LW link · 2 reviews

How Interpretability can be Impactful

Connall Garrod · 18 Jul 2022 0:06 UTC
17 points
0 comments · 37 min read · LW link

[Question] Papers to start getting into NLP-focused alignment research

Feraidoon · 24 Sep 2022 23:53 UTC
6 points
0 comments · 1 min read · LW link

Searching for Search

28 Nov 2022 15:31 UTC
61 points
6 comments · 14 min read · LW link

Transparency and AGI safety

jylin04 · 11 Jan 2021 18:51 UTC
52 points
12 comments · 30 min read · LW link

A Longlist of Theories of Impact for Interpretability

Neel Nanda · 11 Mar 2022 14:55 UTC
106 points
29 comments · 5 min read · LW link

What is Interpretability?

17 Mar 2020 20:23 UTC
34 points
0 comments · 11 min read · LW link

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

9 Apr 2021 19:19 UTC
132 points
16 comments · 102 min read · LW link

The Case for Radical Optimism about Interpretability

Quintin Pope · 16 Dec 2021 23:38 UTC
56 points
15 comments · 8 min read · LW link

Takeaways From 3 Years Working In Machine Learning

George3d6 · 8 Apr 2022 17:14 UTC
34 points
10 comments · 11 min read · LW link
(www.epistem.ink)

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy · 12 May 2022 20:01 UTC
45 points
0 comments · 59 min read · LW link

A Mechanistic Interpretability Analysis of Grokking

15 Aug 2022 2:41 UTC
337 points
39 comments · 42 min read · LW link
(colab.research.google.com)

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

28 Nov 2022 12:54 UTC
158 points
26 comments · 31 min read · LW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

3 Sep 2020 18:27 UTC
67 points
12 comments · 2 min read · LW link

[Proposal] Method of locating useful subnets in large models

Quintin Pope · 13 Oct 2021 20:52 UTC
9 points
0 comments · 2 min read · LW link

Introduction to inaccessible information

Ryan Kidd · 9 Dec 2021 1:28 UTC
27 points
6 comments · 8 min read · LW link

Transformer Circuits

evhub · 22 Dec 2021 21:09 UTC
142 points
4 comments · 3 min read · LW link
(transformer-circuits.pub)

A transparency and interpretability tech tree

evhub · 16 Jun 2022 23:44 UTC
135 points
10 comments · 19 min read · LW link

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey · 14 Jul 2022 16:59 UTC
92 points
8 comments · 36 min read · LW link

Interpreting Neural Networks through the Polytope Lens

23 Sep 2022 17:58 UTC
123 points
26 comments · 33 min read · LW link

Theories of impact for Science of Deep Learning

Marius Hobbhahn · 1 Dec 2022 14:39 UTC
16 points
0 comments · 11 min read · LW link

An Analytic Perspective on AI Alignment

DanielFilan · 1 Mar 2020 4:10 UTC
53 points
45 comments · 8 min read · LW link
(danielfilan.com)

Verification and Transparency

DanielFilan · 8 Aug 2019 1:50 UTC
34 points
6 comments · 2 min read · LW link
(danielfilan.com)

Mechanistic Transparency for Machine Learning

DanielFilan · 11 Jul 2018 0:34 UTC
54 points
9 comments · 4 min read · LW link

How can Interpretability help Alignment?

23 May 2020 16:16 UTC
37 points
3 comments · 9 min read · LW link

One Way to Think About ML Transparency

Matthew Barnett · 2 Sep 2019 23:27 UTC
26 points
28 comments · 5 min read · LW link

Relaxed adversarial training for inner alignment

evhub · 10 Sep 2019 23:03 UTC
61 points
28 comments · 1 min read · LW link

Sparsity and interpretability?

1 Jun 2020 13:25 UTC
41 points
3 comments · 7 min read · LW link

Search versus design

Alex Flint · 16 Aug 2020 16:53 UTC
89 points
40 comments · 36 min read · LW link · 1 review

Inner Alignment in Salt-Starved Rats

Steven Byrnes · 19 Nov 2020 2:40 UTC
136 points
39 comments · 11 min read · LW link · 2 reviews

Multi-dimensional rewards for AGI interpretability and control

Steven Byrnes · 4 Jan 2021 3:08 UTC
19 points
8 comments · 10 min read · LW link

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob Bensinger · 5 Mar 2021 23:43 UTC
136 points
13 comments · 26 min read · LW link

Transparency Trichotomy

Mark Xu · 28 Mar 2021 20:26 UTC
25 points
2 comments · 7 min read · LW link

Solving the whole AGI control problem, version 0.0001

Steven Byrnes · 8 Apr 2021 15:14 UTC
59 points
7 comments · 26 min read · LW link

Knowledge Neurons in Pretrained Transformers

evhub · 17 May 2021 22:54 UTC
98 points
7 comments · 2 min read · LW link
(arxiv.org)

Garrabrant and Shah on human modeling in AGI

Rob Bensinger · 4 Aug 2021 4:35 UTC
57 points
10 comments · 47 min read · LW link

Neural net / decision tree hybrids: a potential path toward bridging the interpretability gap

Nathan Helm-Burger · 23 Sep 2021 0:38 UTC
21 points
2 comments · 12 min read · LW link

Let’s buy out Cyc, for use in AGI interpretability systems?

Steven Byrnes · 7 Dec 2021 20:46 UTC
47 points
10 comments · 2 min read · LW link

Solving Interpretability Week

Logan Riggs · 13 Dec 2021 17:09 UTC
11 points
5 comments · 1 min read · LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda · 15 Dec 2021 23:44 UTC
111 points
9 comments · 15 min read · LW link

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg · 12 Feb 2022 19:13 UTC
5 points
1 comment · 7 min read · LW link

Progress Report 1: interpretability experiments & learning, testing compression hypotheses

Nathan Helm-Burger · 22 Mar 2022 20:12 UTC
11 points
0 comments · 2 min read · LW link

[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation

Steven Byrnes · 23 Mar 2022 12:48 UTC
31 points
6 comments · 23 min read · LW link

The case for becoming a black-box investigator of language models

Buck · 6 May 2022 14:35 UTC
118 points
19 comments · 3 min read · LW link

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

johnswentworth · 4 Jun 2022 5:41 UTC
117 points
52 comments · 2 min read · LW link

How Do Selection Theorems Relate To Interpretability?

johnswentworth · 9 Jun 2022 19:39 UTC
56 points
14 comments · 3 min read · LW link

Progress Report 6: get the tool working

Nathan Helm-Burger · 10 Jun 2022 11:18 UTC
4 points
0 comments · 2 min read · LW link

[Question] Can you MRI a deep learning model?

Yair Halberstadt · 13 Jun 2022 13:43 UTC
3 points
3 comments · 1 min read · LW link

Visualizing Neural networks, how to blame the bias

Donald Hobson · 9 Jul 2022 15:52 UTC
7 points
1 comment · 6 min read · LW link

[Question] How optimistic should we be about AI figuring out how to interpret itself?

oh54321 · 25 Jul 2022 22:09 UTC
3 points
1 comment · 1 min read · LW link

Precursor checking for deceptive alignment

evhub · 3 Aug 2022 22:56 UTC
18 points
0 comments · 14 min read · LW link

Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth · 8 Aug 2022 18:05 UTC
111 points
8 comments · 3 min read · LW link

How To Go From Interpretability To Alignment: Just Retarget The Search

johnswentworth · 10 Aug 2022 16:08 UTC
142 points
30 comments · 3 min read · LW link

AI Transparency: Why it’s critical and how to obtain it.

Zohar Jackson · 14 Aug 2022 10:31 UTC
6 points
1 comment · 5 min read · LW link

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

16 Aug 2022 2:09 UTC
17 points
2 comments · 16 min read · LW link

What Makes A Good Measurement Device?

johnswentworth · 24 Aug 2022 22:45 UTC
35 points
7 comments · 2 min read · LW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett Baker · 26 Aug 2022 18:26 UTC
117 points
48 comments · 1 min read · LW link

A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly.

Michael Soareverix · 8 Sep 2022 15:20 UTC
2 points
2 comments · 2 min read · LW link

Swap and Scale

Stephen Fowler · 9 Sep 2022 22:41 UTC
17 points
3 comments · 1 min read · LW link

[Linkpost] A survey on over 300 works about interpretability in deep networks

scasper · 12 Sep 2022 19:07 UTC
96 points
7 comments · 2 min read · LW link
(arxiv.org)

Sparse trinary weighted RNNs as a path to better language model interpretability

Am8ryllis · 17 Sep 2022 19:48 UTC
19 points
13 comments · 3 min read · LW link

Toy Models of Superposition

evhub · 21 Sep 2022 23:48 UTC
63 points
2 comments · 5 min read · LW link
(transformer-circuits.pub)

QAPR 3: interpretability-guided training of neural nets

Quintin Pope · 28 Sep 2022 16:02 UTC
47 points
2 comments · 10 min read · LW link

More Recent Progress in the Theory of Neural Networks

jylin04 · 6 Oct 2022 16:57 UTC
78 points
6 comments · 4 min read · LW link

Polysemanticity and Capacity in Neural Networks

7 Oct 2022 17:51 UTC
78 points
9 comments · 3 min read · LW link

My tentative interpretability research agenda—topology matching.

Maxwell Clarke · 8 Oct 2022 22:14 UTC
10 points
2 comments · 4 min read · LW link

Article Review: Google’s AlphaTensor

Robert_AIZI · 12 Oct 2022 18:04 UTC
8 points
2 comments · 10 min read · LW link

[Question] Previous Work on Recreating Neural Network Input from Intermediate Layer Activations

bglass · 12 Oct 2022 19:28 UTC
1 point
3 comments · 1 min read · LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Neel Nanda · 18 Oct 2022 21:08 UTC
66 points
5 comments · 12 min read · LW link
(www.neelnanda.io)

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel Nanda · 24 Oct 2022 20:45 UTC
62 points
8 comments · 3 min read · LW link
(neelnanda.io)

A Walkthrough of A Mathematical Framework for Transformer Circuits

Neel Nanda · 25 Oct 2022 20:24 UTC
49 points
5 comments · 1 min read · LW link
(www.youtube.com)

[Book] Interpretable Machine Learning: A Guide for Making Black Box Models Explainable

Esben Kran · 31 Oct 2022 11:38 UTC
19 points
1 comment · 1 min read · LW link
(christophm.github.io)

“Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger (formerly: capybaralet) · 31 Oct 2022 21:26 UTC
47 points
25 comments · 2 min read · LW link

Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?

Neel Nanda · 1 Nov 2022 23:56 UTC
68 points
14 comments · 1 min read · LW link
(youtu.be)

A Mystery About High Dimensional Concept Encoding

Fabien Roger · 3 Nov 2022 17:05 UTC
46 points
13 comments · 7 min read · LW link

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda · 7 Nov 2022 22:39 UTC
29 points
15 comments · 3 min read · LW link
(youtu.be)

The Interpretability Playground

Esben Kran · 10 Nov 2022 17:15 UTC
8 points
0 comments · 1 min read · LW link
(alignmentjam.com)

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda · 22 Nov 2022 17:12 UTC
15 points
0 comments · 1 min read · LW link
(www.youtube.com)

Subsets and quotients in interpretability

Erik Jenner · 2 Dec 2022 23:13 UTC
24 points
1 comment · 7 min read · LW link

The Plan − 2022 Update

johnswentworth · 1 Dec 2022 20:43 UTC
196 points
26 comments · 8 min read · LW link

Finding gliders in the game of life

paulfchristiano · 1 Dec 2022 20:40 UTC
91 points
5 comments · 16 min read · LW link
(ai-alignment.com)

Gradient hacking

evhub · 16 Oct 2019 0:53 UTC
99 points
39 comments · 3 min read · LW link · 2 reviews

Will transparency help catch deception? Perhaps not

Matthew Barnett · 4 Nov 2019 20:52 UTC
43 points
5 comments · 7 min read · LW link

Rohin Shah on reasons for AI optimism

abergal · 31 Oct 2019 12:10 UTC
40 points
58 comments · 1 min read · LW link
(aiimpacts.org)

Understanding understanding

mthq · 23 Aug 2019 18:10 UTC
24 points
1 comment · 2 min read · LW link

interpreting GPT: the logit lens

nostalgebraist · 31 Aug 2020 2:47 UTC
157 points
32 comments · 11 min read · LW link

Challenge: know everything that the best go bot knows about go

DanielFilan · 11 May 2021 5:10 UTC
48 points
93 comments · 2 min read · LW link
(danielfilan.com)

Speculations against GPT-n writing alignment papers

Donald Hobson · 7 Jun 2021 21:13 UTC
31 points
6 comments · 2 min read · LW link

Trying to approximate Statistical Models as Scoring Tables

Jsevillamol · 29 Jun 2021 17:20 UTC
18 points
2 comments · 9 min read · LW link

Possible research directions to improve the mechanistic explanation of neural networks

delton137 · 9 Nov 2021 2:36 UTC
29 points
8 comments · 9 min read · LW link

[linkpost] Acquisition of Chess Knowledge in AlphaZero

Quintin Pope · 23 Nov 2021 7:55 UTC
8 points
1 comment · 1 min read · LW link

Teaser: Hard-coding Transformer Models

MadHatter · 12 Dec 2021 22:04 UTC
73 points
19 comments · 1 min read · LW link

The Natural Abstraction Hypothesis: Implications and Evidence

TheMcDouglas · 14 Dec 2021 23:14 UTC
30 points
8 comments · 19 min read · LW link

Mechanistic Interpretability for the MLP Layers (rough early thoughts)

MadHatter · 24 Dec 2021 7:24 UTC
11 points
2 comments · 1 min read · LW link
(www.youtube.com)

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMC · 11 Jan 2022 11:28 UTC
19 points
6 comments · 8 min read · LW link

Gears-Level Mental Models of Transformer Interpretability

KevinRoWang · 29 Mar 2022 20:09 UTC
56 points
4 comments · 6 min read · LW link

Progress Report 2

Nathan Helm-Burger · 30 Mar 2022 2:29 UTC
4 points
1 comment · 1 min read · LW link

Progress report 3: clustering transformer neurons

Nathan Helm-Burger · 5 Apr 2022 23:13 UTC
5 points
0 comments · 2 min read · LW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

WayZ · 7 Apr 2022 13:46 UTC
11 points
0 comments · 7 min read · LW link

Progress Report 4: logit lens redux

Nathan Helm-Burger · 8 Apr 2022 18:35 UTC
3 points
0 comments · 2 min read · LW link

Another list of theories of impact for interpretability

Beth Barnes · 13 Apr 2022 13:29 UTC
32 points
1 comment · 5 min read · LW link

Introduction to the sequence: Interpretability Research for the Most Important Century

Evan R. Murphy · 12 May 2022 19:59 UTC
16 points
0 comments · 8 min read · LW link

CNN feature visualization in 50 lines of code

StefanHex · 26 May 2022 11:02 UTC
17 points
4 comments · 5 min read · LW link

QNR prospects are important for AI alignment research

Eric Drexler · 3 Feb 2022 15:20 UTC
82 points
10 comments · 11 min read · LW link

Thoughts on Formalizing Composition

Tom Lieberum · 7 Jun 2022 7:51 UTC
13 points
0 comments · 7 min read · LW link

Transformer Research Questions from Stained Glass Windows

StefanHex · 8 Jun 2022 12:38 UTC
4 points
0 comments · 2 min read · LW link

Alignment research for “meta” purposes

acylhalide · 16 Jun 2022 14:03 UTC
15 points
0 comments · 1 min read · LW link

Anthropic’s SoLU (Softmax Linear Unit)

Joel Burget · 4 Jul 2022 18:38 UTC
15 points
1 comment · 4 min read · LW link
(transformer-circuits.pub)

Deep neural networks are not opaque.

jem-mosig · 6 Jul 2022 18:03 UTC
22 points
14 comments · 3 min read · LW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
49 points
15 comments · 8 min read · LW link

Finding Skeletons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments · 7 min read · LW link

Interpretability isn’t Free

Joel Burget · 4 Aug 2022 15:02 UTC
10 points
1 comment · 2 min read · LW link

Dissected boxed AI

Nathan1123 · 12 Aug 2022 2:37 UTC
−8 points
2 comments · 1 min read · LW link

Interpretability Tools Are an Attack Channel

Thane Ruthenis · 17 Aug 2022 18:47 UTC
42 points
22 comments · 1 min read · LW link

A Bite Sized Introduction to ELK

Luk27182 · 17 Sep 2022 0:28 UTC
5 points
0 comments · 6 min read · LW link

The Shard Theory Alignment Scheme

David Udell · 25 Aug 2022 4:52 UTC
47 points
33 comments · 2 min read · LW link

Solving Alignment by “solving” semantics

Q Home · 27 Aug 2022 4:17 UTC
15 points
10 comments · 26 min read · LW link

Searching for Modularity in Large Language Models

8 Sep 2022 2:25 UTC
43 points
3 comments · 14 min read · LW link

Trying to find the underlying structure of computational systems

Matthias G. Mayer · 13 Sep 2022 21:16 UTC
17 points
9 comments · 4 min read · LW link

Coordinate-Free Interpretability Theory

johnswentworth · 14 Sep 2022 23:33 UTC
41 points
14 comments · 5 min read · LW link

Mathematical Circuits in Neural Networks

Sean Osier · 22 Sep 2022 3:48 UTC
34 points
4 comments · 1 min read · LW link
(www.youtube.com)

Recall and Regurgitation in GPT2

Megan Kinniment · 3 Oct 2022 19:35 UTC
33 points
1 comment · 26 min read · LW link

Hard-Coding Neural Computation

MadHatter · 13 Dec 2021 4:35 UTC
32 points
8 comments · 27 min read · LW link

Visualizing Learned Representations of Rice Disease

muhia_bee · 3 Oct 2022 9:09 UTC
7 points
0 comments · 4 min read · LW link
(indecisive-sand-24a.notion.site)

Natural Categories Update

Logan Zoellner · 10 Oct 2022 15:19 UTC
29 points
6 comments · 2 min read · LW link

Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small

12 Oct 2022 21:25 UTC
49 points
11 comments · 4 min read · LW link

Interpretability

29 Oct 2021 7:28 UTC
59 points
13 comments · 12 min read · LW link

Causal scrubbing: Appendix

3 Dec 2022 0:58 UTC
15 points
0 comments · 20 min read · LW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

3 Dec 2022 0:58 UTC
124 points
9 comments · 20 min read · LW link

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

28 Oct 2022 23:55 UTC
86 points
5 comments · 9 min read · LW link
(arxiv.org)

Auditing games for high-level interpretability

Paul Colognese · 1 Nov 2022 10:44 UTC
26 points
1 comment · 7 min read · LW link

Mechanistic Interpretability as Reverse Engineering (follow-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet) · 3 Nov 2022 23:19 UTC
28 points
3 comments · 1 min read · LW link

Toy Models and Tegum Products

Adam Jermyn · 4 Nov 2022 18:51 UTC
27 points
7 comments · 5 min read · LW link

Why I’m Working On Model Agnostic Interpretability

Jessica Cooper · 11 Nov 2022 9:24 UTC
28 points
9 comments · 2 min read · LW link

The limited upside of interpretability

Peter S. Park · 15 Nov 2022 18:46 UTC
13 points
10 comments · 1 min read · LW link

Current themes in mechanistic interpretability research

16 Nov 2022 14:14 UTC
82 points
3 comments · 12 min read · LW link

Engineering Monosemanticity in Toy Models

18 Nov 2022 1:43 UTC
70 points
6 comments · 3 min read · LW link
(arxiv.org)

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Jessica Cooper · 17 Nov 2022 11:06 UTC
26 points
2 comments · 2 min read · LW link

By Default, GPTs Think In Plain Sight

Fabien Roger · 19 Nov 2022 19:15 UTC
58 points
16 comments · 9 min read · LW link

Using mechanistic interpretability to find in-distribution failure in toy transformers

Charlie George · 28 Nov 2022 19:39 UTC
6 points
0 comments · 4 min read · LW link

Multi-Component Learning and S-Curves

30 Nov 2022 1:37 UTC
57 points
24 comments · 7 min read · LW link

Causal scrubbing: results on a paren balance checker

3 Dec 2022 0:59 UTC
25 points
0 comments · 30 min read · LW link

Causal scrubbing: results on induction heads

3 Dec 2022 0:59 UTC
32 points
0 comments · 17 min read · LW link

Is the “Valley of Confused Abstractions” real?

jacquesthibs · 5 Dec 2022 13:36 UTC
15 points
9 comments · 2 min read · LW link