Transformer Circuits

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
33 points
5 comments · 2 min read · LW link
(arxiv.org)

Interpreting OpenAI’s Whisper

EllenaR · 24 Sep 2023 17:53 UTC
112 points
10 comments · 7 min read · LW link

Understanding the tensor product formulation in Transformer Circuits

Tom Lieberum · 24 Dec 2021 18:05 UTC
16 points
2 comments · 3 min read · LW link

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda · 7 Nov 2022 22:39 UTC
30 points
15 comments · 3 min read · LW link
(youtu.be)

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda · 22 Nov 2022 17:12 UTC
20 points
0 comments · 1 min read · LW link
(www.youtube.com)

200 COP in MI: Looking for Circuits in the Wild

Neel Nanda · 29 Dec 2022 20:59 UTC
16 points
5 comments · 13 min read · LW link

200 COP in MI: Interpreting Algorithmic Problems

Neel Nanda · 31 Dec 2022 19:55 UTC
33 points
2 comments · 10 min read · LW link

200 COP in MI: Exploring Polysemanticity and Superposition

Neel Nanda · 3 Jan 2023 1:52 UTC
34 points
6 comments · 16 min read · LW link

200 COP in MI: Analysing Training Dynamics

Neel Nanda · 4 Jan 2023 16:08 UTC
16 points
0 comments · 14 min read · LW link

200 COP in MI: Techniques, Tooling and Automation

Neel Nanda · 6 Jan 2023 15:08 UTC
13 points
0 comments · 15 min read · LW link

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

Neel Nanda · 28 Dec 2022 21:06 UTC
102 points
0 comments · 10 min read · LW link

Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy

Neel Nanda · 29 Aug 2023 22:07 UTC
36 points
1 comment · 1 min read · LW link
(www.youtube.com)

How to Think About Activation Patching

Neel Nanda · 4 Jun 2023 14:17 UTC
47 points
5 comments · 20 min read · LW link
(www.neelnanda.io)

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
66 points
5 comments · 10 min read · LW link

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

20 Jul 2023 10:50 UTC
43 points
3 comments · 2 min read · LW link
(arxiv.org)

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter · 25 Apr 2023 13:45 UTC
8 points
0 comments · 15 min read · LW link

Addendum: More Efficient FFNs via Attention

Robert_AIZI · 6 Feb 2023 18:55 UTC
10 points
2 comments · 5 min read · LW link
(aizi.substack.com)

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
82 points
5 comments · 19 min read · LW link

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill · 24 Mar 2024 20:05 UTC
28 points
3 comments · 24 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
46 points
0 comments · 6 min read · LW link

AISC project: TinyEvals

Jett · 22 Nov 2023 20:47 UTC
17 points
0 comments · 4 min read · LW link

An Analogy for Understanding Transformers

CallumMcDougall · 13 May 2023 12:20 UTC
80 points
5 comments · 9 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

29 Aug 2023 1:04 UTC
74 points
3 comments · 1 min read · LW link

Automatically finding feature vectors in the OV circuits of Transformers without using probing

Jacob Dunefsky · 12 Sep 2023 17:38 UTC
13 points
0 comments · 29 min read · LW link

Graphical tensor notation for interpretability

Jordan Taylor · 4 Oct 2023 8:04 UTC
129 points
11 comments · 19 min read · LW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments · 8 min read · LW link
(arxiv.org)

Anthropic’s SoLU (Softmax Linear Unit)

Joel Burget · 4 Jul 2022 18:38 UTC
21 points
1 comment · 4 min read · LW link
(transformer-circuits.pub)

No Really, Attention is ALL You Need—Attention can do feedforward networks

Robert_AIZI · 31 Jan 2023 18:48 UTC
29 points
7 comments · 6 min read · LW link
(aizi.substack.com)