Transformer Circuits

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
33 points
5 comments · 2 min read · LW link
(arxiv.org)

Interpreting OpenAI’s Whisper

EllenaR · 24 Sep 2023 17:53 UTC
112 points
10 comments · 7 min read · LW link

Understanding the tensor product formulation in Transformer Circuits

Tom Lieberum · 24 Dec 2021 18:05 UTC
16 points
2 comments · 3 min read · LW link

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda · 7 Nov 2022 22:39 UTC
30 points
15 comments · 3 min read · LW link
(youtu.be)

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda · 22 Nov 2022 17:12 UTC
20 points
0 comments · 1 min read · LW link
(www.youtube.com)

200 COP in MI: Looking for Circuits in the Wild

Neel Nanda · 29 Dec 2022 20:59 UTC
16 points
5 comments · 13 min read · LW link

200 COP in MI: Interpreting Algorithmic Problems

Neel Nanda · 31 Dec 2022 19:55 UTC
33 points
2 comments · 10 min read · LW link

200 COP in MI: Exploring Polysemanticity and Superposition

Neel Nanda · 3 Jan 2023 1:52 UTC
34 points
6 comments · 16 min read · LW link

200 COP in MI: Analysing Training Dynamics

Neel Nanda · 4 Jan 2023 16:08 UTC
16 points
0 comments · 14 min read · LW link

200 COP in MI: Techniques, Tooling and Automation

Neel Nanda · 6 Jan 2023 15:08 UTC
13 points
0 comments · 15 min read · LW link

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

Neel Nanda · 28 Dec 2022 21:06 UTC
102 points
0 comments · 10 min read · LW link

Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy

Neel Nanda · 29 Aug 2023 22:07 UTC
36 points
1 comment · 1 min read · LW link
(www.youtube.com)

How to Think About Activation Patching

Neel Nanda · 4 Jun 2023 14:17 UTC
47 points
5 comments · 20 min read · LW link
(www.neelnanda.io)

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
66 points
5 comments · 10 min read · LW link

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

20 Jul 2023 10:50 UTC
43 points
3 comments · 2 min read · LW link
(arxiv.org)

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter · 25 Apr 2023 13:45 UTC
8 points
0 comments · 15 min read · LW link

Addendum: More Efficient FFNs via Attention

Robert_AIZI · 6 Feb 2023 18:55 UTC
10 points
2 comments · 5 min read · LW link
(aizi.substack.com)

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
82 points
5 comments · 19 min read · LW link

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill · 24 Mar 2024 20:05 UTC
28 points
3 comments · 24 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
46 points
0 comments · 6 min read · LW link

AISC project: TinyEvals

Jett · 22 Nov 2023 20:47 UTC
17 points
0 comments · 4 min read · LW link

An Analogy for Understanding Transformers

CallumMcDougall · 13 May 2023 12:20 UTC
80 points
5 comments · 9 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

29 Aug 2023 1:04 UTC
74 points
3 comments · 1 min read · LW link

Automatically finding feature vectors in the OV circuits of Transformers without using probing

Jacob Dunefsky · 12 Sep 2023 17:38 UTC
13 points
0 comments · 29 min read · LW link

Graphical tensor notation for interpretability

Jordan Taylor · 4 Oct 2023 8:04 UTC
129 points
11 comments · 19 min read · LW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments · 8 min read · LW link
(arxiv.org)

Anthropic’s SoLU (Softmax Linear Unit)

Joel Burget · 4 Jul 2022 18:38 UTC
21 points
1 comment · 4 min read · LW link
(transformer-circuits.pub)

No Really, Attention is ALL You Need—Attention can do feedforward networks

Robert_AIZI · 31 Jan 2023 18:48 UTC
29 points
7 comments · 6 min read · LW link
(aizi.substack.com)