
Transparency / Interpretability (ML & AI)


Transparency and interpretability refer to the extent to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable: you can use a model’s output, but the model can’t tell you why it produced that output. This makes it hard to determine the causes of bias in ML models.
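To make the problem concrete, here is a minimal sketch (illustrative only, not taken from any of the posts below) of one of the simplest interpretability probes: gradient-based saliency, which scores each input feature by how sensitive the model’s chosen output is to it. The toy model and random input are assumptions of the sketch.

```python
# A minimal sketch of gradient-based saliency (illustrative; the toy model and
# random input below are assumptions, not part of any referenced post).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "black box": a small, untrained MLP classifier over 10 input features.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

x = torch.randn(1, 10, requires_grad=True)  # one input we would like explained
logits = model(x)
pred = logits.argmax(dim=1).item()

# Backpropagate the predicted class's logit down to the input features.
logits[0, pred].backward()

# |d(logit)/d(x_i)| serves as a crude per-feature sensitivity ("saliency") score.
saliency = x.grad.abs().squeeze()
for i, score in enumerate(saliency.tolist()):
    print(f"feature {i}: saliency {score:.4f}")
```

Scores like these are at best a local, approximate answer to “why did the model produce this output?”, which is part of why interpretability remains an open research area.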

Interpretability in ML: A Broad Overview

lifelonglearner · 4 Aug 2020 19:03 UTC
41 points
5 comments · 15 min read · LW link

Transparency and AGI safety

jylin04 · 11 Jan 2021 18:51 UTC
50 points
12 comments · 30 min read · LW link

What is Interpretability?

17 Mar 2020 20:23 UTC
33 points
0 comments · 11 min read · LW link

Chris Olah’s views on AGI safety

evhub · 1 Nov 2019 20:13 UTC
140 points
38 comments · 12 min read · LW link · 2 nominations · 2 reviews

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

3 Sep 2020 18:27 UTC
60 points
11 comments · 2 min read · LW link

An Analytic Perspective on AI Alignment

DanielFilan · 1 Mar 2020 4:10 UTC
53 points
45 comments · 8 min read · LW link
(danielfilan.com)

Verification and Transparency

DanielFilan · 8 Aug 2019 1:50 UTC
34 points
6 comments · 2 min read · LW link
(danielfilan.com)

Mechanistic Transparency for Machine Learning

DanielFilan · 11 Jul 2018 0:34 UTC
55 points
9 comments · 4 min read · LW link

How can Interpretability help Alignment?

23 May 2020 16:16 UTC
33 points
3 comments · 9 min read · LW link

One Way to Think About ML Transparency

Matthew Barnett · 2 Sep 2019 23:27 UTC
26 points
28 comments · 5 min read · LW link

Relaxed adversarial training for inner alignment

evhub · 10 Sep 2019 23:03 UTC
54 points
10 comments · 27 min read · LW link

Sparsity and interpretability?

1 Jun 2020 13:25 UTC
40 points
3 comments · 7 min read · LW link

Search versus design

alexflint · 16 Aug 2020 16:53 UTC
83 points
39 comments · 36 min read · LW link

Inner Alignment in Salt-Starved Rats

Steven Byrnes · 19 Nov 2020 2:40 UTC
111 points
31 comments · 11 min read · LW link

Multi-dimensional rewards for AGI interpretability and control

Steven Byrnes · 4 Jan 2021 3:08 UTC
10 points
5 comments · 10 min read · LW link

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob Bensinger · 5 Mar 2021 23:43 UTC
124 points
13 comments · 26 min read · LW link

Transparency Trichotomy

Mark Xu · 28 Mar 2021 20:26 UTC
20 points
2 comments · 7 min read · LW link

Solving the whole AGI control problem, version 0.0001

Steven Byrnes · 8 Apr 2021 15:14 UTC
41 points
4 comments · 26 min read · LW link

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

9 Apr 2021 19:19 UTC
111 points
13 comments · 102 min read · LW link

Gradient hacking

evhub · 16 Oct 2019 0:53 UTC
74 points
34 comments · 3 min read · LW link · 2 nominations · 2 reviews

Will transparency help catch deception? Perhaps not

Matthew Barnett · 4 Nov 2019 20:52 UTC
43 points
5 comments · 7 min read · LW link

Rohin Shah on reasons for AI optimism

abergal · 31 Oct 2019 12:10 UTC
40 points
58 comments · 1 min read · LW link
(aiimpacts.org)

Understanding understanding

mthq · 23 Aug 2019 18:10 UTC
24 points
1 comment · 2 min read · LW link

interpreting GPT: the logit lens

nostalgebraist · 31 Aug 2020 2:47 UTC
107 points
28 comments · 10 min read · LW link