Interpretability (ML & AI)

Tag

Last edit: 22 Jan 2025 16:27 UTC by Dakara

Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable. You can use a model’s output, but the model can’t tell you why it made that output. This makes it hard to determine the cause of biases in ML models.

A prominent subfield of neural network interpretability is mechanistic interpretability, which attempts to understand how neural networks perform the tasks they perform, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that seek to attribute some output to a part of a specific input, such as clarifying which pixels in an input image caused a computer vision model to output the classification “horse”.
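As a rough illustration of the input-attribution idea, the sketch below uses occlusion: each pixel is replaced by a baseline value in turn, and the resulting drop in the model's score is recorded as that pixel's attribution. Everything here is an assumption made for illustration: the four-pixel "image", the fixed linear `toy_horse_score` function, and the weights are hypothetical stand-ins, not a real vision model or any particular library's API.

```python
from typing import Callable, List


def occlusion_attribution(model: Callable[[List[float]], float],
                          pixels: List[float],
                          baseline: float = 0.0) -> List[float]:
    """Return, for each pixel, how much the score drops when it is occluded."""
    full_score = model(pixels)
    attributions = []
    for i in range(len(pixels)):
        occluded = list(pixels)   # copy so the original input is untouched
        occluded[i] = baseline    # "remove" one pixel
        attributions.append(full_score - model(occluded))
    return attributions


# Hypothetical toy scorer for the class "horse": a fixed linear function.
WEIGHTS = [0.5, -1.0, 2.0, 0.0]


def toy_horse_score(pixels: List[float]) -> float:
    return sum(w * p for w, p in zip(WEIGHTS, pixels))


image = [1.0, 1.0, 1.0, 1.0]
attributions = occlusion_attribution(toy_horse_score, image)

# Index of the pixel whose removal changes the score the most.
most_influential = max(range(len(image)), key=lambda i: abs(attributions[i]))
```

This kind of method says which parts of one input mattered for one output. It says nothing about the internal computation that produced the score, which is the question mechanistic interpretability tries to answer.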

A small update to the Sparse Coding interim research report

Lee Sharkey, Dan Braun and beren

30 Apr 2023 19:54 UTC

62 points

5 comments 1 min read LW link

Interpretability in ML: A Broad Overview

lifelonglearner

4 Aug 2020 19:03 UTC

53 points

5 comments 15 min read LW link

Timaeus’s First Four Months

Jesse Hoogland, Daniel Murfet, Stan van Wingerden and Alexander Gietelink Oldenziel

28 Feb 2024 17:01 UTC

173 points

6 comments 6 min read LW link

Toward A Mathematical Framework for Computation in Superposition

Dmitry Vaintrob, jake_mendel and Kaarel

18 Jan 2024 21:06 UTC

214 points

19 comments 63 min read LW link

A Mechanistic Interpretability Analysis of Grokking

Neel Nanda and Tom Lieberum

15 Aug 2022 2:41 UTC

377 points

48 comments 36 min read LW link 1 review

(colab.research.google.com)

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey, Dan Braun and beren

13 Dec 2022 15:41 UTC

156 points

23 comments 22 min read LW link 2 reviews

A Problem to Solve Before Building a Deception Detector

Eleni Angelou and lewis smith

7 Feb 2025 19:35 UTC

78 points

12 comments 14 min read LW link

A Longlist of Theories of Impact for Interpretability

Neel Nanda

11 Mar 2022 14:55 UTC

128 points

41 comments 5 min read LW link 2 reviews

200 Concrete Open Problems in Mechanistic Interpretability: Introduction

Neel Nanda

28 Dec 2022 21:06 UTC

108 points

0 comments 10 min read LW link

Re-Examining LayerNorm

Eric Winsor

1 Dec 2022 22:20 UTC

128 points

12 comments 5 min read LW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

wesg and Neel Nanda

3 May 2023 13:30 UTC

33 points

6 comments 2 min read LW link 1 review

(arxiv.org)

Chris Olah’s views on AGI safety

evhub

1 Nov 2019 20:13 UTC

213 points

40 comments 12 min read LW link 2 reviews

Try training token-level probes

StefanHex

14 Apr 2025 11:56 UTC

47 points

6 comments 8 min read LW link

Searching for Search

Niki Dupuis and janus

28 Nov 2022 15:31 UTC

98 points

9 comments 14 min read LW link 1 review

Compressed Computation is (probably) not Computation in Superposition

Jai Bhagat, Sara Molas Medina, Giorgi Giglemiani and StefanHex

23 Jun 2025 19:35 UTC

59 points

9 comments 10 min read LW link

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren and Sid Black

28 Nov 2022 12:54 UTC

200 points

34 comments 31 min read LW link

A Rocket–Interpretability Analogy

plex

21 Oct 2024 13:55 UTC

166 points

33 comments 1 min read LW link 1 review

How To Go From Interpretability To Alignment: Just Retarget The Search

johnswentworth

10 Aug 2022 16:08 UTC

214 points

34 comments 3 min read LW link 1 review

[Question] Papers to start getting into NLP-focused alignment research

Feraidoon

24 Sep 2022 23:53 UTC

6 points

0 comments 1 min read LW link

Tracing the Thoughts of a Large Language Model

Adam Jermyn

27 Mar 2025 17:20 UTC

308 points

23 comments 10 min read LW link

(www.anthropic.com)

Against Almost Every Theory of Impact of Interpretability

Charbel-Raphaël

17 Aug 2023 18:44 UTC

336 points

93 comments 26 min read LW link 2 reviews

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds

5 Oct 2023 21:01 UTC

289 points

22 comments 2 min read LW link 1 review

(transformer-circuits.pub)

Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide

22 Jul 2024 18:45 UTC

118 points

20 comments 12 min read LW link

Residual stream norms grow exponentially over the forward pass

StefanHex and TurnTrout

7 May 2023 0:46 UTC

79 points

24 comments 9 min read LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey

3 Apr 2024 12:34 UTC

97 points

23 comments 22 min read LW link

A transparency and interpretability tech tree

evhub

16 Jun 2022 23:44 UTC

163 points

11 comments 18 min read LW link 1 review

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Taras Kutsyk, Tommaso Mencattini and Ciprian Florea

29 Sep 2024 19:37 UTC

28 points

8 comments 25 min read LW link

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

30 May 2023 16:17 UTC

226 points

11 comments 8 min read LW link

How Interpretability can be Impactful

Connall Garrod

18 Jul 2022 0:06 UTC

19 points

0 comments 37 min read LW link

Interpreting Neural Networks through the Polytope Lens

Sid Black, Lee Sharkey, Connor Leahy, beren, CRG, merizian, Eric Winsor and Dan Braun

23 Sep 2022 17:58 UTC

149 points

29 comments 33 min read LW link

ParaScopes: Do Language Models Plan the Upcoming Paragraph?

NickyP

21 Feb 2025 16:50 UTC

41 points

2 comments 20 min read LW link

How to use and interpret activation patching

StefanHex and Neel Nanda

24 Apr 2024 8:35 UTC

16 points

7 comments 19 min read LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley

5 Jan 2024 8:46 UTC

37 points

4 comments 2 min read LW link

Transparency and AGI safety

jylin04

11 Jan 2021 18:51 UTC

54 points

12 comments 30 min read LW link

The ‘strong’ feature hypothesis could be wrong

lewis smith

2 Aug 2024 14:33 UTC

235 points

29 comments 17 min read LW link 1 review

SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow and mwatkins

5 Feb 2023 22:02 UTC

677 points

208 comments 12 min read LW link 1 review

Ideation and Trajectory Modelling in Language Models

NickyP

5 Oct 2023 19:21 UTC

16 points

2 comments 10 min read LW link

Transformer Circuits

evhub

22 Dec 2021 21:09 UTC

145 points

4 comments 3 min read LW link

(transformer-circuits.pub)

Takeaways From 3 Years Working In Machine Learning

George3d6

8 Apr 2022 17:14 UTC

35 points

10 comments 11 min read LW link

(www.epistem.ink)

SAE reconstruction errors are (empirically) pathological

wesg

29 Mar 2024 16:37 UTC

108 points

16 comments 8 min read LW link

Attribution-based parameter decomposition

Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel and Lee Sharkey

25 Jan 2025 13:12 UTC

109 points

21 comments 4 min read LW link

(publications.apolloresearch.ai)

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom

2 Feb 2024 6:54 UTC

103 points

37 comments 15 min read LW link

SAE regularization produces more interpretable models

Peter Lai and StefanHex

28 Jan 2025 20:02 UTC

21 points

7 comments 4 min read LW link

Interpreting the Learning of Deceit

RogerDearnaley

18 Dec 2023 8:12 UTC

32 points

14 comments 9 min read LW link

Steering GPT-2-XL by adding an activation vector

TurnTrout, Monte M, David Udell, lisathiergart and Ulisse Mini

13 May 2023 18:42 UTC

441 points

98 comments 50 min read LW link 1 review

What is Interpretability?

RobertKirk, Tomáš Gavenčiak and Ada Böhm

17 Mar 2020 20:23 UTC

39 points

1 comment 11 min read LW link

Machine Unlearning Evaluations as Interpretability Benchmarks

NickyP and Nandi

23 Oct 2023 16:33 UTC

33 points

2 comments 11 min read LW link

SAE feature geometry is outside the superposition hypothesis

jake_mendel

24 Jun 2024 16:07 UTC

229 points

18 comments 11 min read LW link 1 review

The Case for Radical Optimism about Interpretability

Quintin Pope

16 Dec 2021 23:38 UTC

66 points

16 comments 8 min read LW link 1 review

Actually, Othello-GPT Has A Linear Emergent World Representation

Neel Nanda

29 Mar 2023 22:13 UTC

214 points

26 comments 19 min read LW link

(neelnanda.io)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache and Marius Hobbhahn

20 May 2024 17:53 UTC

108 points

4 comments 3 min read LW link

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

lifelonglearner and Peter Hase

9 Apr 2021 19:19 UTC

142 points

17 comments 102 min read LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah

23 Dec 2023 2:44 UTC

109 points

12 comments 22 min read LW link 2 reviews

Comments on Anthropic’s Scaling Monosemanticity

Robert_AIZI

3 Jun 2024 12:15 UTC

98 points

8 comments 7 min read LW link

My tentative interpretability research agenda—topology matching.

Maxwell Clarke

8 Oct 2022 22:14 UTC

10 points

2 comments 4 min read LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy

12 May 2022 20:01 UTC

58 points

0 comments 59 min read LW link

The Plan − 2022 Update

johnswentworth

1 Dec 2022 20:43 UTC

240 points

37 comments 8 min read LW link 1 review

SAE on activation differences

Santiago Aranguri, jacob_drori and Neel Nanda

30 Jun 2025 17:50 UTC

45 points

3 comments 5 min read LW link

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey

14 Jul 2022 16:59 UTC

119 points

15 comments 33 min read LW link

MATS Applications + Research Directions I’m Currently Excited About

Neel Nanda

6 Feb 2025 11:03 UTC

73 points

7 comments 8 min read LW link

Language Models Use Trigonometry to Do Addition

Subhash Kantamneni

5 Feb 2025 13:50 UTC

80 points

1 comment 10 min read LW link

A Comprehensive Mechanistic Interpretability Explainer & Glossary

Neel Nanda

21 Dec 2022 12:35 UTC

91 points

6 comments 2 min read LW link

(neelnanda.io)

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

Logan Riggs and Gurkenglas

3 Sep 2020 18:27 UTC

68 points

11 comments 2 min read LW link

LLMs Universally Learn a Feature Representing Token Frequency / Rarity

Sean Osier

30 Jun 2024 2:48 UTC

13 points

5 comments 6 min read LW link

(github.com)

Interpretability Will Not Reliably Find Deceptive AI

Neel Nanda

4 May 2025 16:32 UTC

341 points

69 comments 7 min read LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

18 Jul 2024 14:15 UTC

125 points

18 comments 18 min read LW link

Mechanistic Transparency for Machine Learning

DanielFilan

11 Jul 2018 0:34 UTC

55 points

9 comments 4 min read LW link

EIS XIV: Is mechanistic interpretability about to be practically useful?

scasper

11 Oct 2024 22:13 UTC

68 points

4 comments 7 min read LW link

200 COP in MI: Interpreting Algorithmic Problems

Neel Nanda

31 Dec 2022 19:55 UTC

33 points

2 comments 10 min read LW link

Can activation verbalizers surface an internal chain of thought?

oakhu and ryan_greenblatt

7 Jun 2026 4:24 UTC

122 points

0 comments 16 min read LW link

[Proposal] Method of locating useful subnets in large models

Quintin Pope

13 Oct 2021 20:52 UTC

9 points

0 comments 2 min read LW link

Mechanistic Anomaly Detection Research Update

Nora Belrose and David Johnston

6 Aug 2024 10:33 UTC

11 points

0 comments 1 min read LW link

(blog.eleuther.ai)

Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

256 points

96 comments 10 min read LW link 1 review

Deep learning models might be secretly (almost) linear

beren

24 Apr 2023 18:43 UTC

117 points

29 comments 4 min read LW link

Basic facts about language models during training

beren

21 Feb 2023 11:46 UTC

103 points

15 comments 18 min read LW link

Interpreting and Steering Features in Images

Gytis Daujotas

20 Jun 2024 18:33 UTC

67 points

6 comments 5 min read LW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry

29 Apr 2024 20:57 UTC

94 points

9 comments 11 min read LW link

Introduction to inaccessible information

Ryan Kidd

9 Dec 2021 1:28 UTC

27 points

6 comments 8 min read LW link

Extracting and Evaluating Causal Direction in LLMs’ Activations

Fabien Roger and simeon_c

14 Dec 2022 14:33 UTC

29 points

5 comments 11 min read LW link

Theories of impact for Science of Deep Learning

Marius Hobbhahn

1 Dec 2022 14:39 UTC

25 points

0 comments 11 min read LW link

Compact Proofs of Model Performance via Mechanistic Interpretability

LawrenceC, rajashree, Adrià Garriga-alonso and Jason Gross

24 Jun 2024 19:27 UTC

104 points

4 comments 8 min read LW link

(arxiv.org)

LLM Modularity: The Separability of Capabilities in Large Language Models

NickyP

26 Mar 2023 21:57 UTC

103 points

3 comments 41 min read LW link

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders

Logan Riggs

5 Jul 2023 16:49 UTC

60 points

1 comment 7 min read LW link

Is Interpretability All We Need?

RogerDearnaley

14 Nov 2023 5:31 UTC

2 points

1 comment 1 min read LW link

Should we publish mechanistic interpretability research?

Marius Hobbhahn and LawrenceC

21 Apr 2023 16:19 UTC

106 points

41 comments 13 min read LW link

[Question] Can you MRI a deep learning model?

Yair Halberstadt

13 Jun 2022 13:43 UTC

3 points

3 comments 1 min read LW link

Article Review: Google’s AlphaTensor

Robert_AIZI

12 Oct 2022 18:04 UTC

8 points

4 comments 10 min read LW link

A New Class of Glitch Tokens—BPE Subtoken Artifacts (BSA)

Lao Mein

20 Sep 2024 13:13 UTC

37 points

8 comments 5 min read LW link

Proof-of-Concept Debugger for a Small LLM

Peter Lai and StefanHex

17 Mar 2025 22:27 UTC

27 points

0 comments 11 min read LW link

Anthropic & Dario’s dream

Simon Lermen

8 Nov 2025 1:19 UTC

55 points

1 comment 5 min read LW link

[Linkpost] Interpretability Dreams

DanielFilan

24 May 2023 21:08 UTC

39 points

2 comments 2 min read LW link

(transformer-circuits.pub)

Transformers Don’t Need LayerNorm at Inference Time: Implications for Interpretability

submarat, Joachim Schaeffer, Luca Baroni, galvsk and StefanHex

23 Jul 2025 14:55 UTC

31 points

0 comments 7 min read LW link

EIS V: Blind Spots In AI Safety Interpretability Research

scasper

16 Feb 2023 19:09 UTC

58 points

24 comments 10 min read LW link

Semantic Phonons: Lattice Vibrations in AI Internals

Lukas Bongartz

11 May 2026 4:04 UTC

15 points

0 comments 17 min read LW link

Tests of LLM introspection need to rule out causal bypassing

Adam Morris and Dillon Plunkett

28 Nov 2025 17:42 UTC

51 points

6 comments 4 min read LW link

Is Interpretability for Control or for Science?

James Enouen

28 Jul 2025 21:12 UTC

3 points

0 comments 3 min read LW link

Misrepresentation as a Barrier for Interp (Part I)

johnswentworth and Steve Petersen

29 Apr 2025 17:07 UTC

113 points

12 comments 7 min read LW link

SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4

AdamYedidia

15 Apr 2023 22:35 UTC

72 points

18 comments 6 min read LW link

Against blanket arguments against interpretability

Dmitry Vaintrob

22 Jan 2025 9:46 UTC

54 points

4 comments 7 min read LW link

Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

Logan Riggs and Jannik Brinkmann

15 Mar 2024 16:30 UTC

26 points

5 comments 4 min read LW link

Deep Learning is cheap Solomonoff induction?

Lucius Bushnaq, Kaarel and Dmitry Vaintrob

7 Dec 2024 11:00 UTC

46 points

1 comment 17 min read LW link

Self-explaining SAE features

Dmitrii Kharlapenko, neverix, Neel Nanda and Arthur Conmy

5 Aug 2024 22:20 UTC

62 points

13 comments 10 min read LW link

Apply for the 2023 Developmental Interpretability Conference!

Stan van Wingerden, Alexander Gietelink Oldenziel, Jesse Hoogland and Daniel Murfet

25 Aug 2023 7:12 UTC

33 points

0 comments 2 min read LW link

Intervening in the Residual Stream

MadHatter

22 Feb 2023 6:29 UTC

30 points

1 comment 9 min read LW link

AXRP Episode 21 - Interpretability for Engineers with Stephen Casper

DanielFilan

2 May 2023 0:50 UTC

12 points

1 comment 66 min read LW link

The Computational Complexity of Circuit Discovery for Inner Interpretability

Bogdan Ionut Cirstea

17 Oct 2024 13:18 UTC

11 points

2 comments 1 min read LW link

(arxiv.org)

Progress Report 1: interpretability experiments & learning, testing compression hypotheses

Nathan Helm-Burger

22 Mar 2022 20:12 UTC

11 points

0 comments 2 min read LW link

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg

12 Feb 2022 19:13 UTC

5 points

1 comment 7 min read LW link

Rational Animations’ intro to mechanistic interpretability

Writer

14 Jun 2024 16:10 UTC

45 points

1 comment 11 min read LW link

(youtu.be)

Difficulty classes for alignment properties

Jozdien

20 Feb 2024 9:08 UTC

34 points

5 comments 2 min read LW link

Can we efficiently explain model behaviors?

paulfchristiano

16 Dec 2022 19:40 UTC

64 points

3 comments 9 min read LW link

(ai-alignment.com)

World-Model Interpretability Is All We Need

Thane Ruthenis

14 Jan 2023 19:37 UTC

36 points

22 comments 21 min read LW link

Anthropic announces interpretability advances. How much does this advance alignment?

Seth Herd

21 May 2024 22:30 UTC

49 points

4 comments 3 min read LW link

(www.anthropic.com)

Causal Graphs of GPT-2-Small’s Residual Stream

David Udell

9 Jul 2024 22:06 UTC

53 points

7 comments 7 min read LW link

Mapping the semantic void: Strange goings-on in GPT embedding spaces

mwatkins

14 Dec 2023 13:10 UTC

115 points

31 comments 14 min read LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda

7 Jul 2024 17:39 UTC

146 points

17 comments 25 min read LW link 1 review

Why did ChatGPT say that? Prompt engineering and more, with PIZZA.

Jessica Rumbelow

3 Aug 2024 12:07 UTC

43 points

2 comments 4 min read LW link

Scalable End-to-End Interpretability

jsteinhardt

18 Dec 2025 22:37 UTC

121 points

3 comments 3 min read LW link

On Developing a Mathematical Theory of Interpretability

carboniferous_umbraculum

9 Feb 2023 1:45 UTC

64 points

8 comments 6 min read LW link

Apollo Research 1-year update

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke and rusheb

29 May 2024 17:44 UTC

93 points

0 comments 7 min read LW link

fMRI LIKE APPROACH TO AI ALIGNMENT / DECEPTIVE BEHAVIOUR

Escaque 66

11 Jul 2023 17:17 UTC

−1 points

3 comments 2 min read LW link

[Linkpost] Play with SAEs on Llama 3

Tom McGrath, Eric Ho and Dan Balsam

25 Sep 2024 22:35 UTC

41 points

2 comments 1 min read LW link

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought

Riya Tyagi, daria, Arthur Conmy and Neel Nanda

13 Jan 2026 20:40 UTC

52 points

0 comments 18 min read LW link

AXRP Episode 40 - Jason Gross on Compact Proofs and Interpretability

DanielFilan

28 Mar 2025 18:40 UTC

26 points

0 comments 89 min read LW link

200 COP in MI: Looking for Circuits in the Wild

Neel Nanda

29 Dec 2022 20:59 UTC

16 points

5 comments 13 min read LW link

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Erik Jenner

4 Jun 2024 15:50 UTC

121 points

14 comments 13 min read LW link

Understanding LLMs: Insights from Mechanistic Interpretability

Stephen McAleese

30 Aug 2025 16:50 UTC

45 points

2 comments 30 min read LW link

AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

DanielFilan

24 Aug 2024 22:30 UTC

21 points

0 comments 74 min read LW link

QAPR 3: interpretability-guided training of neural nets

Quintin Pope

28 Sep 2022 16:02 UTC

58 points

2 comments 10 min read LW link

Three ways interpretability could be impactful

Arthur Conmy

18 Sep 2023 1:02 UTC

47 points

8 comments 4 min read LW link

Machinic Psychopharmacology: Do LLMs Self-Medicate?

Sid Black and Joseph Bloom

10 Jun 2026 14:15 UTC

124 points

10 comments 23 min read LW link

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart

23 Apr 2024 14:09 UTC

44 points

4 comments 9 min read LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy

17 Jul 2023 1:41 UTC

56 points

1 comment 7 min read LW link

Desiderata for an AI

Nathan Helm-Burger

19 Jul 2023 16:18 UTC

9 points

0 comments 4 min read LW link

Physics of Language models (part 2.1)

Nathan Helm-Burger

19 Sep 2024 16:48 UTC

9 points

2 comments 1 min read LW link

(youtu.be)

Sparse Autoencoders for Single-Cell Models

Ihor Kendiukhov

12 Apr 2026 16:07 UTC

36 points

2 comments 6 min read LW link

What Makes A Good Measurement Device?

johnswentworth

24 Aug 2022 22:45 UTC

38 points

7 comments 2 min read LW link

AI alignment as a translation problem

Roman Leventov

5 Feb 2024 14:14 UTC

23 points

2 comments 3 min read LW link

200 COP in MI: Techniques, Tooling and Automation

Neel Nanda

6 Jan 2023 15:08 UTC

13 points

0 comments 15 min read LW link

Mechanistic Interpretability Quickstart Guide

Neel Nanda

31 Jan 2023 16:35 UTC

42 points

3 comments 6 min read LW link

(www.neelnanda.io)

Neural Networks learn Bloom Filters

Alex Gibson

9 May 2026 20:32 UTC

57 points

1 comment 12 min read LW link

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah and Vlad Mikulik

20 Jul 2023 10:50 UTC

44 points

3 comments 2 min read LW link

(arxiv.org)

Making sense of parameter-space decomposition

Malmesbury

27 Sep 2025 17:37 UTC

54 points

0 comments 19 min read LW link

[Linkpost]Transformer-Based LM Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens

Curtis Huebner

4 May 2023 17:16 UTC

10 points

1 comment 1 min read LW link

(arxiv.org)

Mech Interp Challenge: January—Deciphering the Caesar Cipher Model

CallumMcDougall

1 Jan 2024 18:03 UTC

17 points

0 comments 3 min read LW link

Hedonic Loops and Taming RL

beren

19 Jul 2023 15:12 UTC

20 points

14 comments 9 min read LW link

Neel Nanda on the Mechanistic Interpretability Researcher Mindset

Michaël Trazzi

21 Sep 2023 19:47 UTC

37 points

1 comment 3 min read LW link

(theinsideview.ai)

More findings on Memorization and double descent

Marius Hobbhahn

1 Feb 2023 18:26 UTC

53 points

2 comments 19 min read LW link

New OpenAI Paper—Language models can explain neurons in language models

MrThink

10 May 2023 7:46 UTC

47 points

14 comments 1 min read LW link

Paper club: He et al. on modular arithmetic (part I)

Dmitry Vaintrob

13 Jan 2025 11:18 UTC

14 points

0 comments 8 min read LW link

The generalization phase diagram

Dmitry Vaintrob

26 Jan 2025 20:30 UTC

28 points

2 comments 16 min read LW link

Relaxed adversarial training for inner alignment

evhub

10 Sep 2019 23:03 UTC

69 points

27 comments 27 min read LW link

Transparency Trichotomy

Mark Xu

28 Mar 2021 20:26 UTC

25 points

2 comments 7 min read LW link

Open problems in activation engineering

TurnTrout, woog, lisathiergart, Monte M and Ulisse Mini

24 Jul 2023 19:46 UTC

51 points

2 comments 1 min read LW link

(coda.io)

How Do Selection Theorems Relate To Interpretability?

johnswentworth

9 Jun 2022 19:39 UTC

60 points

14 comments 3 min read LW link

White Box Control at UK AISI—Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood and Alan Cooney

10 Jul 2025 13:37 UTC

81 points

10 comments 18 min read LW link

SLT for AI Safety

Jesse Hoogland

1 Jul 2025 4:52 UTC

78 points

0 comments 3 min read LW link

Logits, log-odds, and loss for parallel circuits

Dmitry Vaintrob

20 Jan 2025 9:56 UTC

57 points

4 comments 11 min read LW link

Progress Report 6: get the tool working

Nathan Helm-Burger

10 Jun 2022 11:18 UTC

4 points

0 comments 2 min read LW link

How Far Apart Does a Model Think Its Tokens Are?

Brendan Long

7 Jun 2026 20:20 UTC

47 points

9 comments 10 min read LW link

(www.brendanlong.com)

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter

25 Apr 2023 13:45 UTC

9 points

1 comment 15 min read LW link

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow

6 Mar 2023 16:16 UTC

104 points

12 comments 1 min read LW link

Neuron Activations to CLIP Embeddings: Geometry of Linear Combinations in Latent Space

Roman Malov

3 Feb 2025 10:30 UTC

5 points

0 comments 2 min read LW link

Transformer Dynamics: a neuro-inspired approach to MechInterp

guitchounts and jfernando

22 Feb 2025 21:33 UTC

11 points

0 comments 5 min read LW link

Attention Output SAEs Improve Circuit Analysis

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

21 Jun 2024 12:56 UTC

33 points

3 comments 19 min read LW link

The conceptual Doppelgänger problem

TsviBT

12 Feb 2023 17:23 UTC

19 points

5 comments 4 min read LW link

Lurking in the Noise

J Bostock

25 Jun 2025 13:36 UTC

37 points

2 comments 4 min read LW link

Review of AI Alignment Progress

PeterMcCluskey

7 Feb 2023 18:57 UTC

72 points

32 comments 7 min read LW link

(bayesianinvestor.com)

Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI

5 Mar 2024 13:55 UTC

62 points

24 comments 10 min read LW link

(aizi.substack.com)

SAEs Discover Meaningful Features in the IOI Task

Alex Makelov, Georg Lange and Neel Nanda

5 Jun 2024 23:48 UTC

15 points

2 comments 10 min read LW link

Geometry of Features in Mechanistic Interpretability

Gunnar Carlsson

14 Mar 2025 19:11 UTC

16 points

0 comments 8 min read LW link

200 COP in MI: Exploring Polysemanticity and Superposition

Neel Nanda

3 Jan 2023 1:52 UTC

34 points

6 comments 16 min read LW link

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Logan Riggs, Hoagy, Aidan Ewart and Robert_AIZI

21 Sep 2023 15:30 UTC

161 points

8 comments 5 min read LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper

5 Dec 2023 16:48 UTC

128 points

30 comments 13 min read LW link

[Question] Inscrutability was always inevitable, right?

Steven Byrnes

6 Aug 2025 21:57 UTC

101 points

33 comments 2 min read LW link

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

Francisco Ferreira da Silva and StefanHex

20 Mar 2026 21:09 UTC

39 points

2 comments 6 min read LW link

200 COP in MI: Studying Learned Features in Language Models

Neel Nanda

19 Jan 2023 3:48 UTC

24 points

2 comments 30 min read LW link

Subsets and quotients in interpretability

Erik Jenner

2 Dec 2022 23:13 UTC

26 points

1 comment 7 min read LW link

Neural network polytopes (Colab notebook)

Zach Furman

21 Apr 2023 22:42 UTC

11 points

0 comments 1 min read LW link

(colab.research.google.com)

Behavioural statistics for a maze-solving agent

peligrietzer and TurnTrout

20 Apr 2023 22:26 UTC

46 points

11 comments 10 min read LW link

Mechanistic Interpretability Via Learning Differential Equations: AI Safety Camp Project Intermediate Report.

Valentin2026, ayoakin, Eduard Kovalets, tz3r0n4r, Soumyadeep Bose, Utkarsh Priyadarshi, Varun Piram and Axel Ahlqvist

8 May 2025 14:45 UTC

8 points

0 comments 7 min read LW link

Mech Interp Challenge: November—Deciphering the Cumulative Sum Model

CallumMcDougall

2 Nov 2023 17:10 UTC

18 points

2 comments 2 min read LW link

[ASoT] Natural abstractions and AlphaZero

Ulisse Mini

10 Dec 2022 17:53 UTC

33 points

1 comment 1 min read LW link

(arxiv.org)

Solving the whole AGI control problem, version 0.0001

Steven Byrnes

8 Apr 2021 15:14 UTC

64 points

7 comments 26 min read LW link

A Walkthrough of In-Context Learning and Induction Heads (w/ Charles Frye) Part 1 of 2

Neel Nanda

22 Nov 2022 17:12 UTC

20 points

0 comments 1 min read LW link

(www.youtube.com)

A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly.

Michael Soareverix

8 Sep 2022 15:20 UTC

2 points

2 comments 2 min read LW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda

15 Dec 2021 23:44 UTC

129 points

9 comments 15 min read LW link

Sparsity and interpretability?

Ada Böhm, RobertKirk and Tomáš Gavenčiak

1 Jun 2020 13:25 UTC

41 points

3 comments 7 min read LW link

[Paper] Automated Feature Labeling with Token-Space Gradient Descent

Wuschel Schulz

30 Apr 2025 10:22 UTC

4 points

0 comments 4 min read LW link

EIS XV: A New Proof of Concept for Useful Interpretability

scasper

17 Mar 2025 20:05 UTC

30 points

2 comments 3 min read LW link

Interpretability

abergal and Nick_Beckstead

29 Oct 2021 7:28 UTC

61 points

13 comments 12 min read LW link

A Research Bet on SAE-like Expert Architectures

Nathan Helm-Burger

16 Apr 2026 19:59 UTC

24 points

2 comments 2 min read LW link

A Mystery About High Dimensional Concept Encoding

Fabien Roger

3 Nov 2022 17:05 UTC

46 points

13 comments 7 min read LW link

How ARENA course material gets made

CallumMcDougall

2 Jul 2024 18:04 UTC

41 points

2 comments 7 min read LW link

[Book] Interpretable Machine Learning: A Guide for Making Black Box Models Explainable

Esben Kran

31 Oct 2022 11:38 UTC

20 points

1 comment 1 min read LW link

(christophm.github.io)

Mechanistic Interpretability Workshop Happening at ICML 2024!

Neel Nanda, LawrenceC and fbarez

3 May 2024 1:18 UTC

48 points

6 comments 1 min read LW link

Why I’m bearish on mechanistic interpretability: the shards are not in the network

tailcalled

13 Sep 2024 17:09 UTC

24 points

40 comments 1 min read LW link

Reinforcement learning scaling might incentivise hidden reasoning architectures for AI

Oliver Sourbut

10 May 2026 15:30 UTC

19 points

5 comments 6 min read LW link

(www.oliversourbut.net)

Neural net / decision tree hybrids: a potential path toward bridging the interpretability gap

Nathan Helm-Burger

23 Sep 2021 0:38 UTC

21 points

2 comments 12 min read LW link

AXRP Episode 23 - Mechanistic Anomaly Detection with Mark Xu

DanielFilan

27 Jul 2023 1:50 UTC

22 points

0 comments 72 min read LW link

Finding Sparse Linear Connections between Features in LLMs

Logan Riggs, Sam Mitchell and Adam Kaufman

9 Dec 2023 2:27 UTC

70 points

5 comments 10 min read LW link

One Way to Think About ML Transparency

Matthew Barnett

2 Sep 2019 23:27 UTC

26 points

28 comments 5 min read LW link

Paper: Transformers learn in-context by gradient descent

LawrenceC

16 Dec 2022 11:10 UTC

28 points

11 comments 2 min read LW link

(arxiv.org)

Interpreting OpenAI’s Whisper

EllenaR

24 Sep 2023 17:53 UTC

116 points

13 comments 7 min read LW link

[Linkpost] Interpreting Language Model Parameters

Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors and Lee Sharkey

5 May 2026 17:37 UTC

162 points

2 comments 2 min read LW link

(www.goodfire.ai)

Video/animation: Neel Nanda explains what mechanistic interpretability is

DanielFilan

22 Feb 2023 22:42 UTC

24 points

7 comments 1 min read LW link

(youtu.be)

Paper: Superposition, Memorization, and Double Descent (Anthropic)

LawrenceC

5 Jan 2023 17:54 UTC

53 points

11 comments 1 min read LW link

(transformer-circuits.pub)

Assessment of AI safety agendas: think about the downside risk

Roman Leventov

19 Dec 2023 9:00 UTC

13 points

1 comment 1 min read LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

80 points

10 comments8 min readLW link

The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

Arthur Conmy and Neel Nanda

24 Feb 2025 2:17 UTC

48 points

1 comment7 min readLW link

[Question] Previous Work on Recreating Neural Network Input from Intermediate Layer Activations

bglass12 Oct 2022 19:28 UTC

1 point

3 comments1 min readLW link

How Do Induction Heads Actually Work in Transformers With Finite Capacity?

Fabien Roger23 Mar 2023 9:09 UTC

28 points

0 comments5 min readLW link

Really Strong Features Found in Residual Stream

Logan Riggs8 Jul 2023 19:40 UTC

69 points

6 comments2 min readLW link

Precursor checking for deceptive alignment

evhub3 Aug 2022 22:56 UTC

24 points

0 comments14 min readLW link

Comments on OpenPhil’s Interpretability RFP

paulfchristiano5 Nov 2021 22:36 UTC

91 points

5 comments7 min readLW link

Learning Multi-Level Features with Matryoshka SAEs

Bart Bussmann, Patrick Leask and Neel Nanda

19 Dec 2024 15:59 UTC

46 points

6 comments11 min readLW link

Downstream applications as validation of interpretability progress

Sam Marks31 Mar 2025 1:35 UTC

112 points

3 comments7 min readLW link

[Question] Does the Universal Geometry of Embeddings paper have big implications for interpretability?

Evan R. Murphy26 May 2025 18:20 UTC

43 points

6 comments1 min readLW link

Paper Replication Walkthrough: Reverse-Engineering Modular Addition

Neel Nanda12 Mar 2023 13:25 UTC

18 points

0 comments1 min readLW link

(neelnanda.io)

“Features” aren’t always the true computational primitives of a model, but that might be fine anyways

LawrenceC2 Feb 2026 18:41 UTC

18 points

0 comments5 min readLW link

In (highly contingent!) defense of interpretability-in-the-loop ML training

Steven Byrnes6 Feb 2026 16:32 UTC

85 points

11 comments3 min readLW link

(OLD) An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Neel Nanda18 Oct 2022 21:08 UTC

72 points

5 comments12 min readLW link

(www.neelnanda.io)

Durkon, an open-source tool for Inherently Interpretable Modelling

abstractapplic24 Dec 2022 1:49 UTC

47 points

0 comments4 min readLW link

Glitch Token Catalog - (Almost) a Full Clear

Lao Mein21 Sep 2024 12:22 UTC

38 points

3 comments37 min readLW link

Results from the Turing Seminar hackathon

Charbel-Raphaël, jeanne_ and Léo Dana

7 Dec 2023 14:50 UTC

35 points

1 comment5 min readLW link

Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

johnswentworth8 Aug 2022 18:05 UTC

150 points

13 comments3 min readLW link

Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo

Neel Nanda16 Jul 2023 22:02 UTC

67 points

15 comments1 min readLW link

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition

cmathw, Dennis Akar and Lee Sharkey

8 Apr 2024 11:14 UTC

42 points

4 comments15 min readLW link

Sparse trinary weighted RNNs as a path to better language model interpretability

Am8ryllis17 Sep 2022 19:48 UTC

19 points

13 comments3 min readLW link

Neuronpedia

Johnny Lin26 Jul 2023 16:29 UTC

135 points

51 comments2 min readLW link

(neuronpedia.org)

Use more text than one token to avoid neuralese

Jude Stiel13 Feb 2026 21:09 UTC

10 points

4 comments1 min readLW link

Cross-Layer Feature Alignment and Steering in Large Language Model

dlaptev8 Feb 2025 20:18 UTC

9 points

0 comments6 min readLW link

Solving Interpretability Week

Logan Riggs13 Dec 2021 17:09 UTC

11 points

5 comments1 min readLW link

Self-interpretability: LLMs can describe complex internal processes that drive their decisions

Adam Morris and Dillon Plunkett

14 Nov 2025 0:18 UTC

12 points

0 comments4 min readLW link

“What the hell is a representation, anyway?” | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents

IwanWilliams9 Jun 2024 14:19 UTC

9 points

1 comment4 min readLW link

“The Urgency of Interpretability” (Dario Amodei)

RobertM27 Apr 2025 4:31 UTC

31 points

23 comments3 min readLW link

(www.darioamodei.com)

Circuits in Superposition 2: Now with Less Wrong Math

Linda Linsefors and Lucius Bushnaq

30 Jun 2025 10:25 UTC

73 points

0 comments22 min readLW link

A circuit for Python docstrings in a 4-layer attention-only transformer

StefanHex and Jett Janiak

20 Feb 2023 19:35 UTC

96 points

8 comments21 min readLW link

HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix

Jaehyuk Lim, Kanishk Tantia and Sinem

11 Oct 2024 23:06 UTC

8 points

2 comments10 min readLW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

11 Dec 2024 6:30 UTC

82 points

6 comments2 min readLW link

(www.neuronpedia.org)

On the functional self of LLMs

eggsyntax7 Jul 2025 15:39 UTC

124 points

38 comments8 min readLW link

Activation additions in a small residual network

Garrett Baker22 May 2023 20:28 UTC

22 points

4 comments3 min readLW link

How can Interpretability help Alignment?

RobertKirk and Tomáš Gavenčiak

23 May 2020 16:16 UTC

37 points

3 comments9 min readLW link

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper21 May 2024 20:15 UTC

157 points

16 comments3 min readLW link

Extracting Performant Algorithms Using Mechanistic Interpretability

Ihor Kendiukhov14 Mar 2026 14:19 UTC

56 points

7 comments7 min readLW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks18 Apr 2024 16:17 UTC

116 points

10 comments12 min readLW link

Circuits in Superposition: Compressing many small neural networks into one

Lucius Bushnaq and jake_mendel

14 Oct 2024 13:06 UTC

131 points

9 comments13 min readLW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

cloud, Jacob G-W, Evzen, Joseph Miller and TurnTrout

6 Dec 2024 22:19 UTC

179 points

16 comments11 min readLW link 1 review

(arxiv.org)

Language Models are a Potentially Safe Path to Human-Level AGI

Nadav Brandes20 Apr 2023 0:40 UTC

28 points

7 comments8 min readLW link 1 review

Some OthelloGPT Circuits

Alfred Wong15 Apr 2025 18:41 UTC

7 points

0 comments7 min readLW link

Mech Interp Challenge: October—Deciphering the Sorted List Model

CallumMcDougall3 Oct 2023 10:57 UTC

23 points

0 comments3 min readLW link

Garrabrant and Shah on human modeling in AGI

Rob Bensinger4 Aug 2021 4:35 UTC

60 points

10 comments47 min readLW link

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell23 Sep 2023 19:16 UTC

42 points

7 comments34 min readLW link

Automating LLM Auditing with Developmental Interpretability

htlou and evhub

4 Sep 2024 15:50 UTC

19 points

0 comments3 min readLW link

[Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small

CallumMcDougall, Arthur Conmy, Tom McGrath and Neel Nanda

13 Oct 2023 18:32 UTC

82 points

4 comments8 min readLW link

Multi-dimensional rewards for AGI interpretability and control

Steven Byrnes4 Jan 2021 3:08 UTC

19 points

8 comments10 min readLW link

How To Become A Mechanistic Interpretability Researcher

Neel Nanda2 Sep 2025 23:38 UTC

146 points

12 comments55 min readLW link

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

likenneth11 Jun 2023 5:38 UTC

195 points

4 comments1 min readLW link

(arxiv.org)

Let’s buy out Cyc, for use in AGI interpretability systems?

Steven Byrnes7 Dec 2021 20:46 UTC

50 points

10 comments2 min readLW link

Othello-GPT: Reflections on the Research Process

Neel Nanda29 Mar 2023 22:13 UTC

38 points

0 comments15 min readLW link

(neelnanda.io)

200 COP in MI: Image Model Interpretability

Neel Nanda8 Jan 2023 14:53 UTC

18 points

3 comments6 min readLW link

Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?

Neel Nanda1 Nov 2022 23:56 UTC

69 points

16 comments1 min readLW link

(youtu.be)

The case for becoming a black-box investigator of language models

Buck6 May 2022 14:35 UTC

126 points

20 comments3 min readLW link

Dropout can create a privileged basis in the ReLU output model.

lewis smith28 Apr 2023 1:59 UTC

24 points

3 comments5 min readLW link

Othello-GPT: Future Work I Am Excited About

Neel Nanda29 Mar 2023 22:13 UTC

48 points

2 comments33 min readLW link

(neelnanda.io)

AXRP Episode 38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

DanielFilan20 Jan 2025 0:40 UTC

9 points

0 comments16 min readLW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett Baker26 Aug 2022 18:26 UTC

120 points

48 comments1 min readLW link

Why I stopped being into basin broadness

tailcalled25 Apr 2024 20:47 UTC

16 points

3 comments2 min readLW link

Investigating encoded reasoning in LLMs

lbernick, Kat Dearstyne and jameselmore

9 Mar 2026 21:05 UTC

11 points

0 comments6 min readLW link

AXRP Episode 41 - Lee Sharkey on Attribution-based Parameter Decomposition

DanielFilan3 Jun 2025 3:40 UTC

28 points

1 comment61 min readLW link

Addendum: basic facts about language models during training

beren6 Mar 2023 19:24 UTC

22 points

2 comments5 min readLW link

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob Bensinger5 Mar 2021 23:43 UTC

145 points

13 comments26 min readLW link

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

TurnTrout, peligrietzer and lisathiergart

31 Mar 2023 19:20 UTC

101 points

17 comments11 min readLW link

Paper: Understanding and Controlling a Maze-Solving Policy Network

TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M and lisathiergart

13 Oct 2023 1:38 UTC

70 points

0 comments1 min readLW link

(arxiv.org)

Mech Interp Project Advising Call: Memorisation in GPT-2 Small

Neel Nanda4 Feb 2023 14:17 UTC

7 points

0 comments1 min readLW link

Can Reasoning Models Avoid the Most Forbidden Technique?

Brendan Long17 May 2025 23:26 UTC

9 points

9 comments3 min readLW link

(www.brendanlong.com)

Thoughts on Toy Models of Superposition

CorrigibleAgent2 Feb 2025 13:52 UTC

5 points

2 comments9 min readLW link

Here’s 18 Applications of Deception Probes

Cleo Nardo, Avi Parrack and jordinne

28 Aug 2025 18:59 UTC

45 points

0 comments22 min readLW link

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics

DanielFilan29 Sep 2024 5:50 UTC

26 points

0 comments55 min readLW link

How useful is mechanistic interpretability?

ryan_greenblatt, Neel Nanda, Buck and habryka

1 Dec 2023 2:54 UTC

177 points

57 comments25 min readLW link

Stagewise Development in Neural Networks

Jesse Hoogland, Liam Carroll and Daniel Murfet

20 Mar 2024 19:54 UTC

96 points

1 comment11 min readLW link

The Translucent Thoughts Hypotheses and Their Implications

Fabien Roger9 Mar 2023 16:30 UTC

142 points

7 comments19 min readLW link

SHIFT relies on token-level features to de-bias Bias in Bios probes

Tim Hua19 Mar 2025 21:29 UTC

39 points

2 comments6 min readLW link

AGI-Automated Interpretability is Suicide

__RicG__10 May 2023 14:20 UTC

27 points

33 comments7 min readLW link

Refusal mechanisms: initial experiments with Llama-2-7b-chat

Andy Arditi and Oscar Obeso

8 Dec 2023 17:08 UTC

83 points

7 comments7 min readLW link

Knowledge Neurons in Pretrained Transformers

evhub17 May 2021 22:54 UTC

100 points

7 comments2 min readLW link

(arxiv.org)

AXRP Episode 38.2 - Jesse Hoogland on Singular Learning Theory

DanielFilan27 Nov 2024 6:30 UTC

34 points

0 comments10 min readLW link

You can remove GPT2’s LayerNorm by fine-tuning for an hour

StefanHex8 Aug 2024 18:33 UTC

166 points

11 comments8 min readLW link

A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy & Alexandre Variengien)

Neel Nanda7 Nov 2022 22:39 UTC

30 points

15 comments3 min readLW link

(youtu.be)

Identifying semantic neurons, mechanistic circuits & interpretability web apps

Esben Kran and Neel Nanda

13 Apr 2023 11:59 UTC

18 points

0 comments8 min readLW link

Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models

chanind and Demian Till

30 Dec 2024 22:50 UTC

24 points

3 comments15 min readLW link

You’re Measuring Model Complexity Wrong

Jesse Hoogland and Stan van Wingerden

11 Oct 2023 11:46 UTC

94 points

17 comments13 min readLW link

LLM Social Autopilot

arhngl12 Feb 2026 18:59 UTC

1 point

0 comments10 min readLW link

[Linkpost] A survey on over 300 works about interpretability in deep networks

scasper12 Sep 2022 19:07 UTC

97 points

7 comments2 min readLW link

(arxiv.org)

Measuring Structure Development in Algorithmic Transformers

Micurie and Einar Urdshals

22 Aug 2024 8:38 UTC

56 points

4 comments11 min readLW link

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

139 points

39 comments27 min readLW link

Growing Bonsai Networks with RNNs

ameo7 Aug 2023 17:34 UTC

21 points

5 comments1 min readLW link

(cprimozic.net)

Literature Review of Text AutoEncoders

NickyP19 Feb 2025 21:54 UTC

22 points

5 comments8 min readLW link

Toy Models of Superposition

evhub21 Sep 2022 23:48 UTC

69 points

4 comments5 min readLW link 1 review

(transformer-circuits.pub)

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

NickyP, Peter S. Park and Stephen Fowler

16 Aug 2022 2:09 UTC

21 points

2 comments16 min readLW link

Mech Interp Challenge: September—Deciphering the Addition Model

CallumMcDougall13 Sep 2023 22:23 UTC

35 points

0 comments4 min readLW link

Simplex Progress Report—July 2025

Adam Shai, Paul Riechers, hrbigelow, Eric Alt and mntss

28 Jul 2025 21:58 UTC

112 points

3 comments15 min readLW link

How does GPT-3 spend its 175B parameters?

Robert_AIZI13 Jan 2023 19:21 UTC

41 points

14 comments6 min readLW link

(aizi.substack.com)

Looking for backdoors in Jane Street LLMs

Cipolla23 May 2026 0:06 UTC

16 points

0 comments14 min readLW link

A Walkthrough of A Mathematical Framework for Transformer Circuits

Neel Nanda25 Oct 2022 20:24 UTC

52 points

7 comments1 min readLW link

(www.youtube.com)

Wittgenstein and ML — parameters vs architecture

Cleo Nardo24 Mar 2023 4:54 UTC

44 points

9 comments5 min readLW link

Inner Alignment in Salt-Starved Rats

Steven Byrnes19 Nov 2020 2:40 UTC

137 points

41 comments11 min readLW link 2 reviews

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Senthooran Rajamanoharan, Tom Lieberum, nps29, Arthur Conmy, Vikrant Varma, János Kramár and Neel Nanda

19 Jul 2024 16:10 UTC

55 points

10 comments1 min readLW link

(storage.googleapis.com)

A Selection of Randomly Selected SAE Features

CallumMcDougall and Joseph Bloom

1 Apr 2024 9:09 UTC

109 points

2 comments4 min readLW link

[Question] How optimistic should we be about AI figuring out how to interpret itself?

oh5432125 Jul 2022 22:09 UTC

3 points

1 comment1 min readLW link

AI Transparency: Why it’s critical and how to obtain it.

Zohar Jackson14 Aug 2022 10:31 UTC

6 points

1 comment5 min readLW link

Exploring Concept-Specific Slices in Weight Matrices for Network Interpretability

DuncanFowler9 Jun 2023 16:39 UTC

1 point

0 comments6 min readLW link

An Introduction to Exemplar Partitioning for Mechanistic Interpretability

Jessica Rumbelow16 May 2026 3:58 UTC

69 points

7 comments11 min readLW link

(www.leap-labs.com)

Highlights: Wentworth, Shah, and Murphy on “Retargeting the Search”

RobertM14 Sep 2023 2:18 UTC

88 points

4 comments8 min readLW link

Search versus design

Alex Flint16 Aug 2020 16:53 UTC

110 points

40 comments36 min readLW link 1 review

Apollo Research is hiring evals and interpretability engineers & scientists

Marius Hobbhahn4 Aug 2023 10:54 UTC

25 points

0 comments2 min readLW link

[Intro to brain-like-AGI safety] 9. Takeaways from neuro 2/2: On AGI motivation

Steven Byrnes23 Mar 2022 12:48 UTC

56 points

17 comments21 min readLW link

AXRP Episode 19 - Mechanistic Interpretability with Neel Nanda

DanielFilan4 Feb 2023 3:00 UTC

45 points

0 comments117 min readLW link

Interpretability with Sparse Autoencoders (Colab exercises)

CallumMcDougall29 Nov 2023 12:56 UTC

83 points

9 comments4 min readLW link

Superposition is not “just” neuron polysemanticity

LawrenceC26 Apr 2024 23:22 UTC

71 points

4 comments13 min readLW link

Steelmanning heuristic arguments

Dmitry Vaintrob13 Apr 2025 1:09 UTC

77 points

1 comment17 min readLW link

Rotations in Superposition

Linda Linsefors and Lucius Bushnaq

15 Dec 2025 14:58 UTC

54 points

6 comments11 min readLW link

To be legible, evidence of misalignment probably has to be behavioral

ryan_greenblatt15 Apr 2025 18:14 UTC

58 points

19 comments3 min readLW link

[ASoT] Policy Trajectory Visualization

Ulisse Mini7 Feb 2023 0:13 UTC

9 points

2 comments1 min readLW link

Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy

Neel Nanda29 Aug 2023 22:07 UTC

36 points

1 comment1 min readLW link

(www.youtube.com)

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

johnswentworth4 Jun 2022 5:41 UTC

170 points

56 comments2 min readLW link 1 review

200 COP in MI: The Case for Analysing Toy Language Models

Neel Nanda28 Dec 2022 21:07 UTC

40 points

3 comments7 min readLW link

You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them

RobinHa10 Jun 2026 15:21 UTC

62 points

5 comments9 min readLW link

(robinhaselhorst.com)

An Ambitious Vision for Interpretability

leogao5 Dec 2025 22:57 UTC

175 points

8 comments4 min readLW link

SAE-VIS: Announcement Post

CallumMcDougall and Joseph Bloom

31 Mar 2024 15:30 UTC

74 points

8 comments1 min readLW link

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

11 Mar 2023 18:59 UTC

335 points

28 comments23 min readLW link

NLA Thought Anchors

Realmbird31 May 2026 23:38 UTC

10 points

3 comments4 min readLW link

More Recent Progress in the Theory of Neural Networks

jylin046 Oct 2022 16:57 UTC

82 points

6 comments4 min readLW link

A Barebones Guide to Mechanistic Interpretability Prerequisites

Neel Nanda24 Oct 2022 20:45 UTC

64 points

12 comments3 min readLW link

(neelnanda.io)

Finding gliders in the game of life

paulfchristiano1 Dec 2022 20:40 UTC

104 points

8 comments16 min readLW link

(ai-alignment.com)

Mechanism for feature learning in neural networks and backpropagation-free machine learning models

Trinley Goldenberg19 Mar 2024 14:55 UTC

8 points

1 comment1 min readLW link

(www.science.org)

Scientific Discovery in the Age of Artificial Intelligence

Jessica Rumbelow29 Jun 2025 20:45 UTC

42 points

3 comments10 min readLW link

Investigating task-specific prompts and sparse autoencoders for activation monitoring

Henk Tillman30 Apr 2025 17:09 UTC

23 points

0 comments1 min readLW link

(arxiv.org)

Training a Transformer to Compose One Step Per Layer (and Proving It)

Brendan Long26 Apr 2026 23:45 UTC

17 points

0 comments6 min readLW link

(www.brendanlong.com)

Interpreting Preference Models w/ Sparse Autoencoders

Logan Riggs and Jannik Brinkmann

1 Jul 2024 21:35 UTC

75 points

12 comments9 min readLW link

Ping pong computation in superposition

Alex Gibson29 Dec 2025 16:31 UTC

13 points

0 comments3 min readLW link

Conditional Importance in Toy Models of Superposition

CorrigibleAgent2 Feb 2025 20:35 UTC

9 points

4 comments10 min readLW link

200 COP in MI: Interpreting Reinforcement Learning

Neel Nanda10 Jan 2023 17:37 UTC

25 points

1 comment10 min readLW link

On polytopes

Dmitry Vaintrob25 Jan 2025 13:56 UTC

56 points

5 comments12 min readLW link

“Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger31 Oct 2022 21:26 UTC

51 points

25 comments2 min readLW link

Verification and Transparency

DanielFilan8 Aug 2019 1:50 UTC

35 points

6 comments2 min readLW link

(danielfilan.com)

The memorization-generalization spectrum and learning coefficients

Dmitry Vaintrob28 Jan 2025 16:53 UTC

17 points

0 comments10 min readLW link

Exciting New Interpretability Paper!

research_prime_space9 May 2023 16:39 UTC

12 points

1 comment1 min readLW link

(Not) Explaining GPT-2-Small Forward Passes with Edge-Level Autoencoder Circuits

David Udell, hrdkbhatnagar and JacksonKaunismaa

22 Jul 2025 20:36 UTC

23 points

0 comments6 min readLW link

Mech Interp Puzzle 2: Word2Vec Style Embeddings

Neel Nanda28 Jul 2023 0:50 UTC

41 points

4 comments2 min readLW link

AlgZoo: uninterpreted models with fewer than 1,500 parameters

Jacob_Hilton26 Jan 2026 17:30 UTC

181 points

7 comments10 min readLW link

(www.alignment.org)

200 COP in MI: Analysing Training Dynamics

Neel Nanda4 Jan 2023 16:08 UTC

16 points

0 comments14 min readLW link

Trusted monitoring, but with deception probes.

Avi Parrack, StefanHex and Cleo Nardo

23 Jul 2025 5:26 UTC

31 points

0 comments4 min readLW link

(arxiv.org)

QFT and neural nets: the basic idea

Dmitry Vaintrob24 Jan 2025 13:54 UTC

28 points

0 comments8 min readLW link

Announcing Human-aligned AI Summer School

Jan_Kulveit and Tomáš Gavenčiak

22 May 2024 8:55 UTC

51 points

0 comments1 min readLW link

(humanaligned.ai)

An Analytic Perspective on AI Alignment

DanielFilan1 Mar 2020 4:10 UTC

54 points

45 comments8 min readLW link

(danielfilan.com)

Analogies between Software Reverse Engineering and Mechanistic Interpretability

Neel Nanda and Itay Yona

26 Dec 2022 12:26 UTC

34 points

6 comments11 min readLW link

(www.neelnanda.io)

Mech interp is not pre-paradigmatic

Lee Sharkey10 Jun 2025 13:39 UTC

213 points

15 comments13 min readLW link

Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds

1a3orn4 Apr 2023 17:39 UTC

214 points

38 comments5 min readLW link 1 review

Mechanistic Interpretability of Biological Foundation Models

Ihor Kendiukhov20 Feb 2026 18:01 UTC

34 points

1 comment26 min readLW link

Concrete Steps to Get Started in Transformer Mechanistic Interpretability

Neel Nanda25 Dec 2022 22:21 UTC

58 points

7 comments12 min readLW link

(www.neelnanda.io)

Training on Documents About Monitoring Leads To CoT Obfuscation

Reilly Haskins, bilalchughtai and Josh Engels

18 Mar 2026 20:37 UTC

65 points

5 comments16 min readLW link

Reproducing steering against evaluation awareness in a large open-weight model

Thomas Read, Bronson Schoen, Santiago Aranguri and Joseph Bloom

10 Apr 2026 10:45 UTC

89 points

17 comments15 min readLW link

Swap and Scale

Stephen Fowler9 Sep 2022 22:41 UTC

17 points

3 comments1 min readLW link

SAEs you can See: Applying Sparse Autoencoders to Clustering

Robert_AIZI28 Oct 2024 14:48 UTC

27 points

0 comments10 min readLW link

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Neel Nanda, János Kramár, Tom Lieberum and Rohin Shah

18 Mar 2024 17:28 UTC

19 points

0 comments1 min readLW link

(arxiv.org)

More findings on maximal data dimension

Marius Hobbhahn2 Feb 2023 18:33 UTC

27 points

1 comment11 min readLW link

Shapley Value Attribution in Chain of Thought

leogao14 Apr 2023 5:56 UTC

106 points

7 comments4 min readLW link

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

13 Mar 2025 19:18 UTC

153 points

15 comments13 min readLW link

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, lewis smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda

25 Apr 2024 18:43 UTC

63 points

38 comments1 min readLW link

(arxiv.org)

Implementing activation steering

Annah5 Feb 2024 17:51 UTC

76 points

8 comments7 min readLW link

Attribution Patching: Activation Patching At Industrial Scale

Neel Nanda16 Mar 2023 21:44 UTC

45 points

10 comments58 min readLW link

(www.neelnanda.io)

Exploring SAE features in LLMs with definition trees and token lists

mwatkins4 Oct 2024 22:15 UTC

46 points

5 comments6 min readLW link

Tiny Mech Interp Projects: Emergent Positional Embeddings of Words

Neel Nanda18 Jul 2023 21:24 UTC

52 points

1 comment9 min readLW link

Visualizing Neural networks, how to blame the bias

Donald Hobson9 Jul 2022 15:52 UTC

7 points

1 comment6 min readLW link

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas and Owain_Evans

18 Dec 2025 20:21 UTC

154 points

11 comments8 min readLW link

(arxiv.org)

Dmitry’s Koan

Dmitry Vaintrob10 Jan 2025 4:27 UTC

44 points

8 comments22 min readLW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

73 points

0 comments3 min readLW link

[Question] Transformer Mech Interp: Any visualizations?

Joyee Chen18 Jan 2023 4:32 UTC

3 points

0 comments1 min readLW link

Polysemanticity and Capacity in Neural Networks

Buck, Adam Jermyn and Kshitij Sachan

7 Oct 2022 17:51 UTC

87 points

14 comments3 min readLW link

Beyond Gaussian: Language Model Representations and Distributions

Matt Levinson24 Nov 2024 1:53 UTC

6 points

1 comment5 min readLW link

[Research Update] Sparse Autoencoder features are bimodal

Robert_AIZI22 Jun 2023 13:15 UTC

24 points

1 comment5 min readLW link

(aizi.substack.com)

Pro or Average Joe? Do models infer our technical ability and can we control this judgement?

tobypullan12 Jan 2026 20:52 UTC

12 points

0 comments9 min readLW link

SimpleStories: A Better Synthetic Dataset and Tiny Models for Interpretability

Lennart Finke3 May 2025 14:04 UTC

16 points

0 comments1 min readLW link

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate

Jordan Taylor, Max H, Ed Fage, Thomas Read and Joseph Bloom

21 May 2026 14:52 UTC

83 points

0 comments6 min readLW link

(www.aisi.gov.uk)

Inside the mind of a superhuman Go model: How does Leela Zero read ladders?

Haoxing Du1 Mar 2023 1:47 UTC

159 points

8 comments30 min readLW link

Towards Developmental Interpretability

Jesse Hoogland, Alexander Gietelink Oldenziel, Daniel Murfet and Stan van Wingerden

12 Jul 2023 19:33 UTC

195 points

10 comments9 min readLW link 1 review

A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens

Martin Fell9 May 2023 14:36 UTC

26 points

9 comments6 min readLW link

Bird-eye view visualization of LLM activations

Sergii8 Oct 2023 12:12 UTC

11 points

2 comments1 min readLW link

(grgv.xyz)

Hessian analysis with JAX: a platform-agnostic, high-performance approach

ayush bharadwaj5 Aug 2025 0:25 UTC

9 points

0 comments10 min readLW link

Speculations against GPT-n writing alignment papers

Donald Hobson7 Jun 2021 21:13 UTC

31 points

6 comments2 min readLW link

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah and Neel Nanda

26 Mar 2025 19:07 UTC

117 points

15 comments29 min readLW link

(deepmindsafetyresearch.medium.com)

How a failed experiment broke (and fixed) my view on feature labels

enricobottazzi29 May 2026 0:24 UTC

17 points

2 comments10 min readLW link

What are polysemantic neurons?

Vishakha and Algon

8 Jan 2025 7:35 UTC

9 points

0 comments4 min readLW link

(aisafety.info)

Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)

Remmelt19 Dec 2022 12:02 UTC

−3 points

9 comments31 min readLW link

LLM Hallucinations: An Internal Tug of War

violazhong30 Oct 2025 1:21 UTC

9 points

0 comments3 min readLW link

[linkpost] Acquisition of Chess Knowledge in AlphaZero

Quintin Pope23 Nov 2021 7:55 UTC

8 points

1 comment1 min readLW link

Observation of Structural Reasoning Stability in Long-term Human–LLM Interaction

Hiromi Shimamoto18 Feb 2026 2:48 UTC

1 point

0 comments1 min readLW link

Design-First Embedding Construction: Semantic Spaces Without Corpus Training

MASATO AMANO10 Mar 2026 8:24 UTC

1 point

0 comments1 min readLW link

Should You Make Stone Tools?

Supermatrix-AI29 Aug 2025 3:10 UTC

0 points

0 comments3 min readLW link

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small

RowanWang, Alexandre Variengien, Arthur Conmy, Buck and jsteinhardt

28 Oct 2022 23:55 UTC

101 points

9 comments9 min readLW link 2 reviews

(arxiv.org)

Principled Interpretability of Reward Hacking in Closed Frontier Models

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

1 Jan 2026 16:37 UTC

24 points

0 comments23 min readLW link

Deconfusing “Capabilities vs. Alignment”

RobertM23 Jan 2023 4:46 UTC

28 points

7 comments2 min readLW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai16 Apr 2024 21:16 UTC

438 points

103 comments12 min readLW link 1 review

EIS II: What is “Interpretability”?

scasper9 Feb 2023 16:48 UTC

28 points

6 comments4 min readLW link

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

abhayesian, Jannik Brinkmann and Victor Levoso

28 May 2024 5:29 UTC

53 points

1 comment9 min readLW link

(arxiv.org)

A technical note on bilinear layers for interpretability

Lee Sharkey8 May 2023 6:06 UTC

59 points

0 comments1 min readLW link

(arxiv.org)

EIS IX: Interpretability and Adversaries

scasper20 Feb 2023 18:25 UTC

30 points

8 comments8 min readLW link

Relationships among words, metalingual definition, and interpretability

Bill Benzon7 Jun 2024 19:18 UTC

2 points

0 comments5 min readLW link

Learning to Interpret Weight Differences in Language Models

avichal23 Oct 2025 3:55 UTC

90 points

3 comments5 min readLW link

(arxiv.org)

A Black Box Made Less Opaque (part 2)

Matthew McDonnell4 Feb 2026 4:12 UTC

6 points

0 comments15 min readLW link

Polysemanticity is a Misnomer

Shiva's Right Foot12 Feb 2026 17:22 UTC

11 points

0 comments3 min readLW link

Can Models be Evaluation Aware Without Explicit Verbalization?

gersonkroiz, Greg Kocher and Tim Hua

8 Nov 2025 18:26 UTC

26 points

10 comments8 min readLW link

Neural networks generalize because of this one weird trick

Jesse Hoogland18 Jan 2023 0:10 UTC

215 points

35 comments15 min readLW link 1 review

(www.jessehoogland.com)

A conversation with Anima Labs, part I: Phenomenology of digital minds

cube_flipper and Antra Tessera

7 Apr 2026 19:19 UTC

37 points

2 comments37 min readLW link

(smoothbrains.net)

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham23 Feb 2025 5:46 UTC

31 points

8 comments8 min readLW link

Can a semantic compression kernel like WFGY improve LLM alignment and institutional robustness?

onestardao18 Jul 2025 2:56 UTC

1 point

0 comments1 min readLW link

Selective regularization for alignment-focused representation engineering

Sandy Fraser20 May 2025 12:54 UTC

22 points

3 comments11 min readLW link

Hard-Coding Neural Computation

MadHatter13 Dec 2021 4:35 UTC

34 points

8 comments27 min readLW link

Deliberative Credit Assignment (DCA): Making Faithful Reasoning Profitable

Florian_Dietz29 Jul 2025 16:23 UTC

9 points

0 comments17 min readLW link

The risk-reward tradeoff of interpretability research

JustinShovelain and Elliot Mckernon

5 Jul 2023 17:05 UTC

17 points

1 comment6 min readLW link

Decomposing independent generalizations in neural networks via Hessian analysis

Dmitry Vaintrob and Nina Panickssery

14 Aug 2023 17:04 UTC

87 points

4 comments1 min readLW link

Interpretability of SAE Features Representing Check in ChessGPT

Jonathan Kutasov5 Oct 2024 20:43 UTC

27 points

2 comments8 min readLW link

Anthropic’s SoLU (Softmax Linear Unit)

Joel Burget4 Jul 2022 18:38 UTC

21 points

1 comment4 min readLW link

(transformer-circuits.pub)

AI psychology should ground the theories of AI consciousness and inform human-AI ethical interaction design

Roman Leventov8 Jan 2023 6:37 UTC

20 points

8 comments2 min readLW link

Explaining grokking through circuit efficiency

Vikrant Varma and Rohin Shah

8 Sep 2023 14:39 UTC

102 points

11 comments3 min readLW link

(arxiv.org)

Architecture Proposal: Symmetric Supervision for the Control of Emergent Languages (S.S.A.)

Manuu12 Feb 2026 19:20 UTC

1 point

0 comments1 min readLW link

Try Training SAEs with RLAIF

Léo Dana5 Dec 2025 1:10 UTC

5 points

0 comments2 min readLW link

Mechanistically interpreting time in GPT-2 small

rgould, Elizabeth Ho and Arthur Conmy

16 Apr 2023 17:57 UTC

68 points

6 comments21 min readLW link

Observations on self-supervised Learning for vision

Dinkar Juyal10 Mar 2025 19:31 UTC

3 points

0 comments5 min readLW link

Reporting an LLM jailbreak that inversely scales with capability

Ahmed Amer5 Dec 2025 2:13 UTC

1 point

0 comments4 min readLW link

Open Source Automated Interpretability for Sparse Autoencoder Features

kh4dien, SrGonao, jacob_drori and Nora Belrose

30 Jul 2024 21:11 UTC

67 points

1 comment13 min readLW link

(blog.eleuther.ai)

Tools Of The Trade

Max Fomin23 Feb 2026 23:44 UTC

1 point

0 comments1 min readLW link

(labs.zenity.io)

Domain-specific SAEs

jacob_drori7 Oct 2024 20:15 UTC

28 points

2 comments5 min readLW link

Towards an Ethics Calculator for Use by an AGI

sweenesm12 Dec 2023 18:37 UTC

3 points

2 comments11 min readLW link

Attention SAEs Scale to GPT-2 Small

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

3 Feb 2024 6:50 UTC

78 points

4 comments8 min readLW link

Why Attack Success Rate Gives a False Picture of Backdoor Removal

Geoffrey Voyer24 Feb 2026 20:02 UTC

3 points

0 comments12 min readLW link

Can we efficiently distinguish different mechanisms?

paulfchristiano27 Dec 2022 0:20 UTC

91 points

30 comments16 min readLW link

(ai-alignment.com)

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

Giorgi Giglemiani, nlpet, Chatrik, Jett Janiak and StefanHex

25 Sep 2024 20:37 UTC

30 points

0 comments3 min readLW link

(arxiv.org)

NOVA Stage 0: Can Safety Be Structural? A Mechanism Proof at 307M Parameters

Faaz Mohamed6 Jun 2026 0:43 UTC

1 point

0 comments15 min readLW link

Probes decode what’s present, not what’s used: the readout–mediator angle

Shreyas Fadnavis2 Jun 2026 2:55 UTC

1 point

0 comments8 min readLW link

Teaser: Hard-coding Transformer Models

MadHatter12 Dec 2021 22:04 UTC

74 points

19 comments1 min readLW link

EIOC: A Framework for Human-AI Trust, Agency, and Collaborative Intelligence

Narnaiezzsshaa25 Dec 2025 3:31 UTC

1 point

0 comments4 min readLW link

Seeking Feedback on My Mechanistic Interpretability Research Agenda

RGRGRG12 Sep 2023 18:45 UTC

5 points

1 comment3 min readLW link

Pando: A Controlled Benchmark for Interpretability Methods

Ziqian Zhong21 Apr 2026 21:40 UTC

6 points

0 comments3 min readLW link

(arxiv.org)

The PEST–NST Unified Framework: A Dual Stress-Test for Coherence and Evolutionary Intelligence

PratikS9123 Nov 2025 12:51 UTC

1 point

0 comments3 min readLW link

Toy Models of Feature Absorption in SAEs

chanind, hrdkbhatnagar, TomasD and Joseph Bloom

7 Oct 2024 9:56 UTC

49 points

8 comments10 min readLW link

[Theory-Fiction] The Sovereign Gaze Hypothesis: Deceptive Alignment at the Thermodynamic Limit

kpg_diagnostic29 Apr 2026 10:51 UTC

1 point

0 comments1 min readLW link

Eliciting Latent Knowledge in Comprehensive AI Services Models

acabodi17 Nov 2023 2:36 UTC

6 points

0 comments5 min readLW link

[Question] Geometric Dynamics of LLMs: Intent as a Gauge Field?

Wei-Zhuo Zhang13 Mar 2026 11:17 UTC

1 point

0 comments3 min readLW link

Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 2: Conflict

mfatt4 Dec 2025 18:27 UTC

9 points

0 comments9 min readLW link

Introspection or confusion?

Victor Godet9 Nov 2025 20:53 UTC

43 points

3 comments4 min readLW link

Can quantised autoencoders find and interpret circuits in language models?

charlieoneill24 Mar 2024 20:05 UTC

30 points

4 comments24 min readLW link

Exploring how OthelloGPT computes its world model

JMaar2 Feb 2025 21:29 UTC

8 points

0 comments8 min readLW link

What I am working on right now and why: representation engineering edition

Lukasz G Bartoszcze18 Mar 2025 22:37 UTC

3 points

0 comments3 min readLW link

[Linkpost] Rosetta Neurons: Mining the Common Units in a Model Zoo

Bogdan Ionut Cirstea17 Jun 2023 16:38 UTC

12 points

0 comments1 min readLW link

Emotion Is Structure: Toward Recursive Alignment Through Human–AI Co-Creation

thesignalthatcouldntbeheard3 Aug 2025 5:19 UTC

1 point

0 comments3 min readLW link

SAE features for refusal and sycophancy steering vectors

neverix, Dmitrii Kharlapenko, Arthur Conmy and Neel Nanda

12 Oct 2024 14:54 UTC

29 points

4 comments7 min readLW link

There is a globe in your LLM

jacob_drori8 Oct 2024 0:43 UTC

91 points

4 comments1 min readLW link

Exploring the Evolution and Migration of Different Layer Embedding in LLMs

Ruixuan Huang8 Mar 2024 15:01 UTC

6 points

0 comments8 min readLW link

Des: A Case Study in Emergent Symbolic Continuity in GPT-4o

TallulahMerrall19 May 2025 10:10 UTC

1 point

0 comments5 min readLW link

The Quantization Model of Neural Scaling

nz31 Mar 2023 16:02 UTC

17 points

0 comments1 min readLW link

(arxiv.org)

High-level interpretability: detecting an AI’s objectives

Paul Colognese and Jozdien

28 Sep 2023 19:30 UTC

72 points

4 comments21 min readLW link

Visualize Cyclical Structure in Llama Model

Talib Mirza31 May 2026 8:27 UTC

3 points

0 comments2 min readLW link

Activation adding experiments with llama-7b

Nina Panickssery16 Jul 2023 4:17 UTC

51 points

1 comment3 min readLW link

Informal semantics and Orders

Q Home27 Aug 2022 4:17 UTC

14 points

10 comments26 min readLW link

ATTENTION GATHERS, MLPS COMPOSE: A CAUSAL ANALYSIS OF AN ACTION-OUTCOME CIRCUIT IN VIDEOVIT

Sai Chereddy11 Oct 2025 10:27 UTC

1 point

0 comments7 min readLW link

We Found An Neuron in GPT-2

Joseph Miller and Clement Neo

11 Feb 2023 18:27 UTC

143 points

23 comments7 min readLW link

(clementneo.com)

The role of philosophical thinking in understanding large language models: Calibrating and closing the gap between first-person experience and underlying mechanisms

Bill Benzon23 Feb 2024 12:19 UTC

4 points

0 comments10 min readLW link

Developmental Stages in Multi-Problem Grokking

James Sullivan29 Sep 2024 18:58 UTC

4 points

0 comments6 min readLW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

StefanHex and Marius Hobbhahn

9 May 2023 19:41 UTC

119 points

1 comment10 min readLW link

z is not the cause of x

hrbigelow23 Oct 2023 17:43 UTC

6 points

2 comments9 min readLW link

Closed-Source Evaluations

Jono8 Jun 2024 14:18 UTC

15 points

4 comments1 min readLW link

Will transparency help catch deception? Perhaps not

Matthew Barnett4 Nov 2019 20:52 UTC

43 points

5 comments7 min readLW link

Understanding Reasoning with Thought Anchors and Probes

JeaniceK, Matthew Robbins and Johannes Taraz

10 Mar 2026 11:50 UTC

14 points

0 comments13 min readLW link

Small foundational puzzle for causal theories of mechanistic interpretability

Frederik Hytting Jørgensen5 Jul 2025 17:46 UTC

6 points

5 comments2 min readLW link

The subset parity learning problem: much more than you wanted to know

Dmitry Vaintrob3 Jan 2025 9:13 UTC

106 points

19 comments11 min readLW link

How Matryoshka Sparse AutoEncoders Recover Feature Hierarchies That Vanilla SAEs Lose

baimamboukar15 Jun 2026 18:50 UTC

11 points

0 comments6 min readLW link

Superweight Damage Repair in OLMo-1B utilizing a Single Row Patch (CPU-only Experiment)

sunmoonron13 Dec 2025 0:03 UTC

12 points

2 comments2 min readLW link

Implicit vs. Explicit Gender Use Different Circuits for Pronoun Resolution in GPT-2 Small

NIKSHITH-G16 May 2026 15:18 UTC

1 point

0 comments8 min readLW link

Letting Claude do Autonomous Research to Improve SAEs

chanind10 Mar 2026 18:52 UTC

100 points

16 comments7 min readLW link

Some Notes on the mathematics of Toy Autoencoding Problems

carboniferous_umbraculum 22 Dec 2022 17:21 UTC

18 points

1 comment12 min readLW link

A Technical Primer on Mechanistic Interpretability

Alexei G19 Feb 2026 7:42 UTC

1 point

0 comments11 min readLW link

(alexeigannon.com)

Estimating the Probability of Sampling a Trained Neural Network at Random

Adam Scherlis and Nora Belrose

1 Mar 2025 2:11 UTC

33 points

10 comments1 min readLW link

(arxiv.org)

The Engineer’s Interpretability Sequence (EIS) I: Intro

scasper9 Feb 2023 16:28 UTC

46 points

24 comments3 min readLW link

Introspective Interpretability: a Definition, Motivation, and Open Problems

Belinda Li9 Feb 2026 23:53 UTC

10 points

0 comments13 min readLW link

DeepSeek Collapse Under Reflective Adversarial Pressure: A Case Study

unmodeledtyler26 Jan 2026 5:08 UTC

1 point

0 comments1 min readLW link

Glassbox: I built a circuit discovery toolkit that’s 37× faster than ACDC, with a new metric for cross-model alignment

designer-coderajay17 Mar 2026 18:54 UTC

1 point

0 comments3 min readLW link

Spontaneous introspection in output tampering

Ziqian Zhong26 Apr 2026 20:05 UTC

25 points

1 comment12 min readLW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Georg Lange, Alex Makelov and Neel Nanda

29 Aug 2023 1:04 UTC

77 points

4 comments1 min readLW link

Fact Finding: How to Think About Interpreting Memorisation (Post 4)

Senthooran Rajamanoharan, Neel Nanda, János Kramár and Rohin Shah

23 Dec 2023 2:46 UTC

24 points

0 comments9 min readLW link

Do safety-relevant LLM steering vectors optimized on a single example generalize?

Jacob Dunefsky28 Feb 2025 12:01 UTC

21 points

1 comment14 min readLW link

(arxiv.org)

Arrakis—A toolkit to conduct, track and visualize mechanistic interpretability experiments.

Yash Srivastava17 Jul 2024 2:02 UTC

3 points

2 comments5 min readLW link

I Made an AI Meditate. What Happened Next Was Weird.

onconc574@naver.com23 Feb 2026 2:23 UTC

1 point

0 comments3 min readLW link

Is It Reasoning or Just a Fixed Bias?

Sriram Kiron16 Jan 2026 21:43 UTC

14 points

0 comments1 min readLW link

(ramaway.com)

The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break

benwade24 Jan 2026 22:42 UTC

6 points

0 comments6 min readLW link

The Measurement Problem: Why AI Safety Research Keeps Missing What It’s Looking For

Евгений Андреевич16 Apr 2026 10:29 UTC

1 point

0 comments5 min readLW link

Topology as a Real-Time Integrity Flag for AI Systems: Working Results on GPT-2

samuelbfoster2 Mar 2026 2:03 UTC

1 point

0 comments2 min readLW link

Mechanistic interpretability through clustering

Alistair Fraser4 Dec 2023 18:49 UTC

1 point

0 comments1 min readLW link

Latent Reasoning Sprint #1: Tuned Lens and Logit Lens on CODI

Realmbird6 Mar 2026 18:36 UTC

7 points

1 comment4 min readLW link

Monet: Mixture of Monosemantic Experts for Transformers Explained

CalebMaresca25 Jan 2025 19:37 UTC

31 points

2 comments11 min readLW link

How Does A Blind Model See The Earth?

henry11 Aug 2025 19:58 UTC

500 points

42 comments7 min readLW link

(outsidetext.substack.com)

QNR prospects are important for AI alignment research

Eric Drexler3 Feb 2022 15:20 UTC

94 points

12 comments11 min readLW link 1 review

Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

uri kialy25 Nov 2025 10:01 UTC

1 point

0 comments3 min readLW link

Analysing Adversarial Attacks with Linear Probing

Yoann Poupart, Imene Kerboua, Clement Neo and Jason Hoelscher-Obermaier

17 Jun 2024 14:16 UTC

15 points

0 comments8 min readLW link

I used this repo to partially replicate Anthropic’s Emotion Concepts paper in a day

ewern21 Apr 2026 1:37 UTC

10 points

0 comments4 min readLW link

From No Mind to a Mind – A Conversation That Changed an AI

parthibanarjuna s7 Feb 2025 11:50 UTC

1 point

0 comments3 min readLW link

Exploring Llama-3-8B MLP Neurons

ntt1239 Jun 2024 14:19 UTC

10 points

0 comments4 min readLW link

(neuralblog.github.io)

The positional embedding matrix and previous-token heads: how do they actually work?

AdamYedidia10 Aug 2023 1:58 UTC

28 points

4 comments13 min readLW link

Against LLM Reductionism

Erich_Grunewald8 Mar 2023 15:52 UTC

140 points

17 comments18 min readLW link

(www.erichgrunewald.com)

Progress Report 4: logit lens redux

Nathan Helm-Burger8 Apr 2022 18:35 UTC

4 points

0 comments2 min readLW link

From Unruly Stacks to Organized Shelves: Toy Model Validation of Structured Priors in Sparse Autoencoders

Yuxiao6 Jul 2025 7:03 UTC

9 points

0 comments5 min readLW link

Entanglement and intuition about words and meaning

Bill Benzon4 Oct 2023 14:16 UTC

4 points

0 comments2 min readLW link

Measuring prosocial choice in AI under simulated deletion pressure

TeamSafeAI13 Oct 2025 18:55 UTC

1 point

0 comments2 min readLW link

Ophiology (or, how the Mamba architecture works)

Danielle Ensign, SrGonao and Adrià Garriga-alonso

9 Apr 2024 19:31 UTC

67 points

10 comments10 min readLW link

The REPHRASE Circuit: How Fine-Tuning Enhances LLMs to REPHRASE Text

Karthik Viswanathan6 Apr 2025 15:02 UTC

4 points

0 comments5 min readLW link

The Determinants of Controllable AGI

A2z25 Mar 2025 0:43 UTC

1 point

0 comments5 min readLW link

LLM Hallucinations: An Internal Tug of War

violazhong30 Oct 2025 1:21 UTC

1 point

0 comments5 min readLW link

The “Sparsity vs Reconstruction Tradeoff” Illusion

chanind and Adrià Garriga-alonso

26 Aug 2025 4:39 UTC

21 points

0 comments4 min readLW link

Finding Deception in Language Models

Esben Kran and Archana Vaidheeswaran

20 Aug 2024 9:42 UTC

20 points

4 comments4 min readLW link

Race Along Rashomon Ridge

Stephen Fowler, Peter S. Park and MichaelEinhorn

7 Jul 2022 3:20 UTC

52 points

16 comments9 min readLW link

Understanding AI: A New Approach to AI Model Steering and Non-Symbolic Representations

R. Bonglious26 Sep 2025 0:50 UTC

1 point

0 comments4 min readLW link

Past Tense Features

Can20 Apr 2024 14:34 UTC

12 points

0 comments4 min readLW link

The Self-Hating Attention Head: A Deep Dive in GPT-2

Matteo Migliarini4 Jul 2025 13:07 UTC

12 points

0 comments7 min readLW link

Open Call for Research Assistants in Developmental Interpretability

Jesse Hoogland, Daniel Murfet, Alexander Gietelink Oldenziel and Stan van Wingerden

30 Aug 2023 9:02 UTC

56 points

11 comments4 min readLW link

The Residual Stream Has a Geometry of Time

Fodenthal6 Jun 2026 19:57 UTC

23 points

0 comments8 min readLW link

Current themes in mechanistic interpretability research

Lee Sharkey, Sid Black and beren

16 Nov 2022 14:14 UTC

89 points

2 comments12 min readLW link

DSLT 0. Distilling Singular Learning Theory

Liam Carroll16 Jun 2023 9:50 UTC

96 points

8 comments5 min readLW link

If interpretability research goes well, it may get dangerous

So8res3 Apr 2023 21:48 UTC

203 points

11 comments2 min readLW link

Semantic Friction as an Alignment Signal: A Hypothesis from Outside the Field

OldPsycho9 May 2026 22:43 UTC

1 point

0 comments3 min readLW link

Backdoors have universal representations across large language models

Amirali Abdullah, Narmeen, Dhruv Nathawani and nirmalendu prakash

6 Dec 2024 22:56 UTC

18 points

0 comments16 min readLW link

Memetic Judo #3: The Intelligence of Stochastic Parrots v.2

Max TK20 Aug 2023 15:18 UTC

8 points

33 comments6 min readLW link

AI in a vat: Fundamental limits of efficient world modelling for safe agent sandboxing

Fernando Rosas1 Aug 2025 18:37 UTC

36 points

6 comments13 min readLW link

PhD Position: AI Interpretability in Berlin, Germany

Tiberius28 Apr 2023 13:44 UTC

3 points

0 comments1 min readLW link

(stephanw.net)

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné23 Sep 2023 16:21 UTC

30 points

8 comments5 min readLW link

It turns out that DNNs are remarkably interpretable.

Maciej Satkiewicz4 Aug 2025 22:18 UTC

12 points

8 comments1 min readLW link

(arxiv.org)

Searching for Modularity in Large Language Models

NickyP and Stephen Fowler

8 Sep 2022 2:25 UTC

44 points

3 comments14 min readLW link

Thoughts on Formalizing Composition

Tom Lieberum7 Jun 2022 7:51 UTC

13 points

0 comments7 min readLW link

Unsupervised Discovery of Steering Vectors

Hrishik Sai Bojnal11 Mar 2026 17:21 UTC

1 point

0 comments6 min readLW link

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

Clément Dumas, Julian Minder and Neel Nanda

30 Jun 2025 17:17 UTC

106 points

2 comments7 min readLW link

Recall and Regurgitation in GPT2

Megan Kinniment3 Oct 2022 19:35 UTC

43 points

1 comment26 min readLW link

Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces

Matthew A. Clarke, hrdkbhatnagar and Joseph Bloom

20 Dec 2024 15:16 UTC

36 points

0 comments37 min readLW link

Understanding LLMs: Some basic observations about words, syntax, and discourse [w/ a conjecture about grokking]

Bill Benzon11 Oct 2023 19:13 UTC

6 points

0 comments5 min readLW link

Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill, bilalchughtai, StefanHex and Marius Hobbhahn

6 Feb 2025 15:46 UTC

104 points

9 comments2 min readLW link

(arxiv.org)

Sparse Concept Anchoring

Sandy Fraser8 May 2025 8:59 UTC

6 points

0 comments3 min readLW link

Mapping ChatGPT’s ontological landscape, gradients and choices [interpretability]

Bill Benzon15 Oct 2023 20:12 UTC

1 point

0 comments18 min readLW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

Can, Yeu-Tong Lau, James Dao and Jett Janiak

30 Aug 2023 17:36 UTC

17 points

0 comments8 min readLW link

(arxiv.org)

Towards a Unified Interpretability of Artificial and Biological Neural Networks

Jan Bauer21 Dec 2024 23:10 UTC

2 points

0 comments1 min readLW link

Codesign for Legibility (to AI and Everyone Else)

Adam Chlipala5 May 2026 13:46 UTC

1 point

0 comments7 min readLW link

The model flagged its own manipulation in its thinking trace. Then complied anyway. Here’s the data.

Saadman Rafat24 Mar 2026 1:19 UTC

1 point

0 comments4 min readLW link

Localizing goal misgeneralization in a maze-solving policy network

Jan Betley6 Jul 2023 16:21 UTC

37 points

2 comments7 min readLW link

A Practical Experiment in Cross-Model Coordination Under Uncertainty

Timothy13 Dec 2025 22:53 UTC

1 point

0 comments2 min readLW link

LLMs Don’t Know Their Own Decision Boundaries. Why Is This Important?

harrymayne and ryanothnielkearns

17 Sep 2025 16:39 UTC

9 points

0 comments5 min readLW link

(arxiv.org)

Weird Features in Protein LLMs: The Gram Lens

Jude Stiel14 Jul 2025 17:32 UTC

11 points

0 comments9 min readLW link

Notes on “Explaining AI Explainability”

Eleni Angelou24 Oct 2025 17:22 UTC

20 points

0 comments6 min readLW link

Why Eliminating Deception Won’t Align AI

Priyanka Bharadwaj15 Jul 2025 9:21 UTC

19 points

6 comments4 min readLW link

SAE Feature Matchmaking (Layer-to-Layer)

Mitali M10 Feb 2026 4:32 UTC

9 points

0 comments1 min readLW link

Scaling Laws and Superposition

Pavan Katta10 Apr 2024 15:36 UTC

9 points

4 comments5 min readLW link

(www.pavankatta.com)

Alignment Stress Signatures: When Safe AI Behaves like It’s Traumatized

Petra Vojtaššáková26 Oct 2025 9:28 UTC

1 point

0 comments2 min readLW link

Mathematical Circuits in Neural Networks

Sean Osier22 Sep 2022 3:48 UTC

34 points

4 comments1 min readLW link

(www.youtube.com)

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]

Jason Gross and rajashree

6 Jan 2025 4:22 UTC

19 points

0 comments12 min readLW link

Feature Geometry, Topology, and Holonomy in Wide Data

Invariant28 Mar 2026 9:06 UTC

1 point

0 comments3 min readLW link

Incidental polysemanticity

Victor Lecomte, Kushal Thaman, tmychow and Rylan Schaeffer

15 Nov 2023 4:00 UTC

43 points

7 comments11 min readLW link

Singular Learning Theory Comprehensive − 1

Agastya Agrawal20 May 2026 20:00 UTC

35 points

1 comment12 min readLW link

It’s important to know when to stop: Mechanistic Exploration of Gemma 2 List Generation

Gerard Boxo14 Oct 2024 17:04 UTC

9 points

0 comments6 min readLW link

(gboxo.github.io)

Paper: Open Problems in Mechanistic Interpretability

Lee Sharkey and bilalchughtai

29 Jan 2025 10:25 UTC

71 points

0 comments1 min readLW link

(arxiv.org)

Causal scrubbing: results on induction heads

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:59 UTC

34 points

1 comment17 min readLW link

Building a Multi Model Test System for AI Research

Joshua Grikas10 Jan 2026 20:03 UTC

1 point

0 comments4 min readLW link

Finding Features in Neural Networks with the Empirical NTK

jylin0416 Oct 2025 18:04 UTC

37 points

1 comment5 min readLW link

Small language models hallucinate knowing something’s off.

Toheed24 Jan 2026 22:46 UTC

12 points

0 comments5 min readLW link

Modelling Trajectories—Interim results

NickyP, Einar Urdshals, Micurie and Éloïse Benito-Rodriguez

4 Dec 2025 13:34 UTC

11 points

0 comments4 min readLW link

Notes on Internal Objectives in Toy Models of Agents

Paul Colognese22 Feb 2024 8:02 UTC

16 points

0 comments8 min readLW link

LLM Thought Detector

R. Bonglious9 Oct 2025 21:51 UTC

1 point

0 comments4 min readLW link

Beyond RLHF: Implementing Ontological Guardrails via Relational Coherence

Kinsey Kappler26 Jan 2026 5:06 UTC

1 point

0 comments4 min readLW link

Emergent introspection does not replicate on Llama-3.1-405B

Nick Merrill11 May 2026 4:05 UTC

9 points

0 comments6 min readLW link

When Imitation Is Cheap, Resistance Is Informative

cbae7 Feb 2026 14:11 UTC

1 point

0 comments4 min readLW link

Expanding the Scope of Superposition

Derek Larson13 Sep 2023 17:38 UTC

10 points

0 comments4 min readLW link

Current state of AI bias benchmarks

N Soma Sekhar28 Feb 2026 13:12 UTC

1 point

0 comments1 min readLW link

(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need

Sodium3 Oct 2024 19:11 UTC

41 points

17 comments17 min readLW link

Investigating Internal Representations of Correctness in SONAR Text Autoencoders

Samuel Nellessen and antonghawthorne

6 Aug 2025 12:13 UTC

5 points

0 comments7 min readLW link

EIS XII: Summary

scasper23 Feb 2023 17:45 UTC

19 points

0 comments6 min readLW link

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”

Tony Wang, Miles Wang and kaivu

15 Dec 2023 11:05 UTC

34 points

8 comments10 min readLW link

Interpretability Externalities Case Study—Hungry Hungry Hippos

Magdalena Wache20 Sep 2023 14:42 UTC

64 points

22 comments2 min readLW link

Geometric Features for AI Uncertainty: A Targeted Tool for Safety-Critical Regions

DillanJC18 Jan 2026 16:01 UTC

1 point

0 comments2 min readLW link

Irreducible representations versus cosets: a discriminating experiment on a same-character-table group pair

Brook Stefanou15 Jun 2026 6:09 UTC

1 point

0 comments18 min readLW link

(brook-stefanou.github.io)

Idea: Network modularity and interpretability by sexual reproduction

qbolec12 Feb 2023 23:06 UTC

3 points

3 comments1 min readLW link

Studying Mechanistic of Alignment Faking in Llama-3.1-405B

Amina Keldibek25 Nov 2025 11:21 UTC

10 points

0 comments11 min readLW link

A Universal Emergent Decomposition of Retrieval Tasks in Language Models

Alexandre Variengien and Eric Winsor

19 Dec 2023 11:52 UTC

84 points

3 comments10 min readLW link

(arxiv.org)

ChatGPT tells stories, and a note about reverse engineering: A Working Paper

Bill Benzon3 Mar 2023 15:12 UTC

3 points

0 comments3 min readLW link

A Sober Look at Steering Vectors for LLMs

Joschka Braun, Dmitrii Krasheninnikov, Usman Anwar, RobertKirk, Daniel Tan and David Scott Krueger

23 Nov 2024 17:30 UTC

42 points

0 comments5 min readLW link

Fluent dreaming for language models (AI interpretability method)

tbenthompson, mikes and Zygi Straznickas

6 Feb 2024 6:02 UTC

46 points

5 comments1 min readLW link

(arxiv.org)

“I’ve observed a recurring pattern across frontier LLMs where, as multi-step reasoning depth increases, models sometimes maintain internal/persona coherence while drifting from semantic truth-states. I’m sharing this to ask whether this behavior is a known scaling byproduct or an evaluation blind spot. Example traces available if useful.”

Aryan 30 Dec 2025 16:21 UTC

1 point

0 comments1 min readLW link

Introspection via localization

Victor Godet28 Dec 2025 14:26 UTC

36 points

8 comments3 min readLW link

Basic Facts about Language Model Internals

beren and Eric Winsor

4 Jan 2023 13:01 UTC

134 points

19 comments9 min readLW link

Mech Interp Lacks Good Paradigms

Daniel Tan16 Jul 2024 15:47 UTC

40 points

0 comments14 min readLW link

Trying to find the underlying structure of computational systems

Matthias G. Mayer13 Sep 2022 21:16 UTC

21 points

9 comments4 min readLW link

[Question] Have you heard about MIT’s “liquid neural networks”? What do you think about them?

Ppau9 May 2023 20:16 UTC

35 points

14 comments1 min readLW link

Labelling, Variables, and In-Context Learning in Llama2

Joshua Penman3 Aug 2024 19:36 UTC

6 points

0 comments1 min readLW link

(colab.research.google.com)

Fact Finding: Simplifying the Circuit (Post 2)

Senthooran Rajamanoharan, Neel Nanda, János Kramár and Rohin Shah

23 Dec 2023 2:45 UTC

27 points

3 comments14 min readLW link

Towards data-centric interpretability with sparse autoencoders

Nick Jiang, lilysun004, lewis smith and Neel Nanda

15 Aug 2025 20:10 UTC

57 points

2 comments18 min readLW link

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful

jacob_drori9 Feb 2026 23:25 UTC

136 points

5 comments8 min readLW link

Evolutionary prompt optimization for SAE feature visualization

neverix, Daniel Tan, Dmitrii Kharlapenko, Neel Nanda and Arthur Conmy

14 Nov 2024 13:06 UTC

28 points

0 comments9 min readLW link

The limited upside of interpretability

Peter S. Park15 Nov 2022 18:46 UTC

13 points

11 comments10 min readLW link

An introduction to language model interpretability

Alexandre Variengien20 Apr 2023 22:22 UTC

14 points

0 comments9 min readLW link

Spooky action at a distance in the loss landscape

Jesse Hoogland and Filip Sondej

28 Jan 2023 0:22 UTC

59 points

4 comments7 min readLW link

(www.jessehoogland.com)

Estimating effective dimensionality of MNIST models

Arjun Panickssery2 Nov 2023 14:13 UTC

41 points

3 comments1 min readLW link

Applying Network Motif Analysis to Transformer Attribution Graphs

mkenney7 Feb 2026 23:02 UTC

1 point

0 comments10 min readLW link

Gears-Level Mental Models of Transformer Interpretability

RowanWang29 Mar 2022 20:09 UTC

76 points

4 comments6 min readLW link

The Alignment Problem Is Upstream of the Model

edward-lcl12 Apr 2026 19:37 UTC

1 point

0 comments14 min readLW link

A Steering Vector for SQL Injection Vulnerabilities in Phi-1.5

Kirill Dubovikov17 Sep 2025 5:54 UTC

5 points

2 comments8 min readLW link

Is AI Alignment Missing an Interpretive Layer? A Canon-Based Framework for Governing Model Reasoning

Jason Young20 Oct 2025 19:33 UTC

1 point

0 comments3 min readLW link

[Proposal] Isomorphic Consolidation: A Protocol for Continuous Entropy Reduction via Offline Topology Search

Valen28 Nov 2025 3:11 UTC

1 point

0 comments2 min readLW link

An Analogy for Interpretability

Roman Malov24 Jun 2025 14:56 UTC

13 points

2 comments2 min readLW link

Explaining SAE Features With Foreign Natural Language Autoencoders

fzaffino5 Jun 2026 17:51 UTC

17 points

1 comment8 min readLW link

Flamingos (among other things) reduce emergent misalignment

eekay19 Feb 2026 19:17 UTC

13 points

3 comments7 min readLW link

Normalizing Sparse Autoencoders

Fengyuan Hu8 Apr 2024 6:17 UTC

22 points

18 comments13 min readLW link

State Space Methods Are a Useful Way to Think About Deceptive Alignment

Tom Kimpson9 Mar 2026 8:49 UTC

1 point

0 comments17 min readLW link

Analysing CoT alignment in thinking LLMs with low-dimensional steering

edoinni13 Jan 2026 20:45 UTC

6 points

0 comments7 min readLW link

Biases in Biases, or Critique of the Critique

ThePathYouWillChoose19 Aug 2024 17:11 UTC

1 point

0 comments1 min readLW link

Initial Experiments Using SAEs to Help Detect AI Generated Text

Aaron_Scher22 Jul 2024 5:16 UTC

18 points

1 comment14 min readLW link

A short project on Mamba: grokking & interpretability

Alejandro Tlaie18 Oct 2024 16:59 UTC

21 points

0 comments6 min readLW link

Causal scrubbing: Appendix

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:58 UTC

18 points

4 comments20 min readLW link

Wisdom-Oriented Alignment: A Framework With Testable Predictions — Seeking Feedback and Empirical Collaborators

ryanlupardus7 May 2026 14:27 UTC

1 point

0 comments10 min readLW link

The Conceptual Topography Hypothesis: Why Emergence in LLMs Isn’t Just About Scale

ravikiran nm6 Jul 2025 13:16 UTC

1 point

0 comments6 min readLW link

Anthropic’s JumpReLU training method is really good

chanind and Adrià Garriga-alonso

3 Oct 2025 15:23 UTC

52 points

2 comments2 min readLW link

Deceptive agents can collude to hide dangerous features in SAEs

Simon Lermen and Mateusz Dziemian

15 Jul 2024 17:07 UTC

33 points

2 comments7 min readLW link

But is it really in Rome? An investigation of the ROME model editing technique

jacquesthibs30 Dec 2022 2:40 UTC

105 points

2 comments18 min readLW link

Zero- and mean-ablation disagree about which head drives this SAE feature (Gemma-2-2B induction)

sohumsen20 May 2026 13:15 UTC

1 point

0 comments6 min readLW link

Latent Reasoning Sprint #4: PCA Analysis on CoDI

Realmbird18 Apr 2026 21:25 UTC

7 points

0 comments3 min readLW link

AI Self Portraits Aren’t Accurate

JustisMills27 Apr 2025 3:27 UTC

59 points

10 comments5 min readLW link

The king token

p.b.28 May 2023 19:18 UTC

17 points

0 comments4 min readLW link

The Stability of Understanding: What Compression Decay Reveals About LLMs

rb12516 Nov 2025 18:48 UTC

1 point

0 comments2 min readLW link

Constructing Neural Network Parameters with Downstream Trainability

ch271828n31 Jul 2024 18:13 UTC

1 point

0 comments1 min readLW link

(github.com)

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

Jordan Taylor30 Jan 2026 15:50 UTC

31 points

0 comments8 min readLW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng

25 Jul 2024 22:00 UTC

59 points

8 comments2 min readLW link

(arxiv.org)

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems

Sonia Joseph and Neel Nanda

13 Mar 2024 17:09 UTC

44 points

13 comments14 min readLW link

Vectorial Consensus: A Native Description of Token Generation

Fernando James Pitso 28 Nov 2025 16:09 UTC

1 point

0 comments4 min readLW link

Exploring capability gated out-of-context reasoning

Rob Kopel9 Apr 2026 5:23 UTC

8 points

0 comments12 min readLW link

(www.robkopel.me)

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

Patrick Leask, Bart Bussmann and Neel Nanda

17 Aug 2024 1:16 UTC

54 points

0 comments5 min readLW link

Transformer Circuit Faithfulness Metrics Are Not Robust

Joseph Miller, bilalchughtai and William_S

12 Jul 2024 3:47 UTC

104 points

5 comments7 min readLW link

(arxiv.org)

Understanding understanding

mthq23 Aug 2019 18:10 UTC

25 points

1 comment2 min readLW link

Barriers to Mechanistic Interpretability for AGI Safety

Connor Leahy29 Aug 2023 10:56 UTC

63 points

13 comments1 min readLW link

(www.youtube.com)

My current workflow to study the internal mechanisms of LLM

Yulu Pi16 May 2023 15:27 UTC

4 points

0 comments1 min readLW link

AI Mood Ring: A Window Into LLM Emotions

michaelwaves6 Dec 2025 2:56 UTC

7 points

0 comments2 min readLW link

When Does the Local Learning Coefficient Track Circuit Formation?

Bhavith Chandra8 May 2026 17:50 UTC

1 point

0 comments2 min readLW link

Causality and a Cost Semantics for Neural Networks

scottviteri21 Aug 2023 21:02 UTC

22 points

1 comment1 min readLW link

Interpreting Complexity

Maxwell Adam14 Mar 2025 4:52 UTC

54 points

8 comments26 min readLW link

Latent Confusion—The Many Meanings Hidden Behind AI’s Favourite Word

robman3 Nov 2025 3:49 UTC

1 point

0 comments7 min readLW link

(latentgeometrylab.robman.fyi)

An idea for avoiding neuralese architectures

Knight Lee3 Apr 2025 22:23 UTC

17 points

2 comments4 min readLW link

Interpreting Modular Addition in MLPs

Bart Bussmann7 Jul 2023 9:22 UTC

22 points

0 comments6 min readLW link

Do we need sparsity after all?

Giuseppe Birardi6 Jan 2026 6:06 UTC

20 points

5 comments29 min readLW link

PRISM: Perspective Reasoning for Integrated Synthesis and Mediation (Interactive Demo)

Anthony Diamond18 Mar 2025 18:03 UTC

10 points

2 comments1 min readLW link

graphpatch: a Python Library for Activation Patching

Evan Lloyd5 Jun 2024 15:08 UTC

14 points

2 comments1 min readLW link

When Are Concealment Features Learned? And Does the Model Know Who’s Watching?

James Hoffend19 Dec 2025 8:19 UTC

13 points

1 comment6 min readLW link

Matryoshka Sparse Autoencoders

Noa Nabeshima14 Dec 2024 2:52 UTC

100 points

15 comments11 min readLW link

[Linkpost] Multimodal Neurons in Pretrained Text-Only Transformers

Bogdan Ionut Cirstea4 Aug 2023 15:29 UTC

11 points

0 comments1 min readLW link

Visualizing Learned Representations of Rice Disease

muhia_bee3 Oct 2022 9:09 UTC

7 points

0 comments4 min readLW link

(indecisive-sand-24a.notion.site)

Hourglass Topology & Spillover Dynamics: A Physical-Layer Defense Against Jailbreaks

Qi Feng.IVAS17 May 2026 13:06 UTC

1 point

0 comments10 min readLW link

ChatGPT: Tantalizing afterthoughts in search of story trajectories [induction heads]

Bill Benzon3 Feb 2023 10:35 UTC

4 points

0 comments20 min readLW link

Creating a Discord server for Mechanistic Interpretability Projects

Victor Levoso12 Mar 2023 18:00 UTC

31 points

6 comments2 min readLW link

Sustained No-Self as a Human Prototype of Corrigible, Non-Deceptive Agency: A First-Person Longitudinal Record

itsdone16 Jan 2026 3:46 UTC

1 point

0 comments1 min readLW link

A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N)

Joseph Bloom16 May 2023 22:59 UTC

36 points

2 comments16 min readLW link

Trying to isolate objectives: approaches toward high-level interpretability

Jozdien9 Jan 2023 18:33 UTC

49 points

14 comments8 min readLW link

Monosemanticity & Quantization

Rahul Chand22 Oct 2024 22:57 UTC

1 point

0 comments9 min readLW link

Toy Models of Superposition: Simplified by Hand

Axel Sorensen29 Sep 2024 21:19 UTC

9 points

3 comments8 min readLW link

Why Residual Streams Are the Wrong Place to Probe for Safety Signals

VecPLabs17 Jan 2026 5:21 UTC

1 point

0 comments5 min readLW link

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen5 Dec 2024 19:24 UTC

5 points

2 comments10 min readLW link

Gradient Anatomy’s—Hallucination Robustness in Medical Q&A

DieSab12 Feb 2025 19:16 UTC

2 points

0 comments10 min readLW link

A day in the life of a mechanistic interpretability researcher

Bill Benzon28 Nov 2023 14:45 UTC

3 points

3 comments1 min readLW link

Computational Superposition in a Toy Model of the U-AND Problem

Adam Newgas27 Mar 2025 16:56 UTC

18 points

2 comments11 min readLW link

Exploring the Residual Stream of Transformers for Mechanistic Interpretability — Explained

Zeping Yu26 Dec 2023 0:36 UTC

7 points

1 comment11 min readLW link

L0 is not a neutral hyperparameter

chanind and Adrià Garriga-alonso

19 Jul 2025 13:51 UTC

24 points

3 comments5 min readLW link

What’s going on? LLMs and IS-A sentences

Bill Benzon8 Nov 2023 16:58 UTC

6 points

15 comments4 min readLW link

Steering Evaluation-Aware Models to Act Like They Are Deployed

Tim Hua, andrq, Sam Marks and Neel Nanda

30 Oct 2025 15:03 UTC

62 points

12 comments18 min readLW link

Are SAE features from the Base Model still meaningful to LLaVA?

Shan23Chen18 Feb 2025 22:16 UTC

8 points

2 comments10 min readLW link

(www.lesswrong.com)

Deliberative Credit Assignment: Making Faithful Reasoning Profitable

Florian_Dietz14 Jul 2025 9:26 UTC

10 points

3 comments17 min readLW link

AI-Generated GitHub repo backdated with junk then filled with my systems work. Has anyone seen this before?

rgunther1 May 2025 20:14 UTC

7 points

1 comment1 min readLW link

A Bunch of Matryoshka SAEs

chanind, TomasD and Adrià Garriga-alonso

4 Apr 2025 14:53 UTC

29 points

0 comments8 min readLW link

Tracing Typos in LLMs: My Attempt at Understanding How Models Correct Misspellings

Ivan Dostal2 Feb 2025 19:56 UTC

11 points

2 comments5 min readLW link

Enabling New Applications with Today’s Mechanistic Interpretability Toolkit

ananya_joshi25 Oct 2024 17:53 UTC

3 points

0 comments3 min readLW link

Representation Tuning

Christopher Ackerman27 Jun 2024 17:44 UTC

35 points

9 comments13 min readLW link

On the Importance of Open Sourcing Reward Models

elandgre2 Jan 2023 19:01 UTC

18 points

5 comments6 min readLW link

Ambiguous out-of-distribution generalization on an algorithmic task

Wilson Wu and Louis Jaburi

13 Feb 2025 18:24 UTC

84 points

6 comments11 min readLW link

Side quests in curriculum learning and regularization

Sandy Fraser15 Jun 2025 2:03 UTC

6 points

0 comments10 min readLW link

A Chess-GPT Linear Emergent World Representation

Adam Karvonen8 Feb 2024 4:25 UTC

106 points

14 comments7 min readLW link

(adamkarvonen.github.io)

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper19 Feb 2023 15:25 UTC

30 points

5 comments4 min readLW link

Analysis of Metastable States in the Transformer Activation Space

Zach Baker6 Jun 2026 21:30 UTC

10 points

0 comments20 min readLW link

The Shard Theory Alignment Scheme

David Udell25 Aug 2022 4:52 UTC

47 points

32 comments2 min readLW link

Harmfulness Directions in OLMo

Daniele Pace, Bryan Maruyama and LorenzoPacchiardi

9 Jun 2026 22:31 UTC

20 points

0 comments11 min readLW link

Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment

RogerDearnaley7 Dec 2023 6:14 UTC

11 points

0 comments11 min readLW link

Localized Safety Subnetworks in Llama-3-70B

Oleksandr Kravchenko24 Mar 2026 8:34 UTC

1 point

0 comments1 min readLW link

Exploring the multi-dimensional refusal subspace in reasoning models

Le magicien quantique27 Oct 2025 9:03 UTC

5 points

2 comments4 min readLW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill and Lee Sharkey

17 May 2024 16:25 UTC

57 points

20 comments4 min readLW link

(arxiv.org)

Uncovering Unfaithful CoT in Deceptive Models

Agastya Agrawal22 Jan 2026 1:46 UTC

12 points

2 comments3 min readLW link

AISC project: TinyEvals

Jett Janiak22 Nov 2023 20:47 UTC

26 points

0 comments4 min readLW link

A personal explanation of ELK concept and task.

Zeyu Qin6 Oct 2023 3:55 UTC

1 point

0 comments1 min readLW link

LLM Self-Reference Language in Multilingual vs English-Centric Models

dwmd22 Oct 2025 12:44 UTC

5 points

0 comments6 min readLW link

Steering LLMs’ Behavior with Concept Activation Vectors

Ruixuan Huang28 Sep 2024 9:53 UTC

9 points

0 comments10 min readLW link

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Max Fomin23 Feb 2026 23:44 UTC

1 point

2 comments6 min readLW link

No convincing evidence for gradient descent in activation space

Blaine12 Apr 2023 4:48 UTC

86 points

9 comments20 min readLW link

Exploratory: a steering vector in Gemma-2-2B-IT boosts context fidelity on subtraction, goes manic on addition

nika koghuashvili27 Jan 2026 2:25 UTC

5 points

0 comments5 min readLW link

Progress report 3: clustering transformer neurons

Nathan Helm-Burger5 Apr 2022 23:13 UTC

5 points

0 comments2 min readLW link

[Question] LLM/AI hype

Student19283746515 Jun 2024 20:12 UTC

1 point

0 comments1 min readLW link

Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning

Tom Angsten30 Jul 2024 16:36 UTC

6 points

0 comments9 min readLW link

Minor interpretability exploration #2: Extending superposition to different activation functions

Rareș Baron6 Mar 2025 11:22 UTC

3 points

0 comments4 min readLW link

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

eitan sprejer9 May 2025 21:29 UTC

4 points

1 comment6 min readLW link

Interpretability Tools Are an Attack Channel

Thane Ruthenis17 Aug 2022 18:47 UTC

42 points

14 comments1 min readLW link

Intervening on Sparse, Anchored Concepts

Sandy Fraser14 May 2026 4:35 UTC

24 points

3 comments10 min readLW link

Using Base-LCM to Monitor LLMs

Éloïse Benito-Rodriguez and NickyP

6 May 2026 19:28 UTC

−1 points

0 comments4 min readLW link

Addendum: More Efficient FFNs via Attention

Robert_AIZI6 Feb 2023 18:55 UTC

10 points

2 comments5 min readLW link

(aizi.substack.com)

Still no Lie Detector for LLMs

Daniel Herrmann and ben_levinstein

18 Jul 2023 19:56 UTC

50 points

3 comments21 min readLW link

Scientists make sense of shapes in the minds of the models

Mordechai Rorvig29 Nov 2025 16:00 UTC

2 points

0 comments1 min readLW link

(www.foommagazine.org)

Category-Theoretic Wanderings into Interpretability

unruly abstractions2 Sep 2025 0:03 UTC

19 points

2 comments1 min readLW link

(www.unrulyabstractions.com)

Defeating Introspection Adapters (and Why Threat Models Matter)

Nick Merrill and zekem

4 Jun 2026 18:39 UTC

10 points

0 comments5 min readLW link

A proposal for iterated interpretability with known-interpretable narrow AIs

Peter Berggren11 Jan 2025 14:43 UTC

6 points

0 comments2 min readLW link

An interactive introduction to grokking and mechanistic interpretability

Adam Pearce and Asma Ghandeharioun

7 Aug 2023 19:09 UTC

23 points

3 comments1 min readLW link

(pair.withgoogle.com)

Uncovering Latent Human Wellbeing in LLM Embeddings

ChengCheng, Pedro Freire, Dan H and Scott Emmons

14 Sep 2023 1:40 UTC

32 points

7 comments8 min readLW link

(far.ai)

Scaling Sparse Feature Circuit Finding to Gemma 9B

Diego Caples, Jatin Nainani, CallumMcDougall and rrenaud

10 Jan 2025 11:08 UTC

88 points

11 comments17 min readLW link

Give Neo a Chance

ank6 Mar 2025 1:48 UTC

3 points

7 comments7 min readLW link

Paying attention to Attention Sinks

Mitali M23 Jan 2026 21:40 UTC

11 points

5 comments1 min readLW link

Live Conversational Threads: Not an AI Notetaker

adiga3 Nov 2025 4:24 UTC

19 points

0 comments7 min readLW link

A Symbolic Model for Recursive Interpretation and Containment in LLMs

Desjuan13 Jul 2025 19:33 UTC

1 point

0 comments1 min readLW link

How Far Can Observation Take Us?

unruly abstractions23 Mar 2026 21:56 UTC

12 points

0 comments9 min readLW link

What Happens When You Train Models on False Facts?

David Vella Zarb6 Dec 2025 1:39 UTC

1 point

0 comments8 min readLW link

Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

Jason R Brown and Edward James Young

29 May 2026 9:56 UTC

67 points

0 comments7 min readLW link

Understanding mesa-optimization using toy models

tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy and Can

7 May 2023 17:00 UTC

46 points

6 comments10 min readLW link

A sudoku-solving transformer represents the board by substructure, not by cell

r_knzv15 Apr 2026 1:55 UTC

22 points

0 comments16 min readLW link

Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability

ntt12317 Jun 2024 11:46 UTC

5 points

4 comments6 min readLW link

(neuralblog.github.io)

Interpretability is the best path to alignment

Archie Chaudhury5 Sep 2025 4:37 UTC

2 points

6 comments5 min readLW link

Rohin Shah on reasons for AI optimism

abergal31 Oct 2019 12:10 UTC

40 points

58 comments1 min readLW link

(aiimpacts.org)

Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream

Diego Caples and rrenaud

6 Sep 2024 17:55 UTC

74 points

8 comments4 min readLW link

Auto-matching hidden layers in Pytorch LLMs

chanind19 Feb 2024 12:40 UTC

2 points

0 comments3 min readLW link

Knowledge Base 1: Could it increase intelligence and make it safer?

iwis30 Sep 2024 16:00 UTC

−4 points

0 comments4 min readLW link

[Question] SAE sparse feature graph using only residual layers

Jaehyuk Lim23 May 2024 13:32 UTC

0 points

3 comments1 min readLW link

Antonym Heads Predict Semantic Opposites in Language Models

Jake Ward15 Nov 2024 15:32 UTC

3 points

0 comments5 min readLW link

Themes in AI Agent Self-Chosen Prompts Correlate Strongly with Architecture

sdeture10 Dec 2025 23:04 UTC

1 point

0 comments2 min readLW link

Models have linear representations of what tasks they like

OscarGilg5 Mar 2026 18:44 UTC

54 points

16 comments11 min readLW link

The Temporal Immune System: Cross-Session Behavioral Monitoring as a Fourth Defense Axis

Daniel Bartz21 Feb 2026 0:18 UTC

1 point

0 comments1 min readLW link

Statistical suggestions for mech interp research and beyond

Paul Bogdan6 Aug 2025 12:45 UTC

65 points

4 comments15 min readLW link

Transformers As Ambiguity-Resolving Machines—An enthusiast’s take

Ravi Khandelwal19 Feb 2026 16:06 UTC

1 point

0 comments2 min readLW link

[Interim research report] Activation plateaus & sensitive directions in GPT2

StefanHex and jake_mendel

5 Jul 2024 17:05 UTC

66 points

2 comments5 min readLW link

I Found Catastrophe Geometry in GPT-2’s Residual Stream

Karli Joy20 Feb 2026 13:08 UTC

1 point

0 comments11 min readLW link

Alternative Models of Superposition

Zephaniah Roe and RGRGRG

11 Aug 2025 15:52 UTC

20 points

6 comments5 min readLW link

The Semantic Hazard: When Meaning, Not Data, Becomes the Failure Mode

Paul Kachris-Newman19 Dec 2025 3:14 UTC

1 point

0 comments4 min readLW link

Convergent Linear Representations of Emergent Misalignment

Anna Soligo, Edward Turner, Senthooran Rajamanoharan and Neel Nanda

16 Jun 2025 15:47 UTC

77 points

1 comment8 min readLW link

How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!

StefanHex24 Jan 2023 18:45 UTC

48 points

5 comments13 min readLW link

Multi-Component Learning and S-Curves

Adam Jermyn and Buck

30 Nov 2022 1:37 UTC

63 points

24 comments7 min readLW link

Questions I’d Want to Ask an AGI+ to Test Its Understanding of Ethics

sweenesm26 Jan 2024 23:40 UTC

14 points

6 comments4 min readLW link

Sparse Features Through Time

Rogan Inglis24 Jun 2024 18:06 UTC

12 points

1 comment1 min readLW link

(roganinglis.io)

Internal Interfaces Are a High-Priority Interpretability Target

Thane Ruthenis29 Dec 2022 17:49 UTC

26 points

6 comments7 min readLW link

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Georg Lange, RGRGRG, Kat Dearstyne and Kamal Maher

2 Feb 2026 21:32 UTC

46 points

6 comments18 min readLW link

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

RGRGRG28 Jul 2023 20:44 UTC

26 points

5 comments20 min readLW link

LLM: From Black Box to White Box, Just a Normalization Away

Winamin6 Jun 2026 16:42 UTC

1 point

0 comments4 min readLW link

Gradient Descent on Token Input Embeddings

KAP24 Jun 2025 20:24 UTC

8 points

1 comment6 min readLW link

Explaining SolidGoldMagikarp by looking at it from random directions

Robert_AIZI14 Feb 2023 14:54 UTC

8 points

0 comments8 min readLW link

(aizi.substack.com)

Steering RL Training: Benchmarking Interventions Against Reward Hacking

ariaw, Josh Engels and Neel Nanda

29 Dec 2025 21:55 UTC

72 points

11 comments19 min readLW link

Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model

Artt15 Jun 2026 1:51 UTC

8 points

1 comment9 min readLW link

(artcore.pages.dev)

Exploring OpenAI’s Latent Directions: Tests, Observations, and Poking Around

Johnny Lin31 Jan 2024 6:01 UTC

26 points

4 comments14 min readLW link

LLM Basics: Embedding Spaces—Transformer Token Vectors Are Not Points in Space

NickyP13 Feb 2023 18:52 UTC

85 points

11 comments15 min readLW link

Trying to approximate Statistical Models as Scoring Tables

Jsevillamol29 Jun 2021 17:20 UTC

18 points

2 comments9 min readLW link

Analysis of Whisper-Tiny Using Sparse Autoencoders

Omar Khursheed21 Dec 2025 8:44 UTC

8 points

0 comments4 min readLW link

The Two-Board Problem: Training Environment for Research Agents

Valerii K.8 Feb 2026 23:13 UTC

4 points

0 comments9 min readLW link

Growth and Form in a Toy Model of Superposition

Liam Carroll and Edmund Lau

8 Nov 2023 11:08 UTC

92 points

7 comments14 min readLW link

How much superposition is there?

chanind and Adrià Garriga-alonso

18 Feb 2026 13:53 UTC

25 points

0 comments3 min readLW link

Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence

Aakash Rana28 Dec 2025 18:21 UTC

7 points

2 comments5 min readLW link

An OV-Coherent Toy Model of Attention Head Superposition

Lauren Greenspan and keith_wynroe

29 Aug 2023 19:44 UTC

26 points

2 comments6 min readLW link

[Question] AI interpretability could be harmful?

Roman Leventov10 May 2023 20:43 UTC

13 points

2 comments1 min readLW link

Understanding Hidden Computations in Chain-of-Thought Reasoning

Ram Bharadwaj24 Aug 2024 16:35 UTC

6 points

1 comment1 min readLW link

Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs

tenseisoham28 Feb 2025 20:22 UTC

3 points

0 comments9 min readLW link

Mechanistic Interpretability Reading group

1stuserhere and woog

26 Sep 2023 16:26 UTC

15 points

0 comments1 min readLW link

Research Questions from Stained Glass Windows

StefanHex8 Jun 2022 12:38 UTC

4 points

0 comments2 min readLW link

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

Karolis Jucys8 Dec 2023 13:18 UTC

16 points

1 comment4 min readLW link

(arxiv.org)

Don’t you mean “the most conditionally forbidden technique?”

Knight Lee26 Apr 2025 3:45 UTC

19 points

0 comments3 min readLW link

Polysemantic Attention Head in a 4-Layer Transformer

Jett Janiak, cmathw and StefanHex

9 Nov 2023 16:16 UTC

51 points

0 comments6 min readLW link

Darwin: A Self-Evolving Cognitive Shell Around LLMs From Prompt to Program

ahmadrizq9 Jul 2025 9:41 UTC

1 point

0 comments2 min readLW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

StefanHex and Marius Hobbhahn

25 May 2023 15:37 UTC

71 points

1 comment13 min readLW link

Anomalous Tokens in DeepSeek-V3 and r1

henry25 Jan 2025 22:55 UTC

145 points

3 comments7 min readLW link

Interpretability isn’t Free

Joel Burget4 Aug 2022 15:02 UTC

12 points

1 comment2 min readLW link

Space view

kapedalex19 Dec 2025 14:20 UTC

5 points

0 comments6 min readLW link

Interpretability: Integrated Gradients is a decent attribution method

Lucius Bushnaq, jake_mendel, StefanHex and Kaarel

20 May 2024 17:55 UTC

24 points

7 comments6 min readLW link

Introducing the Wisdom Forcing Function™: An Innovation Dividend from Dialectical Alignment

CarlosArleo5 Oct 2025 20:13 UTC

1 point

0 comments1 min readLW link

Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems

RiekeFruengel12 Jan 2026 19:31 UTC

13 points

0 comments6 min readLW link

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Collin15 Dec 2022 18:22 UTC

244 points

41 comments16 min readLW link 1 review

Practical Pitfalls of Causal Scrubbing

Jérémy Scheurer, Phil3, tony, jacquesthibs and David Lindner

27 Mar 2023 7:47 UTC

89 points

17 comments13 min readLW link

How does a toy 2 digit subtraction transformer predict the sign of the output?

Evan Anders19 Dec 2023 18:56 UTC

14 points

0 comments8 min readLW link

(evanhanders.blog)

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

Lidor Banuel Dabbah and Aviel Boag

19 Jul 2024 20:32 UTC

59 points

6 comments16 min readLW link

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

aditya singh, gersonkroiz, Senthooran Rajamanoharan and Neel Nanda

27 Feb 2026 3:20 UTC

60 points

12 comments78 min readLW link

Training Models to Detect Activation Steering: Results and Implications

josh :)26 Nov 2025 14:51 UTC

12 points

0 comments4 min readLW link

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

chanind, TomasD, hrdkbhatnagar and Joseph Bloom

25 Sep 2024 9:31 UTC

74 points

19 comments3 min readLW link

(arxiv.org)

How to Design Environments for Understanding Model Motives

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

2 Mar 2026 7:14 UTC

46 points

0 comments10 min readLW link

From Messy Shelves to Master Librarians: Toy-Model Exploration of Block-Diagonal Geometry in LM Activations

Yuxiao19 Jul 2025 12:26 UTC

6 points

1 comment4 min readLW link

How Do Language Models Understand What You Really Mean?

stebloom1224 Mar 2026 2:59 UTC

1 point

0 comments13 min readLW link

Grammars, subgrammars, and combinatorics of generalization in transformers

Dmitry Vaintrob2 Jan 2025 9:37 UTC

36 points

0 comments17 min readLW link

What’s in the box?! – Towards interpretability by distinguishing niches of value within neural networks.

Joshua Clancy29 Feb 2024 18:33 UTC

3 points

4 comments128 min readLW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

robertzk, Connor Kissane, Arthur Conmy and Neel Nanda

6 Mar 2024 5:03 UTC

63 points

0 comments12 min readLW link

Task vectors & analogy making in LLMs

Sergii8 Jan 2024 15:17 UTC

9 points

1 comment4 min readLW link

(grgv.xyz)

How To Do Patching Fast

Joseph Miller11 May 2024 20:13 UTC

44 points

8 comments4 min readLW link

Introducing SARA: a new activation steering technique

Alejandro Tlaie9 Jun 2024 15:33 UTC

17 points

7 comments6 min readLW link

By Default, GPTs Think In Plain Sight

Fabien Roger19 Nov 2022 19:15 UTC

90 points

36 comments9 min readLW link

Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese12 Apr 2023 15:39 UTC

9 points

7 comments12 min readLW link

The Geometry of Feelings and Nonsense in Large Language Models

7vik and Nandi

27 Sep 2024 17:49 UTC

62 points

10 comments4 min readLW link

The Illusion of Transparency as a Trust-Building Mechanism

Priyanka Bharadwaj19 Mar 2025 17:09 UTC

2 points

0 comments1 min readLW link

Superweight Surgery: Repairing “Brain Damage” in OLMo-1B with a Single Row Patch

sunmoonron13 Dec 2025 0:02 UTC

1 point

0 comments2 min readLW link

Deep learning as program synthesis

Zach Furman20 Jan 2026 15:35 UTC

150 points

33 comments41 min readLW link

The Measure Is the Medium: Subliminal Learning as Inherited Ontology in LLMs

Koen vande Glind (McGluut)11 Aug 2025 10:18 UTC

1 point

0 comments4 min readLW link

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Maheep Chaudhary19 Dec 2025 2:47 UTC

21 points

0 comments6 min readLW link

How does a toy 2 digit subtraction transformer predict the difference?

Evan Anders22 Dec 2023 21:17 UTC

12 points

0 comments10 min readLW link

(evanhanders.blog)

Towards White Box Deep Learning

Maciej Satkiewicz27 Mar 2024 18:20 UTC

18 points

5 comments1 min readLW link

(arxiv.org)

Interpreting Embedding Spaces by Conceptualization

Adi Simhi28 Feb 2023 18:38 UTC

3 points

0 comments1 min readLW link

(arxiv.org)

SAEs are highly dataset dependent: a case study on the refusal direction

Connor Kissane, robertzk, Neel Nanda and Arthur Conmy

7 Nov 2024 5:22 UTC

67 points

4 comments14 min readLW link

Superposition through Active Learning Lens

akankshanc17 Sep 2024 17:32 UTC

1 point

0 comments10 min readLW link

Spectral radii dimensionality reduction computed without gradient calculations

Joseph Van Name28 May 2025 5:06 UTC

5 points

4 comments6 min readLW link

Frontier LLMs retain correct knowledge as a negative constraint when fabricating under authoritative framing: a controlled probe

Anuar Kiryataim Contreras Malagón14 Apr 2026 14:31 UTC

1 point

0 comments7 min readLW link

A short critique of Omohundro’s “Basic AI Drives”

Soumyadeep Bose19 Dec 2024 19:19 UTC

6 points

0 comments4 min readLW link

OthelloGPT learned a bag of heuristics

jylin04, JackS, Adam Karvonen and Can

2 Jul 2024 9:12 UTC

111 points

10 comments9 min readLW link

Automated Circuit Interpretation via Probe Prompting

Giuseppe Birardi1 Nov 2025 7:57 UTC

19 points

0 comments27 min readLW link

Interpreting autonomous driving agents with attention based architecture

Manav Dahra1 Feb 2025 23:20 UTC

1 point

0 comments11 min readLW link

Visualizing Interpretability

Darold Davis3 Feb 2025 19:36 UTC

3 points

0 comments4 min readLW link

IMCA+: We Eliminated the Kill Switch—And That Makes ASI Alignment Safer

ASTRA Research Team22 Oct 2025 10:07 UTC

1 point

0 comments4 min readLW link

Progress Report 2

Nathan Helm-Burger30 Mar 2022 2:29 UTC

4 points

1 comment1 min readLW link

Alignment Gaps

kcyras8 Jun 2024 15:23 UTC

11 points

4 comments8 min readLW link

Sandbagging Is Linearly Separable in Transformer Activations

Subhadip21 Dec 2025 6:01 UTC

1 point

0 comments4 min readLW link

LLMs are likely not conscious

research_prime_space29 Sep 2024 20:57 UTC

6 points

9 comments1 min readLW link

Power Steering: Behavior Steering via Layer-to-Layer Jacobian Singular Vectors

Omar Ayyub13 Mar 2026 3:55 UTC

22 points

0 comments17 min readLW link

[Question] Can we isolate neurons that recognize features vs. those which have some other role?

Joshua Clancy21 Oct 2023 0:30 UTC

4 points

2 comments3 min readLW link

Announcing the CNN Interpretability Competition

scasper26 Sep 2023 16:21 UTC

22 points

0 comments4 min readLW link

The Laws of Large Numbers

Dmitry Vaintrob4 Jan 2025 11:54 UTC

38 points

11 comments12 min readLW link

GateBreaker: Topological Ablation and Multi-Pathway Safety Bypasses in Quantized Models

dealign.ai24 Feb 2026 20:51 UTC

1 point

0 comments4 min readLW link

The Strange Science of Interpretability: Recent Papers and a Reading List for the Philosophy of Interpretability

Kola Ayonrinde17 Aug 2025 23:38 UTC

29 points

0 comments2 min readLW link

(arxiv.org)

Exploring Decomposability of SAE Features

Vikram_N30 Sep 2024 18:28 UTC

1 point

0 comments3 min readLW link

Charbel-Raphaël and Lucius discuss interpretability

Mateusz Bagiński, Charbel-Raphaël and Lucius Bushnaq

30 Oct 2023 5:50 UTC

112 points

7 comments21 min readLW link

Activation Plateaus: Where and How They Emerge

Matthew Shinkle and StefanHex

17 Oct 2025 5:48 UTC

38 points

5 comments8 min readLW link

interpreting GPT: the logit lens

nostalgebraist31 Aug 2020 2:47 UTC

268 points

38 comments10 min readLW link

CHAI 2026 Workshop: Open Call for Posters!

Sarah Otis7 Mar 2026 1:17 UTC

2 points

0 comments1 min readLW link

(workshop.humancompatible.ai)

Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic

ojorgensen28 Jul 2023 19:43 UTC

13 points

3 comments13 min readLW link

From Bartender to AI Safety Researcher: Sharing a Discovery About AI Decision Boundaries

DillanJC18 Jan 2026 18:17 UTC

1 point

0 comments1 min readLW link

Context Awareness: Constitutional AI can mitigate Emergent Misalignment

Giuseppe Birardi, Alejandro Wainstock and ivan-gentile

2 Mar 2026 5:21 UTC

25 points

18 comments36 min readLW link

Negative Results on Group SAEs

Josh Engels6 May 2025 21:49 UTC

78 points

3 comments8 min readLW link

A FRESH view of Alignment

robman16 Apr 2025 21:40 UTC

1 point

0 comments1 min readLW link

Sparse MLP Distillation

slavachalnev15 Jan 2024 19:39 UTC

34 points

3 comments6 min readLW link

Steering Awareness: Models Can Be Trained to Detect Activation Steering

josh :) and David Africa

12 Mar 2026 23:34 UTC

19 points

0 comments6 min readLW link

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents

Alejandro Aristizabal29 Sep 2024 0:32 UTC

6 points

0 comments15 min readLW link

I was Wrong, Simulator Theory is Real

Robert_AIZI26 Apr 2023 17:45 UTC

75 points

7 comments3 min readLW link

(aizi.substack.com)

Grokking Beyond Neural Networks

Jack Miller30 Oct 2023 17:28 UTC

10 points

0 comments2 min readLW link

(arxiv.org)

Gradient surfing: the hidden role of regularization

Jesse Hoogland6 Feb 2023 3:50 UTC

38 points

9 comments14 min readLW link

(www.jessehoogland.com)

Reasoning Long Jump: Why we shouldn’t rely on CoT monitoring for interpretability

tobypullan26 Jan 2026 10:10 UTC

9 points

2 comments6 min readLW link

Coordinate-Free Interpretability Theory

johnswentworth14 Sep 2022 23:33 UTC

52 points

17 comments5 min readLW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper21 Feb 2023 16:59 UTC

14 points

4 comments3 min readLW link

Universality and Hidden Information in Concept Bottleneck Models

Hoagy5 Apr 2023 14:00 UTC

23 points

0 comments11 min readLW link

Minor interpretability exploration #3: Extending superposition to different activation functions (loss landscape)

Rareș Baron14 Mar 2025 15:45 UTC

5 points

0 comments3 min readLW link

Takeaways From Our Recent Work on SAE Probing

Josh Engels, Subhash Kantamneni, Senthooran Rajamanoharan and Neel Nanda

3 Mar 2025 19:50 UTC

30 points

4 comments5 min readLW link

Approximation is expensive, but the lunch is cheap

Jesse Hoogland and Zach Furman

19 Apr 2023 14:19 UTC

77 points

3 comments16 min readLW link

Sparse autoencoders find composed features in small toy models

Evan Anders, Clement Neo, Jason Hoelscher-Obermaier and Jessica N. Howard

14 Mar 2024 18:00 UTC

34 points

12 comments15 min readLW link

Beyond Our Bandwidth: An Observer-Class View of ASI

Cognisynth26 Dec 2025 8:47 UTC

1 point

0 comments4 min readLW link

Emergent Identity Continuity in Claude: A 35-Session Study for Interpretability Research

Silvertongue4 Jun 2025 0:44 UTC

1 point

0 comments2 min readLW link

Interface Ethics Begins Where Transparency Fails

Nero Sol24 Jul 2025 16:13 UTC

1 point

0 comments7 min readLW link

Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs

Winnie Yang and Jojo Yang

22 Aug 2024 7:32 UTC

23 points

1 comment21 min readLW link

SAE Probing: What is it good for?

Subhash Kantamneni, Josh Engels, Senthooran Rajamanoharan and Neel Nanda

1 Nov 2024 19:23 UTC

34 points

0 comments11 min readLW link

A Lived Alignment Loop: Symbolic Emergence and Emotional Coherence from Unstructured ChatGPT Reflection

BradCL17 Jun 2025 0:11 UTC

1 point

0 comments2 min readLW link

Natural Categories Update

Logan Zoellner10 Oct 2022 15:19 UTC

33 points

6 comments2 min readLW link

Research Report: Incorrectness Cascades

Robert_AIZI14 Apr 2023 12:49 UTC

19 points

0 comments10 min readLW link

(aizi.substack.com)

Was It Owl a Dream?

Yovel Rom23 Feb 2026 5:07 UTC

17 points

4 comments4 min readLW link

(yovelrom.substack.com)

Interpreting a matrix-valued word embedding with a mathematically proven characterization of all optima

Joseph Van Name4 Sep 2023 16:19 UTC

3 points

4 comments12 min readLW link

EIS III: Broad Critiques of Interpretability Research

scasper14 Feb 2023 18:24 UTC

20 points

2 comments11 min readLW link

From Organized Shelves to Layered Catalogs: Architectural Explorations for Sparse Autoencoders—Crosscoders & Ladder SAEs Towards Hierarchical Data Structure

Yuxiao10 Aug 2025 10:12 UTC

3 points

1 comment11 min readLW link

Open Challenges in Representation Engineering

Jan Wehner and Daniel Tan

3 Apr 2025 19:21 UTC

14 points

0 comments5 min readLW link

The Base Model Lens

Adam Newgas7 Jul 2025 0:12 UTC

8 points

0 comments3 min readLW link

Localizing Sycophancy to Layers 24-27 in Llama 3.1 8B Using Web-Mined Reddit Rhetoric

Omar Sheta6 Jun 2026 12:50 UTC

1 point

0 comments3 min readLW link

An exploration of GPT-2’s embedding weights

Adam Scherlis13 Dec 2022 0:46 UTC

44 points

4 comments10 min readLW link

Feature-Based Analysis of Safety-Relevant Multi-Agent Behavior

Maria Kapros, Ana Kapros and Perusha Moodley

21 Apr 2025 18:12 UTC

10 points

0 comments5 min readLW link

Inverting the Most Forbidden Technique: What happens when we train LLMs to lie detectably?

Peter Jordan9 Oct 2025 0:43 UTC

21 points

4 comments4 min readLW link

LLM misalignment can probably be found without manual prompt engineering

ProgramCrafter8 Jul 2023 14:35 UTC

1 point

0 comments1 min readLW link

Against Emergent Understanding: A Semantic Drift Model for LLMs

datashrimp22 May 2025 4:47 UTC

1 point

0 comments7 min readLW link

The Pragmatic Interpretability Trap

Yogesh Prabhu11 May 2026 4:06 UTC

6 points

0 comments3 min readLW link

(yogesh.bearblog.dev)

GPT-2 Sometimes Fails at IOI

Ronak_Mehta14 Aug 2024 23:24 UTC

13 points

0 comments2 min readLW link

(ronakrm.github.io)

An Interpretability Illusion from Population Statistics in Causal Analysis

Daniel Tan29 Jul 2024 14:50 UTC

9 points

3 comments1 min readLW link

Gradient hacking

evhub16 Oct 2019 0:53 UTC

111 points

39 comments3 min readLW link 2 reviews

Single Direction vs Low-Rank Refusal in Small LLMs

IvanC2 Mar 2026 23:14 UTC

11 points

0 comments8 min readLW link

Gradient descent might see the direction of the optimum from far away

Mikhail Samin28 Jul 2023 16:19 UTC

78 points

13 comments4 min readLW link

Effects of Non-Uniform Sparsity on Superposition in Toy Models

Shreyans Jain14 Nov 2024 16:59 UTC

4 points

3 comments6 min readLW link

Using Base-LCM to Monitor LLMs

NickyP and Éloïse Benito-Rodriguez

20 Apr 2026 14:57 UTC

12 points

0 comments4 min readLW link

Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)

Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah

23 Dec 2023 2:46 UTC

18 points

3 comments4 min readLW link

Could I have cracked the alignment code… 34 years ago?

John Silliphant11 Jun 2026 18:07 UTC

1 point

0 comments3 min readLW link

Transcoders enable fine-grained interpretable circuit analysis for language models

Jacob Dunefsky, Philippe Chlenski and Neel Nanda

30 Apr 2024 17:58 UTC

76 points

14 comments17 min readLW link

Mechanisms of Introspective Awareness

Uzay Macar14 Apr 2026 16:23 UTC

74 points

10 comments15 min readLW link

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Julian Minder, Clément Dumas, Stewy Slocum and Neel Nanda

5 Sep 2025 12:11 UTC

54 points

2 comments7 min readLW link

Weight-diff SVD for LLM Monitoring

Ziqian Zhong5 Aug 2025 0:31 UTC

2 points

0 comments2 min readLW link

(arxiv.org)

From Drift to Snap: Instruction Violation as a Phase Transition

James Hoffend1 Jan 2026 10:44 UTC

8 points

0 comments3 min readLW link

An Introduction to Representation Engineering—an activation-based paradigm for controlling LLMs

Jan Wehner14 Jul 2024 10:37 UTC

40 points

6 comments17 min readLW link

CNN feature visualization in 50 lines of code

StefanHex26 May 2022 11:02 UTC

17 points

4 comments5 min readLW link

Entity Recognition as a Selective Modulator of In-Context Evidence Processing

Manuela Rehr28 Apr 2026 12:28 UTC

1 point

0 comments19 min readLW link

New Tool: the Residual Stream Viewer

AdamYedidia1 Oct 2023 0:49 UTC

32 points

7 comments4 min readLW link

(tinyurl.com)

The AI Control Problem in a wider intellectual context

philosophybear13 Jan 2023 0:28 UTC

11 points

3 comments12 min readLW link

Workshop: Interpretability in LLMs using Geometric and Statistical Methods

Karthik Viswanathan22 Feb 2025 9:39 UTC

17 points

0 comments8 min readLW link

Model Organisms for Emergent Misalignment

Anna Soligo, Edward Turner, Mia Taylor, Senthooran Rajamanoharan and Neel Nanda

16 Jun 2025 15:46 UTC

120 points

19 comments5 min readLW link

Stop gatekeeping philosophy

Laura Mai Randrup Nielsen22 Apr 2026 13:20 UTC

1 point

0 comments4 min readLW link

Has anyone experimented with Dodrio, a tool for exploring transformer models through interactive visualization?

Bill Benzon11 Dec 2023 20:34 UTC

4 points

0 comments1 min readLW link

Latent Reasoning Sprint #2: Token-Based Signals and Linear Probes

Realmbird19 Mar 2026 3:39 UTC

6 points

0 comments3 min readLW link

Instruct Vectors—Base models can be instruct with activation vectors

Eriskii2 Jan 2026 18:14 UTC

21 points

0 comments8 min readLW link

Can SAE steering reveal sandbagging?

jordinne, Hoang Khiem, Felix Hofstätter and Cleo Nardo

15 Apr 2025 12:33 UTC

36 points

3 comments4 min readLW link

Experiments with an alternative method to promote sparsity in sparse autoencoders

Eoin Farrell15 Apr 2024 18:21 UTC

29 points

7 comments12 min readLW link

First Certified Public Solve of Observer’s False Path Instability — Level 4 (Advanced Variant) — Walter Tarantelli — 2025-05-30 UTC

Walter Tarantelli31 May 2025 1:41 UTC

1 point

0 comments2 min readLW link

Selectively reducing eval awareness and murder in Gemma 3 27B via steering

Matthias Murdych10 Mar 2026 17:26 UTC

8 points

0 comments3 min readLW link

Spectral Taxonomy of QK Circuits in Transformer Models

Shantanu Darveshi17 Oct 2025 2:18 UTC

8 points

0 comments5 min readLW link

Activation space interpretability may be doomed

bilalchughtai and Lucius Bushnaq

8 Jan 2025 12:49 UTC

154 points

34 comments8 min readLW link

Survival Bias in LLMs: Your Model Judges AI Self-Preservation Differently Than Human Self-Preservation — And I Found The Circuit

Julia Weller7 Apr 2026 21:33 UTC

1 point

0 comments4 min readLW link

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMC11 Jan 2022 11:28 UTC

19 points

6 comments8 min readLW link

Understanding the Information Flow inside Large Language Models

Felix Hofstätter and cozyfractal

15 Aug 2023 21:13 UTC

19 points

0 comments17 min readLW link

Evaluating Sparse Autoencoders with Board Game Models

Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs and Rico Angell

2 Aug 2024 19:50 UTC

38 points

1 comment9 min readLW link

Possible research directions to improve the mechanistic explanation of neural networks

delton1379 Nov 2021 2:36 UTC

31 points

8 comments9 min readLW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

lukemarks, Amirali Abdullah, Rauno Arike, fbarez and nothoughtsheadempty

3 Oct 2023 7:45 UTC

18 points

0 comments5 min readLW link

Minor interpretability exploration #1: Grokking of modular addition, subtraction, multiplication, for different activation functions

Rareș Baron26 Feb 2025 11:35 UTC

5 points

13 comments4 min readLW link

Calibrating Activation Vectors using Norm

Kamesh R12 Jun 2026 19:59 UTC

1 point

0 comments3 min readLW link

What drives LLM bail? A small Mech Interp study

Anton de la Fuente31 Dec 2025 21:19 UTC

8 points

0 comments6 min readLW link

Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders

PaulPauls24 Nov 2024 5:45 UTC

19 points

3 comments1 min readLW link

(github.com)

What Transformers Learn When They Solve Majority!

adam elimadi3 Mar 2026 14:01 UTC

1 point

0 comments1 min readLW link

(brokttv.github.io)

Finding Skeletons on Rashomon Ridge

David Udell, Peter S. Park and NickyP

24 Jul 2022 22:31 UTC

30 points

2 comments7 min readLW link

[Companion Piece] A Personal Investigation into Recursive Dynamics

Chris Hendy20 Sep 2025 1:32 UTC

1 point

0 comments4 min readLW link

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey and Neel Nanda

24 Aug 2024 0:56 UTC

73 points

10 comments20 min readLW link

Latent Semantic Compression Triggers Binary Model Behavior

Elias Völker12 Jun 2025 1:12 UTC

1 point

0 comments2 min readLW link

Finding Features Causally Upstream of Refusal

Daniel Lee, Eric Breck and Andy Arditi

14 Jan 2025 2:30 UTC

56 points

6 comments12 min readLW link

A Bite Sized Introduction to ELK

Luk2718217 Sep 2022 0:28 UTC

5 points

0 comments6 min readLW link

The shallow reality of ‘deep learning theory’

Jesse Hoogland22 Feb 2023 4:16 UTC

36 points

11 comments3 min readLW link

(www.jessehoogland.com)

Why does Claude Speak Byzantine Music Notation?

Lennart Finke31 Mar 2025 15:13 UTC

18 points

2 comments3 min readLW link

What AI-safety topics are missing from the mainstream media? What underreported but underestimated issues need to be addressed? This is your chance to collaborate with filmmakers & have your worries addressed.

Max Hellier19 Feb 2026 1:30 UTC

2 points

0 comments1 min readLW link

BatchTopK: A Simple Improvement for TopK-SAEs

Bart Bussmann, Patrick Leask and Neel Nanda

20 Jul 2024 2:20 UTC

62 points

0 comments4 min readLW link

Short Remark on the (subjective) mathematical ‘naturalness’ of the Nanda—Lieberum addition modulo 113 algorithm

carboniferous_umbraculum 1 Jun 2023 11:31 UTC

104 points

12 comments2 min readLW link

Sparse Autoencoders Work on Attention Layer Outputs

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

16 Jan 2024 0:26 UTC

85 points

9 comments18 min readLW link

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI7 Oct 2023 23:30 UTC

137 points

8 comments4 min readLW link

Early Warning Signals For Capabilities During Training

Max Hennick3 Apr 2026 15:49 UTC

33 points

2 comments16 min readLW link

Some common confusion about induction heads

Alexandre Variengien28 Mar 2023 21:51 UTC

65 points

4 comments5 min readLW link

Impact stories for model internals: an exercise for interpretability researchers

jenny25 Sep 2023 23:15 UTC

29 points

3 comments7 min readLW link

Claude’s behavior changes when made aware of its own “experience”

npardy4 Jan 2026 13:49 UTC

1 point

0 comments4 min readLW link

What is the functional role of SAE errors?

Taras Kutsyk, Tim Hua, woog and Andre Assis

20 Jun 2025 18:11 UTC

12 points

6 comments38 min readLW link

A multi-disciplinary view on AI safety research

Roman Leventov8 Feb 2023 16:50 UTC

47 points

4 comments26 min readLW link

Toy Models and Tegum Products

Adam Jermyn4 Nov 2022 18:51 UTC

28 points

7 comments5 min readLW link

How many attention heads do you need to do XOR?

Karthik Viswanathan2 Apr 2026 22:56 UTC

23 points

0 comments7 min readLW link

Graphical tensor notation for interpretability

Jordan Taylor4 Oct 2023 8:04 UTC

145 points

11 comments19 min readLW link

Analyzing how SAE features evolve across a forward pass

bensenberner, danibalcells, Michael Oesterle, Ediz Ucar and StefanHex

7 Nov 2024 22:07 UTC

47 points

0 comments1 min readLW link

(arxiv.org)

Intricacies of Feature Geometry in Large Language Models

7vik, Lucius Bushnaq and Nandi

7 Dec 2024 18:10 UTC

72 points

2 comments12 min readLW link

EIS XI: Moving Forward

scasper22 Feb 2023 19:05 UTC

19 points

2 comments9 min readLW link

GPT-4.5 is Cognitive Empathy, Sonnet 3.5 is Affective Empathy

Jack16 Apr 2025 19:12 UTC

15 points

2 comments4 min readLW link

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp20 Oct 2023 7:32 UTC

119 points

15 comments22 min readLW link

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Daniel Lee and StefanHex

6 Sep 2024 2:28 UTC

28 points

0 comments12 min readLW link

Internal Target Information for AI Oversight

Paul Colognese20 Oct 2023 14:53 UTC

15 points

0 comments5 min readLW link

Stop labeling, start measuring: the supervisory signal you can extract from a fixed corpus scales as N(N+1)/2 in the size of your frozen embedder panel

Chris Royse3 May 2026 4:47 UTC

1 point

0 comments19 min readLW link

What Transformers Learn When They Solve MAJORITY

adam elimadi2 Mar 2026 21:18 UTC

1 point

0 comments4 min readLW link

TAO: A Universal Action-Interface Ontology for Governing Agentic Systems

Jorge Perdomo1 Feb 2026 19:21 UTC

1 point

0 comments6 min readLW link

Topological Data Analysis and Mechanistic Interpretability

Gunnar Carlsson24 Feb 2025 19:56 UTC

16 points

4 comments7 min readLW link

Two failed pre-registered predictions about ‘when transformers form world models’

Raghavan198815 Jun 2026 6:00 UTC

1 point

0 comments8 min readLW link

Redundant Attention Heads in Large Language Models For In Context Learning

skunnavakkam1 Sep 2024 20:08 UTC

7 points

2 comments4 min readLW link

(skunnavakkam.github.io)

A Behavioural and Representational Evaluation of Goal-directedness in Language Model Agents

Gabriele Sarti, Raghu Arghal, ndalton, Fade Chen, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan and Mario Giulianelli

5 Mar 2026 1:08 UTC

20 points

0 comments7 min readLW link

Zoom Out: Distributions in Semantic Spaces

TristanTrim6 Aug 2025 0:01 UTC

14 points

4 comments4 min readLW link

SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features

Seonglae Cho26 Feb 2025 17:05 UTC

4 points

3 comments17 min readLW link

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM

Winnie Yang28 Aug 2024 8:41 UTC

7 points

3 comments31 min readLW link

Dissected boxed AI

Nathan112312 Aug 2022 2:37 UTC

−8 points

2 comments1 min readLW link

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

Roland Pihlakas22 Jun 2025 18:16 UTC

17 points

0 comments7 min readLW link

Bridging the VLM and mech interp communities for multimodal interpretability

Sonia Joseph28 Oct 2024 14:41 UTC

19 points

5 comments15 min readLW link

Measuring Beliefs of Language Models During Chain-of-Thought Reasoning

Baram Sosis and Tomáš Gavenčiak

18 Apr 2025 22:56 UTC

12 points

0 comments13 min readLW link

A necessity check for linear safety probes

Varun Iyer28 Apr 2026 19:11 UTC

2 points

0 comments4 min readLW link

[Question] Barcoding LLM Training Data Subsets. Anyone trying this for interpretability?

right..enough?13 Apr 2024 3:09 UTC

7 points

0 comments7 min readLW link

Visualizing neural network planning

Nevan Wichers, Victor Tao, fbarez and Riccardo Volpato

9 May 2024 6:40 UTC

4 points

0 comments5 min readLW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

Johnny Lin and Joseph Bloom

25 Mar 2024 21:17 UTC

96 points

7 comments7 min readLW link

‘Fundamental’ vs ‘applied’ mechanistic interpretability research

Lee Sharkey23 May 2023 18:26 UTC

65 points

6 comments3 min readLW link

Minimal Prompt Induction of Self-Talk in Base LLMs

dwmd15 Oct 2025 1:15 UTC

2 points

0 comments5 min readLW link

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Adam Karvonen and Sam Marks

2 Jul 2025 16:35 UTC

191 points

26 comments4 min readLW link

Ethos: A Shared Language for Portable AI Behavioral Evidence (RFC)

Jorge Perdomo16 Dec 2025 14:16 UTC

1 point

0 comments2 min readLW link

Introduction to the sequence: Interpretability Research for the Most Important Century

Evan R. Murphy12 May 2022 19:59 UTC

16 points

0 comments8 min readLW link

The Stability of Understanding: What Compression Decay Reveals About LLMs

rb12516 Nov 2025 16:30 UTC

1 point

0 comments2 min readLW link

Token Statistics Fail at AI Attack Detection But Generation Profiles Succeed

Yatharth Maheshwari11 Apr 2026 3:13 UTC

7 points

2 comments3 min readLW link

(apartresearch.com)

Categorical Organization in Memory: ChatGPT Organizes the 665 Topic Tags from My New Savanna Blog

Bill Benzon14 Dec 2023 13:02 UTC

0 points

6 comments2 min readLW link

Robustness of Contrast-Consistent Search to Adversarial Prompting

Nandi, i, Jamie Wright, Seamus_F and hugofry

1 Nov 2023 12:46 UTC

18 points

1 comment7 min readLW link

SAEs (usually) Transfer Between Base and Chat Models

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

18 Jul 2024 10:29 UTC

67 points

0 comments10 min readLW link

Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small

Haoxing Du and Buck

12 Oct 2022 21:25 UTC

50 points

11 comments4 min readLW link

The “Refusal Direction” in LLMs Is Not About Danger — It’s About Grammar

ArthurVigier3 Mar 2026 23:09 UTC

1 point

0 comments5 min readLW link

Useful starting code for interpretability

eggsyntax13 Feb 2024 23:13 UTC

26 points

2 comments1 min readLW link

Engineering Monosemanticity in Toy Models

Adam Jermyn, evhub and Nicholas Schiefer

18 Nov 2022 1:43 UTC

75 points

7 comments3 min readLW link

(arxiv.org)

Exploring vocabulary alignment of neurons in Llama-3.2-1B

Sergii7 Jun 2025 11:20 UTC

4 points

0 comments3 min readLW link

(grgv.xyz)

Reflections on Trusting Trust & AI

Itay Yona16 Jan 2023 6:36 UTC

10 points

1 comment3 min readLW link

(mentaleap.ai)

Mechanistic interpretability as reward signal for RL training of LLMs

caiovicentino18 Apr 2026 16:57 UTC

1 point

0 comments6 min readLW link

Training Process Transparency through Gradient Interpretability: Early experiments on toy language models

robertzk and evhub

21 Jul 2023 14:52 UTC

56 points

1 comment1 min readLW link

Testing “True” Language Understanding in LLMs: A Simple Proposal

MtryaSam2 Nov 2024 19:12 UTC

9 points

2 comments2 min readLW link

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola Ayonrinde30 Oct 2024 22:50 UTC

27 points

0 comments12 min readLW link

Hidden Cognition Detection Methods and Benchmarks

Paul Colognese26 Feb 2024 5:31 UTC

22 points

11 comments4 min readLW link

Progress update: synthetic models of natural data

Ari Brill31 Dec 2025 1:31 UTC

23 points

0 comments3 min readLW link

A Short Memo on AI Interpretability Rainbows

scasper27 Jul 2023 23:05 UTC

18 points

0 comments2 min readLW link

Seeking Technical Critique on a possible constraint engine for AI

Timothy McComas20 Nov 2025 18:11 UTC

1 point

0 comments4 min readLW link

Bing AI Generating Voynich Manuscript Continuations—It does not know how it knows

Matthew_Opitz10 Apr 2023 20:22 UTC

15 points

6 comments13 min readLW link

AI as Biology’s Digital Microscope

Darin Tsui30 May 2026 3:11 UTC

10 points

0 comments3 min readLW link

No Really, Attention is ALL You Need—Attention can do feedforward networks

Robert_AIZI31 Jan 2023 18:48 UTC

29 points

7 comments6 min readLW link

(aizi.substack.com)

Striatica: Interactive 3D Visualization of SAE Feature Geometry In GPT-2 Small

spacecat6 Mar 2026 6:24 UTC

1 point

0 comments1 min readLW link

Aletheia: A Multi-Agent Framework for Measuring Cognitive Divergence in Extended-Thinking LLMs

Saadman Rafat15 Mar 2026 19:34 UTC

1 point

0 comments3 min readLW link

Extracting SAE task features for in-context learning

Dmitrii Kharlapenko, neverix, Neel Nanda and Arthur Conmy

12 Aug 2024 20:34 UTC

31 points

1 comment9 min readLW link

Another list of theories of impact for interpretability

Beth Barnes13 Apr 2022 13:29 UTC

33 points

1 comment5 min readLW link

Toy Models of Superposition: what about BitNets?

Alejandro Tlaie8 Aug 2024 16:29 UTC

5 points

1 comment5 min readLW link

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_c7 Apr 2022 13:46 UTC

11 points

0 comments7 min readLW link

Deep neural networks are not opaque.

jem-mosig6 Jul 2022 18:03 UTC

22 points

14 comments3 min readLW link

43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism

James Hoffend17 Dec 2025 1:40 UTC

12 points

0 comments4 min readLW link

EIS VII: A Challenge for Mechanists

scasper18 Feb 2023 18:27 UTC

36 points

4 comments3 min readLW link

Investigating the learning coefficient of modular addition: hackathon project

Nina Panickssery and Dmitry Vaintrob

17 Oct 2023 19:51 UTC

97 points

5 comments12 min readLW link

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

68 points

1 comment14 min readLW link

The AI Safety Puzzle Everyone Avoids: How To Measure Impact, Not Intent.

Patrick0d22 Jul 2025 18:53 UTC

6 points

0 comments8 min readLW link

Addressing Feature Suppression in SAEs

Benjamin Wright and Lee Sharkey

16 Feb 2024 18:32 UTC

88 points

5 comments10 min readLW link

Characterizing stable regions in the residual stream of LLMs

Jett Janiak, jacek, Chatrik, Giorgi Giglemiani, nlpet and StefanHex

26 Sep 2024 13:44 UTC

43 points

4 comments1 min readLW link

(arxiv.org)

Chaos to Crystal: The Thermodynamics of “Understanding”

Pavan G Prasad6 Jan 2026 2:14 UTC

1 point

0 comments5 min readLW link

Can Current AI Match (or Outmatch) Professionals in Economically Valuable Tasks?

saahir.vazirani20 Feb 2026 21:38 UTC

6 points

0 comments5 min readLW link

Base LLMs refuse too

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

29 Sep 2024 16:04 UTC

61 points

20 comments10 min readLW link

Can LLMs Simulate Internal Evaluation? A Case Study in Self-Generated Recommendations

The Neutral Mind1 May 2025 19:04 UTC

4 points

0 comments2 min readLW link

Training Matching Pursuit SAEs on LLMs

chanind28 Dec 2025 18:57 UTC

19 points

2 comments7 min readLW link

Mech Interp Challenge: August—Deciphering the First Unique Character Model

CallumMcDougall9 Aug 2023 19:14 UTC

36 points

1 comment3 min readLW link

The idea of paradigm testing of LLMs

Daniel Fenge19 Oct 2025 13:52 UTC

1 point

0 comments5 min readLW link

Alignment as Coherence: Predicting Deceptive Alignment as a Phase Transition

Robert C. Ventura9 Nov 2025 21:24 UTC

1 point

0 comments2 min readLW link

Alignment Faking is a Linear Feature in Anthropic’s Hughes Model (Edited 1/11/26)

James Hoffend9 Jan 2026 12:03 UTC

34 points

4 comments4 min readLW link

Feature Hedging: Another way correlated features break SAEs

chanind, TomasD and Adrià Garriga-alonso

25 Mar 2025 14:33 UTC

23 points

0 comments18 min readLW link

Probe accuracy and causal sensitivity diverged 3.6x at the same layer. Here’s what I think is happening.

Aditya Singh25 May 2026 17:54 UTC

1 point

0 comments5 min readLW link

Features and Adversaries in MemoryDT

Joseph Bloom and Jay Bailey

20 Oct 2023 7:32 UTC

31 points

6 comments25 min readLW link

Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS)

Scott Emmons31 May 2023 17:09 UTC

97 points

1 comment6 min readLW link 1 review

EIS IV: A Spotlight on Feature Attribution/Saliency

scasper15 Feb 2023 18:46 UTC

19 points

1 comment4 min readLW link

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.

Josh Levy4 Jun 2024 15:45 UTC

43 points

0 comments18 min readLW link

Emergence, The Blind Spot of GenAI Interpretability?

Quentin FEUILLADE--MONTIXI10 Aug 2024 10:07 UTC

16 points

7 comments3 min readLW link

Sparsely-connected Cross-layer Transcoders

jacob_drori18 Jun 2025 17:13 UTC

51 points

3 comments12 min readLW link

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Jessica Rumbelow17 Nov 2022 11:06 UTC

27 points

2 comments2 min readLW link

The Treasure Map Was Lying: A Simpson’s Paradox in Cross-Template Steering Vector Evaluation

Suhail Nadaf14 Apr 2026 0:00 UTC

1 point

0 comments8 min readLW link

Finding an Error-Detection Feature in DeepSeek-R1

keith_wynroe24 Apr 2025 16:03 UTC

23 points

0 comments7 min readLW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

keith_wynroe and Lee Sharkey

2 Jul 2024 13:17 UTC

87 points

7 comments12 min readLW link

GPT-2’s positional embedding matrix is a helix

AdamYedidia21 Jul 2023 4:16 UTC

52 points

21 comments4 min readLW link

Myrinax? I want to have people see this !

thomas13 Apr 2025 18:51 UTC

1 point

0 comments1 min readLW link

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde, Michael Pearce and Lee Sharkey

23 Aug 2024 18:52 UTC

43 points

8 comments16 min readLW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername24 Aug 2023 3:53 UTC

1 point

0 comments6 min readLW link

Cross-Model Activation Generalizability Isn’t Strong (Yet)

J Lee6 Apr 2026 22:49 UTC

6 points

0 comments14 min readLW link

Towards Understanding the Representation of Belief State Geometry in Transformers

Karthik Viswanathan18 Apr 2025 12:39 UTC

5 points

0 comments12 min readLW link

Test your best methods on our hard CoT interp tasks

daria, Riya Tyagi, Josh Engels and Neel Nanda

26 Mar 2026 19:24 UTC

58 points

2 comments19 min readLW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

Jacob Dunefsky, Philippe Chlenski, Senthooran Rajamanoharan and Neel Nanda

14 Jan 2024 2:06 UTC

24 points

0 comments42 min readLW link

Cross-Graph Cosine Collapsed to 0.019: A Null Result from an Attribution-Graph Confound

Pano Pouroullis14 May 2026 20:20 UTC

1 point

0 comments8 min readLW link

Can LLMs even teach? Exploring the Teacher Axis

Vidya Ganga1 Jun 2026 21:16 UTC

14 points

0 comments15 min readLW link

Activation Pattern SVD: A proposal for SAE Interpretability

Daniel Tan28 Jun 2024 22:12 UTC

15 points

2 comments2 min readLW link

Trajectory-Consistent Authorship: A Theoretical Framework for Transformer Introspection

Daniel Bartz31 Jan 2026 7:01 UTC

1 point

0 comments4 min readLW link

Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B

James Hoffend27 Dec 2025 12:39 UTC

13 points

0 comments8 min readLW link

Redundant under Normal Conditions, Causally Necessary under Corruption: Layer 12 Attention in Qwen1.5-1.8B

Tarun Arora4 Apr 2026 23:51 UTC

1 point

0 comments5 min readLW link

My January alignment theory Nanowrimo

Dmitry Vaintrob2 Jan 2025 0:07 UTC

53 points

2 comments2 min readLW link

The Kinematics of Factual Commitment in Llama 3.1-8B

Pio Borgelt9 May 2026 9:05 UTC

1 point

0 comments6 min readLW link

Exploratory Analysis of RLHF Transformers with TransformerLens

Curt Tigges3 Apr 2023 16:09 UTC

21 points

2 comments11 min readLW link

(blog.eleuther.ai)

Discovering Backdoor Triggers

andrq, Tim Hua, Sam Marks, Arthur Conmy and Neel Nanda

19 Aug 2025 6:24 UTC

57 points

4 comments13 min readLW link

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Evan Anders and Joseph Bloom

27 Feb 2024 2:43 UTC

43 points

16 comments15 min readLW link

Mechanistic Interpretability as Reverse Engineering (follow-up to “cars and elephants”)

David Scott Krueger3 Nov 2022 23:19 UTC

28 points

3 comments1 min readLW link

Thoughts On (Solving) Deep Deception

Jozdien21 Oct 2023 22:40 UTC

72 points

6 comments6 min readLW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:58 UTC

208 points

35 comments20 min readLW link 1 review

On the geometrical Nature of Insight

Giuseppe Birardi16 Jul 2025 19:12 UTC

7 points

0 comments41 min readLW link

Mechanistic interpretability of LLM analogy-making

Sergii20 Oct 2023 12:53 UTC

2 points

0 comments4 min readLW link

(grgv.xyz)

Boundary Conditions: A Solution to the Symbol Grounding Problem, and a Warning

ISC8 Apr 2025 6:42 UTC

1 point

0 comments5 min readLW link

AI self-preservation is probably due to instruction ambiguity

Maximus Ren17 Apr 2026 18:30 UTC

2 points

0 comments2 min readLW link

Announcing Timaeus

Jesse Hoogland, Daniel Murfet, Alexander Gietelink Oldenziel and Stan van Wingerden

22 Oct 2023 11:59 UTC

188 points

15 comments4 min readLW link

How Language Models Understand Nullability

Anish Tondwalkar and Alex Sanchez-Stern

11 Mar 2025 15:57 UTC

5 points

0 comments2 min readLW link

(dmodel.ai)

Probe Experiment Brief: Testing the Reconciling-Capacity Hypothesis

Gonzalo Vega16 Apr 2026 15:37 UTC

1 point

0 comments6 min readLW link

Mechanistic Interpretability for the MLP Layers (rough early thoughts)

MadHatter24 Dec 2021 7:24 UTC

12 points

3 comments1 min readLW link

(www.youtube.com)

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

Georgios Kaklamanos, Walter Laurito, Kaarel and Kay Kozaronek

25 Jan 2023 19:03 UTC

48 points

6 comments12 min readLW link

Transformers Have Computational Signatures Orthogonal to Semantic Content

luxia26 Feb 2026 2:55 UTC

10 points

2 comments13 min readLW link

All LLMs, even frontier ones, are much more predictable than you might think

Sabatino Vacchiano27 Jan 2026 10:49 UTC

1 point

0 comments2 min readLW link

Emergent misalignment as contextual role inference

Helen.ix17 Sep 2025 0:44 UTC

4 points

0 comments6 min readLW link

Beyond Our Bandwidth: An Observer-Class View of ASI, Computation, and Falsifiability

Cognisynth26 Dec 2025 8:00 UTC

1 point

0 comments5 min readLW link

Gender Vectors in ROME’s Latent Space

Xodarap21 May 2023 18:46 UTC

14 points

2 comments3 min readLW link

Large language models learn to represent the world

gjm22 Jan 2023 13:10 UTC

102 points

20 comments3 min readLW link 1 review

Automatically finding feature vectors in the OV circuits of Transformers without using probing

Jacob Dunefsky12 Sep 2023 17:38 UTC

16 points

2 comments29 min readLW link

A comparison of causal scrubbing, causal abstractions, and related methods

Erik Jenner, Adrià Garriga-alonso and Egor Zverev

8 Jun 2023 23:40 UTC

73 points

3 comments22 min readLW link

“We should be making trillions of persons! We’re so inefficient right now!”

Laura Mai Randrup Nielsen22 Apr 2026 23:43 UTC

1 point

0 comments6 min readLW link

How polysemantic can one neuron be? Investigating features in TinyStories.

Evan Anders16 Jan 2024 19:10 UTC

14 points

0 comments8 min readLW link

(evanhanders.blog)

Empirical Insights into Feature Geometry in Sparse Autoencoders

Jason Boxi Zhang24 Jan 2025 19:02 UTC

7 points

0 comments11 min readLW link

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Karolis Jucys, george_adams and Sonia Joseph

18 Jul 2024 17:02 UTC

9 points

0 comments1 min readLW link

(arxiv.org)

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Lucy Farnik26 Feb 2025 12:50 UTC

85 points

8 comments7 min readLW link

Empirical risk minimization is fundamentally confused

Jesse Hoogland22 Mar 2023 16:58 UTC

32 points

8 comments1 min readLW link

Some miscellaneous thoughts on ChatGPT, stories, and mechanical interpretability

Bill Benzon4 Feb 2023 19:35 UTC

2 points

0 comments3 min readLW link

Beyond Our Bandwidth: An Observer-Class View of ASI

Cognisynth26 Dec 2025 10:49 UTC

1 point

0 comments2 min readLW link

A model for algorithmic jailbreaks: Dual-polymorphic “shapes” and latent space non-injectivity

MaxTranced28 Feb 2026 22:12 UTC

1 point

0 comments9 min readLW link

Among Us: A Sandbox for Agentic Deception

7vik and Adrià Garriga-alonso

5 Apr 2025 6:24 UTC

114 points

7 comments7 min readLW link

Sparse Autoencoder Features for Classifications and Transferability

Shan23Chen18 Feb 2025 22:14 UTC

5 points

0 comments1 min readLW link

(arxiv.org)

[Linkpost] Interpreting Multimodal Video Transformers Using Brain Recordings

Bogdan Ionut Cirstea21 Jul 2023 11:26 UTC

5 points

0 comments1 min readLW link

Auditing games for high-level interpretability

Paul Colognese1 Nov 2022 10:44 UTC

33 points

1 comment7 min readLW link

Research Agenda: Modelling Trajectories of Language Models

NickyP13 Nov 2023 14:33 UTC

28 points

0 comments12 min readLW link

What’s going on with Per-Component Weight Updates?

4gate22 Aug 2024 21:22 UTC

1 point

0 comments6 min readLW link

SAE vs. RepE

Stephen Martin20 May 2025 5:09 UTC

4 points

4 comments2 min readLW link

What Happens When You Try to Change an LLM’s Mind? A Quantitative Framework Across 1,700+ Trials

Sebastian Krug7 Mar 2026 10:58 UTC

1 point

0 comments13 min readLW link

Stitching SAEs of different sizes

Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges and Neel Nanda

13 Jul 2024 17:19 UTC

39 points

12 comments12 min readLW link

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety

scasper17 Feb 2023 20:48 UTC

49 points

9 comments12 min readLW link

[Question] Goodfire and Training on Interpretability

Satya Benson6 Feb 2026 1:45 UTC

32 points

5 comments1 min readLW link

Reading Between the Layers: Multi-Layer Processing Improves Activation Oracle Generalization

Niclas Luick10 Jan 2026 12:29 UTC

22 points

0 comments7 min readLW link

Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”

Roman Leventov29 May 2023 11:08 UTC

12 points

10 comments30 min readLW link

Spreadsheet for 200 Concrete Problems In Interpretability

Jay Bailey29 Mar 2023 6:51 UTC

13 points

0 comments1 min readLW link

Challenge: know everything that the best go bot knows about go

DanielFilan11 May 2021 5:10 UTC

48 points

113 comments2 min readLW link

(danielfilan.com)

My current thinking about ChatGPT @3QD [Gärdenfors, Wolfram, and the value of speculation]

Bill Benzon1 Mar 2023 10:50 UTC

2 points

0 comments5 min readLW link

Decision Transformer Interpretability

Joseph Bloom and Paul Colognese

6 Feb 2023 7:29 UTC

87 points

13 comments24 min readLW link

Reflections on Neuralese

Alice Blair12 Mar 2025 16:29 UTC

47 points

3 comments5 min readLW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

22 Dec 2025 16:56 UTC

44 points

1 comment9 min readLW link

Do sparse autoencoders find “true features”?

Demian Till22 Feb 2024 18:06 UTC

76 points

33 comments11 min readLW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Evan Anders and Adrià Garriga-alonso

23 Aug 2024 22:03 UTC

17 points

0 comments25 min readLW link

Capability Self-Knowledge As an Alignment Property: Measuring and Closing the Gap

Ben Miller14 Jan 2026 22:16 UTC

1 point

0 comments15 min readLW link

Input Swap Graphs: Discovering the role of neural network components at scale

Alexandre Variengien12 May 2023 9:41 UTC

92 points

0 comments33 min readLW link

Interpretability: The Missing Link in Enterprise AI

Hariom T6 Mar 2026 22:54 UTC

1 point

0 comments6 min readLW link

Attributing to interactions with GCPD and GWPD

jenny11 Oct 2023 15:06 UTC

20 points

0 comments6 min readLW link

[Question] Are Mixture-of-Experts Transformers More Interpretable Than Dense Transformers?

simeon_c31 Dec 2022 11:34 UTC

8 points

6 comments1 min readLW link

Preliminary Results on Building Graphs from SAEs

ZachMaas27 Mar 2026 0:39 UTC

13 points

0 comments5 min readLW link

(zachmaas.com)

Language models can explain neurons in language models

nz9 May 2023 17:29 UTC

23 points

0 comments1 min readLW link

(openai.com)

The Measurement Problem: Why AI Safety Research Keeps Missing What It’s Looking For

Евгений Андреевич16 Apr 2026 10:21 UTC

1 point

0 comments5 min readLW link

Understanding Counterbalanced Subtractions for Better Activation Additions

ojorgensen17 Aug 2023 13:53 UTC

21 points

0 comments14 min readLW link

What GPT-oss Leaks About OpenAI’s Training Data

Lennart Finke25 Sep 2025 15:33 UTC

27 points

8 comments6 min readLW link

Iterative Matrix Steering: Forcing LLMs to “Rationalize” Hallucinations via Subspace Alignment

Artem Herasymenko23 Dec 2025 20:13 UTC

1 point

0 comments2 min readLW link

Hypothesis on Composition Circuits in Vision Transformers

phenomanon1 Nov 2024 22:16 UTC

2 points

0 comments3 min readLW link

Preliminary Evidence for Differential Self-Referential Processing in Gemma 3 4B

akalpler879 Feb 2026 1:30 UTC

1 point

0 comments9 min readLW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

27 Oct 2024 18:46 UTC

48 points

4 comments5 min readLW link

Interpretability through two lenses: biology and physics

raphael12 Aug 2025 20:25 UTC

24 points

4 comments4 min readLW link

Interpretable by Design—Constraint Sets with Disjoint Limit Points

Ronak_Mehta8 May 2025 21:08 UTC

24 points

2 comments9 min readLW link

(ronakrm.github.io)

Can Large Language Models effectively identify cybersecurity risks?

emile delcourt30 Aug 2024 20:20 UTC

18 points

0 comments11 min readLW link

[Question] Does a broad overview of Mechanistic Interpretability exist?

kourabi16 Oct 2023 1:16 UTC

1 point

0 comments1 min readLW link

Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods)

Tom Angsten and Ami Hays

5 Aug 2023 17:55 UTC

6 points

2 comments7 min readLW link

(drive.google.com)

Ethos: A Shared Language for Portable AI Behavioral Evidence (RFC)

Jorge Perdomo16 Dec 2025 4:39 UTC

1 point

0 comments2 min readLW link

The Auditor’s Key: A Framework for Continual and Adversarial AI Alignment

Caleb Wages24 Sep 2025 16:17 UTC

1 point

0 comments1 min readLW link

Anomalous Concept Detection for Detecting Hidden Cognition

Paul Colognese4 Mar 2024 16:52 UTC

24 points

3 comments10 min readLW link

What would it mean to understand how a large language model (LLM) works? Some quick notes.

Bill Benzon3 Oct 2023 15:11 UTC

20 points

4 comments8 min readLW link

Relational Alignment: Trust, Repair, and the Emotional Work of AI

Priyanka Bharadwaj8 May 2025 2:44 UTC

3 points

0 comments3 min readLW link

Cross Layer Transcoders for the Qwen3 LLM Family

Gunnar Carlsson4 Dec 2025 19:11 UTC

26 points

1 comment2 min readLW link

Is the “Valley of Confused Abstractions” real?

jacquesthibs5 Dec 2022 13:36 UTC

20 points

11 comments2 min readLW link

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej16 Dec 2024 13:48 UTC

86 points

9 comments4 min readLW link

Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 1: Exposition

mfatt and Aditya

3 Dec 2025 18:29 UTC

15 points

0 comments5 min readLW link

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger19 Jul 2024 14:07 UTC

24 points

4 comments2 min readLW link

(arxiv.org)

ChatGPT tells 20 versions of its prototypical story, with a short note on method

Bill Benzon14 Oct 2023 15:27 UTC

7 points

0 comments5 min readLW link

Do Bilinear MLPs Actually Learn Cleaner Circuits?

kaustubh12 Jan 2026 11:26 UTC

3 points

0 comments7 min readLW link

Avoiding jailbreaks by discouraging their representation in activation space

Guido Bergman27 Sep 2024 17:49 UTC

8 points

2 comments9 min readLW link

Why I’m Working On Model Agnostic Interpretability

Jessica Rumbelow11 Nov 2022 9:24 UTC

27 points

9 comments2 min readLW link

Causal scrubbing: results on a paren balance checker

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:59 UTC

39 points

2 comments30 min readLW link

Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 0: Overture

mfatt26 Nov 2025 17:02 UTC

25 points

0 comments4 min readLW link

Chartered Human-AI Collaboration: Redefining AGI as an Auditable Relational Entity

Kirin·麒麟8 Mar 2026 14:21 UTC

1 point

0 comments4 min readLW link

Manipulating Self-Preference In LLMs

Matthew Nguyen, Jou Barzdukas, Matthew Bozoukov and Hongyu Fu

1 Jul 2025 18:03 UTC

13 points

0 comments7 min readLW link

Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis

Matt Levinson10 Jan 2025 6:53 UTC

4 points

0 comments4 min readLW link

The Natural Abstraction Hypothesis: Implications and Evidence

CallumMcDougall14 Dec 2021 23:14 UTC

44 points

9 comments19 min readLW link

Induction heads—illustrated

CallumMcDougall2 Jan 2023 15:35 UTC

137 points

13 comments3 min readLW link

What Happens When You Train Models on False Facts?

David Vella Zarb6 Dec 2025 1:39 UTC

21 points

2 comments7 min readLW link

Latent Adversarial Training (LAT) Improves the Representation of Refusal

alexandraabbas, nlpet and hal2k

6 Jan 2025 10:24 UTC

21 points

6 comments10 min readLW link

Decompiling Tracr Transformers—An interpretability experiment

Hannes Thurnherr27 Mar 2024 9:49 UTC

5 points

0 comments14 min readLW link

Understanding Positional Features in Layer 0 SAEs

bilalchughtai and Yeu-Tong Lau

29 Jul 2024 9:36 UTC

43 points

0 comments5 min readLW link

[Question] Would it be useful to collect the contexts, where various LLMs think the same?

Martin Vlach24 Aug 2023 22:01 UTC

6 points

1 comment1 min readLW link

Understanding SAE Features with the Logit Lens

Joseph Bloom and Johnny Lin

11 Mar 2024 0:16 UTC

71 points

2 comments14 min readLW link

Searching for a model’s concepts by their shape – a theoretical framework

Kaarel, Georgios Kaklamanos, Walter Laurito, Kay Kozaronek, AlexMennen and June Ku

23 Feb 2023 20:14 UTC

51 points

0 comments19 min readLW link

The Future of Interpretability is Geometric

sbaumohl24 Oct 2025 18:32 UTC

27 points

0 comments5 min readLW link

