Model Diffing

Tag. Last edit: 3 Feb 2026 20:31 UTC by Arthur Conmy

“Model diffing” is a phrase introduced in https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety and popularised by the Anthropic interpretability team (https://transformer-circuits.pub/2024/crosscoders/index.html#model-diffing), where the idea is described as follows: “Just as we review software in terms of incremental diffs, you might hope to review the safety of models by focusing on how it has changed from a previously deployed model [...] to isolate and interpret these changes”. Many other machine learning papers explore similar diffing techniques, e.g. https://arxiv.org/abs/2211.12491

Model diffing may also refer specifically to the study of mechanistic changes introduced during fine-tuning, that is, understanding what makes a fine-tuned model internally different from its base model.
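As a rough illustration of the base-versus-fine-tuned comparison described above, the sketch below ranks parameters by how much they changed between two checkpoints. It is a toy weight-space diff only, not the crosscoder or activation-difference methods used in the posts listed here, and it says nothing about mechanisms by itself. The function names (`relative_change`, `rank_changed_parameters`) and the parameter names and values are hypothetical, chosen for the example.

```python
import math


def relative_change(base, tuned):
    """Relative L2 change ||tuned - base|| / ||base|| for one flat parameter vector."""
    if len(base) != len(tuned):
        raise ValueError("parameter vectors must have the same length")
    diff = math.sqrt(sum((t - b) ** 2 for b, t in zip(base, tuned)))
    norm = math.sqrt(sum(b * b for b in base))
    if norm > 0:
        return diff / norm
    # An all-zero base vector has no meaningful relative scale.
    return float("inf") if diff > 0 else 0.0


def rank_changed_parameters(base_params, tuned_params):
    """Return (name, relative change) pairs for shared parameters, largest change first."""
    shared = base_params.keys() & tuned_params.keys()
    changes = {
        name: relative_change(base_params[name], tuned_params[name])
        for name in shared
    }
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)


# Toy checkpoints (hypothetical values): only layer1.weight differs.
base = {"layer0.weight": [1.0, 0.0, 0.0, 1.0], "layer1.weight": [3.0, 4.0]}
tuned = {"layer0.weight": [1.0, 0.0, 0.0, 1.0], "layer1.weight": [3.0, 1.0]}

ranking = rank_changed_parameters(base, tuned)
for name, change in ranking:
    print(f"{name}: {change:.2f}")
```

A ranking like this can at most suggest where to look; interpreting what changed would still require the kind of activation-level analysis discussed in the posts below.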

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]

Jason Gross and rajashree

6 Jan 2025 4:22 UTC

19 points

0 comments · 12 min read · LW link

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

Clément Dumas, Julian Minder and Neel Nanda

30 Jun 2025 17:17 UTC

106 points

2 comments · 7 min read · LW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

27 Oct 2024 18:46 UTC

48 points

4 comments · 5 min read · LW link

Tied Crosscoders: Explaining Chat Behavior from Base Model

Santiago Aranguri

22 Mar 2025 18:07 UTC

9 points

0 comments · 12 min read · LW link

[Research sprint] Single-model crosscoder feature ablation and steering

Thomas Read

6 Apr 2025 14:42 UTC

11 points

0 comments · 12 min read · LW link

[Replication] Crosscoder-based Stage-Wise Model Diffing

Anna Soligo, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree and Jason Gross

22 Mar 2025 18:35 UTC

25 points

0 comments · 7 min read · LW link

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Julian Minder, Clément Dumas, Stewy Slocum and Neel Nanda

5 Sep 2025 12:11 UTC

54 points

2 comments · 7 min read · LW link

SAE on activation differences

Santiago Aranguri, jacob_drori and Neel Nanda

30 Jun 2025 17:50 UTC

45 points

3 comments · 5 min read · LW link
