Model Diffing

Tag. Last edit: 3 Feb 2026 20:31 UTC by Arthur Conmy

“Model diffing” is a phrase introduced in https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety and popularised by the Anthropic interpretability team in https://transformer-circuits.pub/2024/crosscoders/index.html#model-diffing, where it is described as follows: “Just as we review software in terms of incremental diffs, you might hope to review the safety of models by focusing on how it has changed from a previously deployed model [...] to isolate and interpret these changes”. There are many other machine learning papers on similar diffing techniques, e.g. https://arxiv.org/abs/2211.12491.


Model diffing may refer specifically to the study of mechanistic changes introduced during fine-tuning, that is, understanding what makes a fine-tuned model internally different from its base model.
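As a rough illustration of the fine-tuning sense of the term, here is a toy sketch (not the crosscoder method from the Anthropic post, and not taken from any of the posts listed under this tag). It assumes a hypothetical "base model" consisting of a single linear layer and a hypothetical "fine-tuned model" that differs from it by a rank-1 weight update. All names (`W_base`, `W_tuned`, `hidden_activations`) are made up for the example. The sketch runs both models on the same inputs, subtracts their activations, and checks whether the dominant direction of the difference recovers the direction that fine-tuning added.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_samples = 8, 16, 256

# Hypothetical "base model": one linear layer (an assumption for illustration).
W_base = rng.normal(size=(d_in, d_hidden))

# Hypothetical "fine-tuned model": base weights plus a rank-1 update
# that writes along a single unit direction v in activation space.
u = rng.normal(size=d_in)
v = rng.normal(size=d_hidden)
v /= np.linalg.norm(v)
W_tuned = W_base + 0.5 * np.outer(u, v)


def hidden_activations(W, X):
    """Return the layer's activations for a batch of inputs X."""
    return X @ W


# Run both models on the same inputs and diff their activations.
X = rng.normal(size=(n_samples, d_in))
diff = hidden_activations(W_tuned, X) - hidden_activations(W_base, X)

# The top right-singular vector of the difference matrix is the direction
# along which the two models' activations differ the most.
_, s, Vt = np.linalg.svd(diff, full_matrices=False)
top_direction = Vt[0]

# Absolute cosine similarity with the planted direction
# (close to 1 in this toy setup, since the update is exactly rank 1).
alignment = abs(float(top_direction @ v))
print(f"alignment with planted direction: {alignment:.3f}")
```

Real models are nonlinear and fine-tuning changes many weights at once, so the difference is rarely this clean. The sketch is only meant to show the basic shape of the comparison: same inputs, two models, and an analysis of what differs internally.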

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]

6 Jan 2025 4:22 UTC
19 points
0 comments, 12 min read

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

30 Jun 2025 17:17 UTC
106 points
2 comments, 7 min read

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

27 Oct 2024 18:46 UTC
49 points
4 comments, 5 min read

Tied Crosscoders: Explaining Chat Behavior from Base Model

Santiago Aranguri, 22 Mar 2025 18:07 UTC
9 points
0 comments, 12 min read

[Research sprint] Single-model crosscoder feature ablation and steering

Thomas Read, 6 Apr 2025 14:42 UTC
10 points
0 comments, 12 min read

[Replication] Crosscoder-based Stage-Wise Model Diffing

22 Mar 2025 18:35 UTC
25 points
0 comments, 7 min read

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

5 Sep 2025 12:11 UTC
53 points
2 comments, 7 min read

SAE on activation differences

30 Jun 2025 17:50 UTC
45 points
3 comments, 5 min read