Model Diffing

Tag. Last edit: 30 Jun 2025 17:22 UTC by Clément Dumas

Model diffing is the study of the mechanistic changes introduced during fine-tuning: in essence, understanding what makes a fine-tuned model internally different from its base model.

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]

6 Jan 2025 4:22 UTC
19 points
0 comments · 12 min read

What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

30 Jun 2025 17:17 UTC
105 points
2 comments · 7 min read

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

27 Oct 2024 18:46 UTC
48 points
4 comments · 5 min read

Tied Crosscoders: Explaining Chat Behavior from Base Model

Santiago Aranguri, 22 Mar 2025 18:07 UTC
9 points
0 comments · 12 min read

[Research sprint] Single-model crosscoder feature ablation and steering

Thomas Read, 6 Apr 2025 14:42 UTC
8 points
0 comments · 12 min read

[Replication] Crosscoder-based Stage-Wise Model Diffing

22 Mar 2025 18:35 UTC
19 points
0 comments · 7 min read

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

5 Sep 2025 12:11 UTC
50 points
2 comments · 7 min read

SAE on activation differences

30 Jun 2025 17:50 UTC
44 points
3 comments · 5 min read