Zac Hatfield-Dodds comments on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds 6 Oct 2023 19:59 UTC
The obvious targets are of course Anthropic’s own frontier models, Claude Instant and Claude 2.

The section “Problem setup: what makes a good decomposition?” discusses what success might look like and enable, but note that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with a perfect decomposition we’d have plenty left to do: unraveling circuits and building a larger-scale understanding of models.