The obvious targets are of course Anthropic’s own frontier models, Claude Instant and Claude 2.
The section “Problem setup: what makes a good decomposition?” discusses what success might look like and what it might enable. Note, though, that decomposing models into components is just the beginning of the work of mechanistic interpretability! Even with a perfect decomposition, we’d have plenty left to do: unraveling circuits and building a larger-scale understanding of models.