I mostly agree with this analysis, but I think there are better safety cases for interp than enumerating features. As you say, there may be shallow copies of the dataset inside models, but that is insufficient for safety approaches based on ruling out 'negative features', because models only need to store enough information from the dataset to induce the behaviour.
But the feature-enumeration approach is naive anyway, because it ignores compositional and dense structure in models, which is exactly the kind of structure we would expect competent models to develop.
Something I think interpretability is uniquely equipped to do, however, is find high-level structures in models. If models have generalized patterns of thinking, or general approaches to solving problems, we should expect these to be encoded in their weights. Problem-solving approaches that generalize, which we expect models to learn, should not be tied to the specifics of particular datapoints. These are arguably the more safety-relevant structures to uncover, because we expect them to be the source of model capabilities.
Throwing lots of data at the wall, as SAEs do, can help uncover such structures, because SAEs can surface, in an unsupervised way, intermediate representations arising from those structures. But it is a mistake to take these intermediate representations as atomic rather than as clusters in the output of more general structures. IMO the pipeline should look more like: find an SAE feature that seems to belong to a general category of features, and only then start the real mechanistic work of uncovering what general structure gives rise to that category.
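To make that first step concrete, here is a minimal sketch, assuming a hypothetical trained SAE whose decoder directions are available as a matrix. The names, shapes, and the use of k-means are all illustrative, not a claim about any particular SAE library or about how the categories should actually be found:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for real SAE decoder directions, shape (n_features, d_model).
rng = np.random.default_rng(0)
W_dec = rng.standard_normal((4096, 512))

# Normalise so clustering is driven by direction, not norm.
dirs = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)

# Cluster feature directions; each cluster is a candidate "general category" of features.
n_categories = 50
labels = KMeans(n_clusters=n_categories, n_init=10, random_state=0).fit_predict(dirs)

# Pick the cluster containing a feature of interest, e.g. one found by browsing SAE outputs.
feature_of_interest = 123
category = np.where(labels == labels[feature_of_interest])[0]
print(f"Feature {feature_of_interest} sits in a category of {len(category)} features")

# The mechanistic work starts here: asking what shared upstream circuit or weight
# structure gives rise to this whole category, rather than treating each feature as atomic.
```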
How do you measure whether a particular set of hyperparameters is successful? Is it based on whether the decomposition matches the pre-hypothesized ground truth, or on something more intrinsic?
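If it helps pin down what I'm asking, the two options might look roughly like this. Everything here is a hypothetical sketch (array names, shapes, and the specific metrics are my own placeholders, not a description of your setup):

```python
import numpy as np

def ground_truth_score(learned, true):
    """Mean over ground-truth directions of the best cosine match among learned ones.

    `learned` has shape (n_learned, d); `true` has shape (n_true, d) and is only
    available in toy settings where the ground truth is pre-hypothesized."""
    L = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    T = true / np.linalg.norm(true, axis=1, keepdims=True)
    return np.abs(T @ L.T).max(axis=1).mean()

def intrinsic_score(X, X_hat, codes, sparsity_weight=0.1):
    """Ground-truth-free proxy: reconstruction error plus an L0 sparsity penalty (lower is better)."""
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    l0 = np.mean(np.count_nonzero(codes, axis=1))
    return recon + sparsity_weight * l0

# Toy usage with random placeholders:
rng = np.random.default_rng(0)
true = rng.standard_normal((8, 32))
learned = rng.standard_normal((16, 32))
print(ground_truth_score(learned, true))
```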