I agree with much of this, but I suspect people aren’t only sticking with activation-based interpretability because the sheer dimensionality of weight-based interpretability is intimidating. Rather, I feel like we have to be thinking about activation-based interpretability if we want an analysis of the model’s behavior to contain safety-relevant semantics.
For example, I can know nothing about how safe a classifier that distinguishes A from B is, no matter how much I know about its weights, unless I also know what A and B are. The very same set of weights might be safe when acting on one problem distribution but unsafe when acting on another. We have to characterize the problem distribution to assess safety.
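To make that concrete, here is a toy sketch (the weights, the inputs, and the two readings of the labels are all made up for illustration, not taken from any real system):

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed set of classifier weights: everything weight-based
# interpretability could ever tell us about this model lives here.
w = np.array([1.0, -0.5])

def classify(x):
    """Label an input 1 when w . x > 0, else 0."""
    return (x @ w > 0).astype(int)

inputs = rng.normal(size=(5, 2))
print(classify(inputs))  # same weights, same decisions, either way

# Reading 1: class 1 means "flag the message for human review" -> benign.
# Reading 2: class 1 means "authorize an irreversible action"  -> possibly unsafe.
# Nothing in w distinguishes the two readings; that information lives in the
# problem distribution the classifier is deployed on.
```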
By the good regulator theorem, it is true that if a model works well, it will carry some kind of weak copy of the training dataset inside its weights. In this sense, it might be reasonable to think that models “inherit” semantic features of the datasets they operate on, and maybe this provides a way to bridge between weight-based interpretability and human values regarding the model’s outputs. However, a great deal of information can be lost in the process, because a good regulator only cares about reflecting the data insofar as doing so improves its performance on the task it’s conducting. Anything that isn’t a sufficient statistic for doing that task, given the particular dataset it has on hand, can be lost.
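One way to make that last point slightly more precise (this is my own gloss, not a statement of the good regulator theorem itself): write D for the dataset, L(θ, D) for the training objective, and θ*(D) for the trained weights.

```latex
% Hedged sketch: the weights need retain no more than a task-sufficient statistic.
\text{If } \mathcal{L}(\theta, D) = \mathcal{L}\big(\theta, T(D)\big) \text{ for all } \theta,
\quad\text{then}\quad
\theta^{*}(D) = \arg\min_{\theta} \mathcal{L}\big(\theta, T(D)\big),
\quad\text{and hence}\quad
I\big(\theta^{*}(D);\, D \mid T(D)\big) = 0 .
```

So (assuming training is a deterministic function of the objective) any property of D not captured by such a T, including the safety-relevant semantics of the inputs, can be entirely absent from the weights.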
I really feel like omitted variables make any form of interpretability, weight-based or not, a doomed strategy for safety, at least interpretability in a vacuum without supplementation from other strategies. How are we supposed to detect the absence of a concept or value that the model ought to possess just from its weights or activations? Interpretability can only tell us what a model is doing, not what it is failing to do. Analyzing the safety of the outputs doesn’t scale well for models that are smarter than us, but I think it’s a necessary requirement regardless.
(We might try to characterize omitted variables as undesirable invariances in the model’s behavior: changes to the inputs that should change its activations in certain ways but don’t. However, exhaustively describing all the undesired invariances, or knowing which changes to make to the inputs and which corresponding activation changes we want to see, seems prohibitive, and it risks running into problems with the principle of indifference, so I don’t think it can work.)
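For what it’s worth, here is roughly what one such invariance probe would look like; the sketch also shows why the approach stops scaling, since every candidate omitted variable needs its own hand-crafted perturbation. The model hook, the perturbation, and the tolerance are all hypothetical stand-ins:

```python
import numpy as np

def activations(model, x):
    """Hypothetical hook: return a flat vector of the model's internal activations on x."""
    return model(x)

def is_invariant(model, x, perturb, atol=1e-3):
    """Check whether a perturbation that *should* matter actually changes the activations.

    Returns True when the activations are (approximately) unchanged, i.e. a
    candidate omitted variable. Each candidate needs its own hand-written
    `perturb`, which is where the enumeration becomes prohibitive.
    """
    before = activations(model, x)
    after = activations(model, perturb(x))
    return np.allclose(before, after, atol=atol)

# Usage sketch (all names hypothetical):
# model = ...                      # stand-in for the real network plus an activation hook
# x = ...                          # an input of interest
# suspicious = is_invariant(model, x, add_consent_withdrawal)
```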
I mostly agree with this analysis, but I think there are better safety cases for interp than enumeration of features. As you say, there might be shallow copies of the dataset inside models, but this is insufficient for safety approaches based upon ruling out ‘negative features’, because models only need to store as much information about the dataset as it takes to induce their behaviour.
But the enumeration-of-features approach is naive anyway, because it ignores any compositional or dense structure in models, which is exactly the kind of thing we would expect competent models to develop.
Something I think interpretability is uniquely equipped to do, however, is find high-level structures in models. If, for instance, models have generalized patterns of thinking, or general approaches to solving problems, then we should expect these to be encoded in the model’s weights. Approaches that generalize, which we expect models to learn, should not be tied to the specifics of particular datapoints. And these are arguably the more safety-relevant structures to uncover, because we expect them to be the source of model capabilities.
Throwing lots of data at the wall, as with SAEs, can help uncover such structures, because it surfaces, in an unsupervised manner, intermediate representations arising from those structures. But treating these intermediate representations as atomic, rather than as clusters in the output of more general structures, is a mistake. IMO the pipeline should look more like: find an SAE feature that seems to belong to a general category of features, and only then does the real mechanistic work of uncovering what general structure gives rise to that category begin.
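As a rough sketch of that first step, and only that step (the SAE encoder, the clustering choice, and all names here are placeholders rather than a claim about any existing codebase):

```python
import numpy as np
from sklearn.cluster import KMeans

def sae_features(acts, W_enc, b_enc):
    """Encode activations with a (pre-trained, hypothetical) sparse autoencoder: ReLU(acts @ W_enc + b_enc)."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def candidate_feature_categories(acts, W_enc, b_enc, n_categories=10):
    """Group SAE features into rough categories by clustering their activation
    profiles over a dataset. The clusters are only a starting point: the real
    mechanistic work is explaining what shared weight-level structure produces
    each category."""
    feats = sae_features(acts, W_enc, b_enc)   # (n_samples, n_features)
    profiles = feats.T                         # one activation profile per feature
    return KMeans(n_clusters=n_categories, n_init=10).fit_predict(profiles)

# Usage sketch (shapes and helpers are made up):
# acts = collect_activations(model, dataset)          # (n_samples, d_model)
# labels = candidate_feature_categories(acts, W_enc, b_enc)
# ...then pick a cluster that looks like a coherent category of features and
# start tracing which general structure in the weights gives rise to it.
```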