shower thought: What if mech interp is already pretty good, and it turns out that the models themselves are just doing relatively uninterpretable things?
How would we know?
I don’t know! Seems hard
It’s hard because you need to disentangle ‘the interpretive power of the method’ from ‘whether the model has anything that can be interpreted’, without any ground-truth signal for the latter. Basically, you need to be very confident that the interp method is good in order to make this claim.
One way you might be able to demonstrate this is to train or design toy models that you know have some underlying interpretable structure, and show that your interpretation methods work there. But it seems hard to construct toy models in a realistic way while also ensuring they have the structure you want; if we could do this, we wouldn’t even need interpretability.
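For concreteness, here’s a very rough sketch of the kind of check I mean: plant known feature directions in synthetic activations, run some generic recovery method over them (plain dictionary learning here, purely as a stand-in for a real interp method), and score how well the planted structure is recovered. All the numbers and the recovery score are ad hoc choices for illustration, not any established benchmark.

```python
# Sketch only: synthetic "activations" with planted feature directions, and
# generic dictionary learning standing in for an interp method. The question
# is just whether the known structure gets recovered.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n_true_features, d_model, n_samples = 12, 8, 2000

# Ground-truth feature directions; more features than dimensions, so they
# have to be stored in superposition.
true_features = normalize(rng.normal(size=(n_true_features, d_model)))

# Sparse codes: each sample activates a small random subset of features.
codes = rng.exponential(size=(n_samples, n_true_features))
codes *= rng.random((n_samples, n_true_features)) < 0.1
activations = codes @ true_features

# "Interp method": plain dictionary learning on the activations.
dl = DictionaryLearning(n_components=n_true_features, alpha=0.5,
                        max_iter=50, random_state=0)
dl.fit(activations)
learned = normalize(dl.components_)

# Recovery score: for each planted feature, cosine similarity to its
# best-matching learned direction. Values near 1.0 across the board would
# mean the method "works" on this toy setup.
similarity = np.abs(true_features @ learned.T)
recovery = similarity.max(axis=1)
print("per-feature recovery:", np.round(recovery, 2))
print("mean recovery:", round(float(recovery.mean()), 3))
```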
Edit: Another method might be to show that models get more and more “uninterpretable” as you train them on more data. I.e., define some metric of interpretability, like “ratio of monosemantic to polysemantic MLP neurons”, and measure it over the course of training. This exact instantiation of the metric is probably bad, but something like it could work.
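As a sketch, here’s one crude fraction-of-monosemantic-neurons version of that metric (the purity rule below is something I’m making up on the spot, and `checkpoints` / `get_mlp_activations` in the commented-out loop are hypothetical stand-ins for however you’d load activations per checkpoint):

```python
# Sketch only: a crude "fraction of monosemantic MLP neurons" metric. A neuron
# counts as monosemantic here if its top-k activating inputs mostly share one
# coarse label; that purity rule is an arbitrary stand-in, not a real definition.
import numpy as np

def monosemantic_fraction(acts: np.ndarray, labels: np.ndarray,
                          top_k: int = 50, purity_threshold: float = 0.9) -> float:
    """acts: (n_samples, n_neurons) MLP activations; labels: (n_samples,)
    coarse input categories."""
    n_mono = 0
    for neuron in range(acts.shape[1]):
        top_idx = np.argsort(acts[:, neuron])[-top_k:]
        _, counts = np.unique(labels[top_idx], return_counts=True)
        if counts.max() / len(top_idx) >= purity_threshold:
            n_mono += 1
    return n_mono / acts.shape[1]

# Tracking it over training would then look something like this
# (checkpoints / get_mlp_activations are hypothetical loaders, not a real API):
# for step, ckpt in checkpoints:
#     acts = get_mlp_activations(ckpt, eval_inputs)
#     print(step, monosemantic_fraction(acts, eval_labels))
```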
I would ask what the end-goal of interpretability is. Specifically, what explanations of our model’s cognition do we want to get out of our interpretability methods? The mapping we want is from the model’s cognition to our idea of what makes a model safe. “Uninterpretable” could imply that the way the models are doing something is too alien for us to really understand. I think that could be fine (though not great!), as long as we have answers to questions we care about (e.g. does it have goals, what are they, is it trying to deceive its overseers)[1]. To questions like those, “uninterpretable” doesn’t seem as coherent to me.
The “why” or maybe “what”, instead of the “how”.
I agree, but my point was more “how would we distinguish this scenario from the default assumption that the interp methods aren’t good enough yet?” How can we make a method-agnostic argument that the model isn’t somehow interpretable?
It’s possible there’s no way to do this, which bears thinking about
Something like: “We have mapped out the space of possible human-understandable or algorithmically neat descriptions of the network’s behavior sufficiently comprehensively, and sampled from this space sufficiently comprehensively, to know that the probability that there’s a description of its behavior meaningfully shorter than the shortest one we’ve found is at most ϵ.”
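Spelled out a bit more formally (my notation, nothing standard: 𝒟 for the set of descriptions we’ve actually found, ℓ(d) for the length of a description d), the claim would be roughly:

$$\Pr\Big[\,\exists\, d \notin \mathcal{D}\ \text{describing the network's behavior with}\ \ell(d) \ll \min_{d' \in \mathcal{D}} \ell(d')\,\Big] \le \epsilon$$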
Yeah, seems hard
I’m not convinced that you can satisfy either of those “sufficiently comprehensively” conditions such that you’d be comfortable arguing your model is not somehow interpretable.
I’m not claiming it’s feasible (within decades). That’s just what a solution might look like.