I would ask what the end goal of interpretability is. Specifically, what explanations of our model’s cognition do we want to get out of our interpretability methods? The mapping we want is from the model’s cognition to our idea of what makes a model safe. “Uninterpretable” could imply that the way the models are doing something is too alien for us to really understand. I think that could be fine (though not great!), as long as we have answers to the questions we care about (e.g. does it have goals, what are they, is it trying to deceive its overseers?)[1]. Applied to questions like those, “uninterpretable” doesn’t seem as coherent to me.
The “why” or maybe “what”, instead of the “how”.
I agree, but my point was more “how would we distinguish this scenario from the default assumption that the interp methods aren’t good enough yet?” How can we make a method-agnostic argument that the model is somehow interpretable?
It’s possible there’s no way to do this, which bears thinking about.
Something like “We have mapped out the possible human-understandable or algorithmically neat descriptions of the network’s behavior sufficiently comprehensively and sampled from this space sufficiently comprehensively to know that the probability that there’s a description of its behavior that is meaningfully shorter than the shortest of the ones we’ve found is at most ϵ.”
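One rough way to write that criterion down, as a sketch rather than a worked-out proposal. All the symbols are assumptions introduced here for illustration: $\mathcal{D}$ is the space of candidate descriptions, $\ell$ is description length, $\hat{d}$ is the shortest description found so far, and $\Delta$ is a margin standing in for “meaningfully shorter”.

```latex
\Pr\Bigl[\, \exists\, d \in \mathcal{D} :\;
      d \text{ describes the network's behavior}
      \;\wedge\; \ell(d) < \ell(\hat{d}) - \Delta \,\Bigr] \;\le\; \epsilon
```

Read this way, the two “sufficiently comprehensively” conditions are about how well $\mathcal{D}$ has been mapped out and how thoroughly it has been sampled. The probability is taken over whatever uncertainty about $\mathcal{D}$ remains after that search.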
Yeah, seems hard
I’m not convinced that you can satisfy either of those “sufficiently comprehensively” conditions well enough that you’d be comfortable arguing your model is not somehow interpretable.
I’m not claiming it’s feasible (within decades). That’s just what a solution might look like.