I agree, but my point was more of “how would we distinguish this scenario from the default assumption that the interp methods aren’t good enough yet”? How can we make a method-agnostic argument that the model is somehow interpretable?
It’s possible there’s no way to do this, which bears thinking about.
Something like “We have mapped out the possible human-understandable or algorithmically neat descriptions of the network’s behavior sufficiently comprehensively, and sampled from this space sufficiently comprehensively, to know that the probability that there’s a description of its behavior meaningfully shorter than the shortest one we’ve found is at most ϵ.”
Yeah, seems hard
I’m not convinced that you can satisfy either of those “sufficiently comprehensively” conditions well enough that you’d be comfortable arguing your model is not somehow interpretable.
I’m not claiming it’s feasible (within decades). That’s just what a solution might look like.
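A rough way to write down the criterion quoted above, as a sketch only. Everything except ϵ is assumed notation, not something from the thread: $f$ is the network, $\mathcal{D}$ is the space of candidate descriptions, $\ell(d)$ is the length of description $d$, $d^{\ast}$ is the shortest description found so far, and $\delta > 0$ is a hypothetical margin standing in for “meaningfully shorter”.

```latex
% Assumed notation: f, \mathcal{D}, \ell, d^{\ast}, \delta are illustrative only.
\[
  \Pr\Bigl[\,\exists\, d \in \mathcal{D} :\;
      d \text{ describes } f
      \;\wedge\;
      \ell(d) < \ell(d^{\ast}) - \delta \,\Bigr]
  \;\le\; \epsilon
\]
```

Written this way, the two “sufficiently comprehensively” conditions correspond to how well $\mathcal{D}$ has been mapped out and how well it has been sampled, and that is where the disagreement above sits: the bound is only as credible as the coverage of $\mathcal{D}$.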