“explanation of (network, dataset)”: I’m afraid I don’t have a great formalish definition beyond just pointing at the intuitive notion.
What’s wrong with “proof” as a formal definition of explanation (of behavior of a network on a dataset)? I claim that description length works pretty well on “formal proof”, I’m in the process of producing a write-up on results exploring this.
When I was discussing better sparsity penalties with Lawrence, and the fact that I observed some instability in L0<p<1 in toy models of super-position, he pointed out that the gradient of L0<p<1 norm explodes near zero, meaning that features with “small errors” that cause them to have very small but non-zero overlap with some activations might be killed off entirely rather than merely having the overlap penalized.
See here for some brief write-up and animations.