Hmm, I got a couple of questions. Quoting from the abstract,
In this paper we argue that ReLU networks learn an implicit linear model we can actually tap into.
What do you mean by “linear model”? In particular, do you mean “Actually, DNNs are linear”? Because that’s importantly not true; linear models cannot do the things we care about.
We describe that alleged model formally and show that we can approximately pull its decision boundary back to the input space with certain simple modification to the backward pass. The resulting gradients (called excitation pullbacks) reveal high-resolution input- and target-specific features of remarkable perceptual alignment [...]
Figure 1 is weaker evidence than one might initially think. Various saliency-map techniques have fallen for interpretability illusions in the past; see the canonical critique here.
That said, I haven’t read your full paper. Do you still think your method is working after considering the saliency map illusions?
Also, I want to reward people thinking of new interpretability ideas and talking about them, thank you for doing so!
Hi, thanks for the comment!

By “linear” I mean linear in the feature space, just like kernel machines are considered “linear” under a specific data embedding.
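Concretely, I mean the standard kernel-machine picture (a generic sketch in my own notation here, not the paper’s exact symbols): the predictor is linear in a fixed embedding of the data, even though it is nonlinear in the raw input.

```latex
% "Linear" in the kernel-machine sense: f is linear in the embedding
% \varphi(x), not in the raw input x. By the representer theorem the
% learned weight vector is a combination of embedded training points.
f(x) = \langle w, \varphi(x) \rangle
     = \sum_{i=1}^{n} \alpha_i \, k(x_i, x),
\qquad k(x, x') = \langle \varphi(x), \varphi(x') \rangle
```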
Regarding saliency maps, I still think my method can be considered faithful. In fact, the whole theoretical toolset I develop serves to argue for the faithfulness of excitation pullbacks, in particular Hypothesis 1. I argue that the model approximates a kernel machine in the path space precisely to motivate why excitation pullbacks might be faithful, i.e. they reveal the decision boundary of a more regular underlying model and point to exactly where the gradient noise comes from (in short, I claim that gradients are noisy because they correspond to rank-1 tensors in the feature space, whereas the network actually learns a higher-rank feature map).
Also notice that I perform just 5 steps of rudimentary gradient ascent in the pixel space with no additional regularisation, immediately achieving very sensible-looking results that are both input- and target-specific. Arguably, the highlighted features are exactly those that humans would highlight when asked to accentuate the most salient features predicting a given class.
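For context, the outer loop is nothing more than the following (a minimal sketch; `model`, `step_size` and the plain-autograd gradient are stand-ins of mine — the paper’s excitation pullbacks replace the standard backward pass, which this snippet does not reproduce):

```python
import torch

def pixel_space_ascent(model, x, target_class, steps=5, step_size=1e-2):
    # Minimal sketch of the outer loop only: a few steps of plain gradient
    # ascent on the target logit in pixel space, with no regularisation.
    # The paper swaps the standard gradient for the "excitation pullback"
    # (a modified backward pass), which is NOT reproduced here.
    x = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        score = model(x)[:, target_class].sum()   # logit of the target class
        (grad,) = torch.autograd.grad(score, x)   # d(score)/d(pixels)
        with torch.no_grad():
            x += step_size * grad / (grad.norm() + 1e-8)  # normalised, unregularised step
    return x.detach()
```

The point of the sketch is just how bare the procedure is: five such steps, no extra regularisers or priors, with the excitation-pullback gradient substituted for `grad`.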