It turns out that DNNs are remarkably interpretable.

Link post

I recently posted a paper suggesting that deep networks may harbor an implicitly linear model, recoverable via a form of gradient denoising. The method, called excitation pullback, produces crisp, human-aligned features and offers a structural lens on generalization. Just look at the explanations for an ImageNet-pretrained ResNet50 on the front page of the paper.
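
As a rough intuition pump (the paper gives the precise construction), here is a minimal PyTorch sketch of one way this kind of gradient denoising can look: keep the forward pass of a pretrained ResNet50 exactly as is, but backpropagate through a soft, excitation-dependent gate instead of the hard 0/1 ReLU derivative, producing a smoothed input-gradient attribution. The `SoftGateReLU` module, `swap_relus` helper, and `beta` parameter are illustrative placeholders rather than the paper's exact definitions.

```python
import torch
import torch.nn as nn
from torchvision import models


class SoftGateReLU(nn.Module):
    """Forward pass is an exact ReLU; the backward pass is gated by a
    sigmoid of the pre-activation instead of the hard 0/1 ReLU derivative."""

    def __init__(self, beta: float = 5.0):
        super().__init__()
        self.beta = beta

    def forward(self, x):
        gate = torch.sigmoid(self.beta * x).detach()  # soft gate, no grad through it
        soft = gate * x                               # gradient path: d(soft)/dx = gate
        hard = torch.relu(x)
        # Value equals ReLU(x); the gradient flows only through the soft branch.
        return soft + (hard - soft).detach()


def swap_relus(module: nn.Module, beta: float = 5.0) -> None:
    """Recursively replace every nn.ReLU in the model with SoftGateReLU."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, SoftGateReLU(beta))
        else:
            swap_relus(child, beta)


model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
swap_relus(model)

x = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in for a preprocessed image
logits = model(x)
target = logits.argmax(dim=1).item()
logits[0, target].backward()
attribution = x.grad  # denoised input gradient; e.g. average |grad| over channels to plot
```

Averaging the absolute values of `attribution` over the channel dimension and overlaying the result on the input gives a heatmap you can compare against the explanations referenced above.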