There are certain cases where pure gradient-based attributions predictably don’t work (most notably when a softmax is saturated).
Do you have a source or writeup on this somewhere? (Or do you mind explaining more / giving some examples where it holds?) Is this issue something that actually comes up for modern-day LLMs?
In my observations it works fine for the toy tasks people have tried it on. The challenge seems to be in interpreting the attributions, not issues with the attributions themselves.
It’s elementary that the derivative of a softmax approaches zero when one of its inputs is significantly bigger than the others: every entry of the softmax Jacobian is a product of output probabilities, and those products all vanish as the distribution approaches one-hot. When you then apply the chain rule, this entire pathway for the gradient gets knocked out, so a gradient-based attribution reports roughly zero even when actually intervening on that input would change the output substantially.
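To make this concrete, here is a minimal PyTorch sketch (mine, not from the thread; the `attention_weight` helper and the specific logit values are purely illustrative) showing the gradient through a saturated softmax collapsing to ~0 while an actual intervention on the dominant logit still has a large effect:

```python
import torch

def attention_weight(logits):
    # Softmax over a vector of logits; we track the first output probability.
    return torch.softmax(logits, dim=-1)[0]

# Unsaturated case: logits of comparable size -> informative gradient.
logits = torch.tensor([1.0, 0.5, 0.0], requires_grad=True)
attention_weight(logits).backward()
print(logits.grad)  # roughly [0.25, -0.16, -0.09]

# Saturated case: one logit dominates, so the softmax output is ~one-hot.
logits = torch.tensor([20.0, 0.5, 0.0], requires_grad=True)
attention_weight(logits).backward()
print(logits.grad)  # all entries ~0 (order 1e-9 at most): the pathway is knocked out

# Yet actually patching the dominant logit down (an intervention, not a
# first-order approximation) moves the output from ~1.0 to ~0.38, a large
# effect that the gradient-based attribution above completely misses.
with torch.no_grad():
    print(attention_weight(torch.tensor([20.0, 0.5, 0.0])),
          attention_weight(torch.tensor([0.5, 0.5, 0.0])))
```

Nothing here is non-differentiable; the failure is just that the local slope is flat wherever the softmax is saturated, so the first-order estimate misses the effect of a large change.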
I don’t know to what extent it comes up with modern-day LLMs. I’d certainly bet one could generate a lot of interpretability work while staying within the linear-approximation regime. I guess at some point it reduces to the question of why one is doing mechanistic interpretability in the first place.
Figure 3 in this paper (AtP*, Kramar et al.) illustrates the point nicely: https://arxiv.org/abs/2403.00745