When using the MDL loss to motivate the simplicity loss in A.2.1, I don’t see why the description-length penalty scales linearly with the rank of a component, at α bits per rank-1 piece. That is, when it says:
“If we consider [the two rank-1 matrices that always co-activate] as one separate component, then we only need one index to identify both of them, and therefore only need log2(C) + 2α bits.”
I’m not sure why this is 2α instead of α. The reasoning in the rank-1 case seems to carry over unchanged: if we use α bits of precision to store the scalar A_c(x), then a sparse A(x) vector takes ∑_c ∥A_c(x)∥_0 · (α + log2(C)) bits to store. The rank of P_c doesn’t seem to play a part in this argument.
One way this could make sense is if you’re always storing P_c as a sum of rank-1 components, as later described in A.2.2. If you compute the attribution separately with respect to each rank-1 component, then it’ll take log2(C) + α·rank(P_c) bits to store A_c(x) (one index for the component, plus one value per rank-1 piece). But it seems you compute attributions with respect to P_c directly, rather than with respect to each rank-1 component separately. I’m not sure how to resolve this.
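To make the two bookkeeping schemes concrete, here is a toy calculation (my own sketch; C, α, and the ranks are made-up numbers) comparing the per-component cost from the rank-1 argument with the rank-scaled cost the appendix uses:

```python
import math

# Toy setup (made-up numbers): C candidate components, alpha bits of
# precision per stored scalar, and the ranks of the components that
# happen to be active on this input.
C = 1024                   # total number of components
alpha = 16                 # bits of precision per attribution scalar
active_ranks = [1, 2, 3]   # ranks of the components active on input x

# Scheme 1: store one scalar A_c(x) per active component, regardless of
# its rank -- (alpha + log2(C)) bits per active component.
per_component = sum(alpha + math.log2(C) for _ in active_ranks)

# Scheme 2: store one scalar per rank-1 piece of each active component --
# (log2(C) + alpha * rank(P_c)) bits per active component.
per_rank1_piece = sum(math.log2(C) + alpha * r for r in active_ranks)

print(per_component)    # 3 * (16 + 10) = 78.0
print(per_rank1_piece)  # (10 + 16) + (10 + 32) + (10 + 48) = 126.0
```

The two schemes agree exactly when every active component is rank 1, which is why the disagreement only shows up once higher-rank components enter the picture.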
(This isn’t very important either way: if the cost doesn’t scale with the rank, the MDL loss would directly justify the minimality loss instead. You could still justify penalizing the sum of the ranks of the P_{c,l} as a natural form of simplicity loss added in alongside faithfulness, at a slight cost to conceptual unity.)
The idea of the motivation is indeed that you want to encode the attribution of each rank-1 piece separately. In practice, computing the attribution of P_c as a whole actually does involve calculating the attributions of all the rank-1 pieces and summing them up, though you’re correct that nothing we do requires storing those intermediate results.
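As a sanity check on that claim, here is a small numpy sketch (my own construction, not code from the paper, and using a simple linear attribution for illustration) showing that an attribution computed against P_c as a whole equals the sum of the attributions of its rank-1 pieces:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 8, 3

# Build a rank-3 component P_c as a sum of rank-1 pieces u_k v_k^T.
U = rng.normal(size=(d, rank))
V = rng.normal(size=(d, rank))
P_c = sum(np.outer(U[:, k], V[:, k]) for k in range(rank))

x = rng.normal(size=d)   # input activation
g = rng.normal(size=d)   # upstream gradient of the output w.r.t. P_c @ x

# Attribution with respect to the whole component...
attr_whole = g @ (P_c @ x)

# ...equals the sum of the attributions of the rank-1 pieces,
# since g^T (sum_k u_k v_k^T) x = sum_k (g . u_k)(v_k . x).
attr_pieces = sum((g @ U[:, k]) * (V[:, k] @ x) for k in range(rank))

print(np.isclose(attr_whole, attr_pieces))  # True
```

So the rank-1 attributions are computed along the way by linearity, but summing them at the end loses exactly the per-piece information the 2α encoding pays for.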
While it technically works out, you’re pointing at a part of the math that I think is still somewhat unsatisfying. If Bob calculates the attributions and sends them to Alice, why would Alice care about getting the attribution of each rank-1 piece separately if she doesn’t need them to tell which component to activate? Why can’t Bob just sum them before he sends them? It vaguely makes sense to me that Alice would want the state of a multi-dimensional object on the forward pass described with multiple numbers, but what exactly are we assuming she wants that state for? It seems she has to be doing something with it beyond just running her own, sparser forward pass.
I’m brooding over variations of this at the moment, trying to find something for Alice to do that connects better to what we actually want to do. Maybe she is trying to study the causal traces of some forward passes, but pawned the cost of running those traces off to Bob, and now she wants to get the shortest summary of the traces for her investigation under the constraint that uncompressing the summary shouldn’t cost her much compute. Or maybe Alice wants something else. I don’t know yet.
Thanks, that’s a very helpful way of putting it! Not having thought about it for very long, my intuition says “minimizing the description length of A(x) definitely shouldn’t impose constraints on the components themselves,” i.e. “Alice has no use for the rank-1 attributions.” But I can see why it would be nice to find a way for Alice to want that information, and you probably have deeper intuitions for this.