When using the MDL loss to motivate the simplicity loss in A.2.1, I don’t see why the description-length penalty scales linearly with the rank of a component, at α bits per rank-1 piece. That is, when it says:
“If we consider [the two rank-1 matrices that always co-activate] as one separate component, then we only need one index to identify both of them, and therefore only need log2(C) + 2α bits.”
I’m not sure why this is 2α instead of α. The reasoning in the rank-1 case seems to carry over unchanged: if we use α bits of precision to store the scalar A_c(x), then a sparse A(x) vector takes ∑_c ∥A_c(x)∥_0 · (α + log2(C)) bits to store. The rank of P_c doesn’t seem to play a part in this argument.
One way this could make sense is if you’re always storing P_c as a sum of rank-1 components, as later described in A.2.2. If you compute the attribution separately with respect to each rank-1 component, then it’ll take log2(C) + α·rank(P_c) bits to store A_c(x) (one index for the component, plus one value per rank-1 piece). But it seems you compute attributions with respect to P_c directly, rather than with respect to each rank-1 component separately. I’m not sure how to resolve this.
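To make the two bookkeeping schemes concrete, here is a toy calculation (my own sketch; C, α, and the ranks are made-up numbers) comparing the per-component cost from the rank-1 argument with the rank-scaled cost the appendix uses:

```python
import math

# Toy setup (made-up numbers): C candidate components, alpha bits of
# precision per stored scalar, and the ranks of the components that
# happen to be active on this input.
C = 1024                   # total number of components
alpha = 16                 # bits of precision per attribution scalar
active_ranks = [1, 2, 3]   # ranks of the components active on input x

# Scheme 1: store one scalar A_c(x) per active component, regardless of
# its rank -- (alpha + log2(C)) bits per active component.
per_component = sum(alpha + math.log2(C) for _ in active_ranks)

# Scheme 2: store one scalar per rank-1 piece of each active component --
# (log2(C) + alpha * rank(P_c)) bits per active component.
per_rank1_piece = sum(math.log2(C) + alpha * r for r in active_ranks)

print(per_component)    # 3 * (16 + 10) = 78.0
print(per_rank1_piece)  # (10 + 16) + (10 + 32) + (10 + 48) = 126.0
```

The two schemes agree exactly when every active component is rank 1, which is why the disagreement only shows up once higher-rank components enter the picture.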
(This isn’t very important either way: if the cost doesn’t scale with the rank, the MDL loss would directly justify the minimality loss instead. You could still justify penalizing the sum of the ranks of the P_{c,l} as a natural form of simplicity loss added in alongside faithfulness, at a slight cost to conceptual unity.)
The idea of the motivation is indeed that you want to encode the attribution of each rank-1 piece separately. In practice, computing the attribution of P_c as a whole actually does involve calculating the attributions of all the rank-1 pieces and summing them up, though you’re correct that nothing we do requires storing those intermediate results.
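As a sanity check on that claim, here is a small numpy sketch (my own construction, not code from the paper, and using a simple linear attribution for illustration) showing that an attribution computed against P_c as a whole equals the sum of the attributions of its rank-1 pieces:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 8, 3

# Build a rank-3 component P_c as a sum of rank-1 pieces u_k v_k^T.
U = rng.normal(size=(d, rank))
V = rng.normal(size=(d, rank))
P_c = sum(np.outer(U[:, k], V[:, k]) for k in range(rank))

x = rng.normal(size=d)   # input activation
g = rng.normal(size=d)   # upstream gradient of the output w.r.t. P_c @ x

# Attribution with respect to the whole component...
attr_whole = g @ (P_c @ x)

# ...equals the sum of the attributions of the rank-1 pieces,
# since g^T (sum_k u_k v_k^T) x = sum_k (g . u_k)(v_k . x).
attr_pieces = sum((g @ U[:, k]) * (V[:, k] @ x) for k in range(rank))

print(np.isclose(attr_whole, attr_pieces))  # True
```

So the rank-1 attributions are computed along the way by linearity, but summing them at the end loses exactly the per-piece information the 2α encoding pays for.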
While it technically works out, you’re pointing at a part of the math that I think is still somewhat unsatisfying. If Bob calculates the attributions and sends them to Alice, why would Alice care about getting the attribution of each rank-1 piece separately if she doesn’t need them to tell which component to activate? Why can’t Bob just sum them before he sends them? It vaguely makes sense to me that Alice would want the state of a multi-dimensional object on the forward pass described with multiple numbers, but what exactly are we assuming she wants that state for? It seems she has to be doing something with it beyond just running her own, sparser forward pass.
I’m brooding over variations of this at the moment, trying to find something for Alice to do that connects better to what we actually want to do. Maybe she is trying to study the causal traces of some forward passes, but pawned the cost of running those traces off to Bob, and now she wants to get the shortest summary of the traces for her investigation under the constraint that uncompressing the summary shouldn’t cost her much compute. Or maybe Alice wants something else. I don’t know yet.
Thanks, that’s a very helpful way of putting it! Not having thought about it for very long, my intuition says “minimizing the description length of A(x) definitely shouldn’t impose constraints on the components themselves,” i.e. “Alice has no use for the rank-1 attributions.” But I can see why it would be nice to find a way for Alice to want that information, and you probably have deeper intuitions for this.