I noticed that you use task cross-entropy loss instead of KL divergence when learning task masks (Appendix 4.5, Loss Function). This may be a reasonable design choice, but it's worth noting that it will ablate any “negative” nodes (and indirectly cause you to ignore the positive nodes that overcome them).
Overall, I suspect this causes the subnetworks to miss important model computations (though it obviously decreases the size of the subnetwork).
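To make the distinction concrete, here's a minimal sketch of the two objectives (not the paper's actual implementation; `masked_logits`, `full_logits`, etc. are hypothetical names):

```python
import torch.nn.functional as F

def task_ce_loss(masked_logits, labels):
    # Task CE only rewards probability on the correct label. A node that pushes
    # probability *away* from the label ("negative" node) can be masked out with
    # no penalty -- the loss may even decrease.
    return F.cross_entropy(masked_logits, labels)

def kl_to_full_model(masked_logits, full_logits):
    # KL to the full model's output distribution penalizes *any* divergence,
    # so ablating a negative node (and the positive nodes that counteract it)
    # shows up as a faithfulness cost.
    return F.kl_div(
        F.log_softmax(masked_logits, dim=-1),
        F.softmax(full_logits, dim=-1),
        reduction="batchmean",
    )
```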
Aside: is there a reason there isn’t a top-level link-post for this paper? (If not, I’ll create one.)
Hmm, I guess there’s no guarantee that KL does better, and since we don’t have great metrics for “internal faithfulness”, maybe it’s just better to transparently optimize the flawed metric (task CE + sparsity).
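i.e. optimizing something like the following over the mask parameters (again a hypothetical sketch; the mask parameterization and the sparsity weight `lam` are my assumptions, not the paper's):

```python
import torch
import torch.nn.functional as F

def mask_objective(masked_logits, labels, mask_params, lam=1e-3):
    # "Transparent but flawed" metric: task cross-entropy plus an L1-style
    # sparsity penalty on the (sigmoid-relaxed) mask parameters.
    ce = F.cross_entropy(masked_logits, labels)
    sparsity = torch.sigmoid(mask_params).sum()
    return ce + lam * sparsity
```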
Though, as Robin notes on the AMI post, I do think the next step in this style of research is handling negative heads and self-repair in a principled way.