Great post! Two thoughts I had:
(a) (echoing jacob_drori’s point) I’m not sure about your specific trained CLT, but I believe it is possible to construct a faithful one-layer CLT for this task (you could do everything with an MLP!). In this case, does it really matter whether we are observing the circuit?
(b) I think the contribution plot is very compelling, but the case study isn’t very convincing to me, since another way to phrase it seems to be “the CLT missed the middle-layer feature in this particular case”.
Thanks for your comment, I really appreciate it!
(a) I claim we cannot construct a faithful one-layer CLT, because the real circuit contains two separate steps! What we can do (and we kinda did) is train a one-layer CLT that has perfect reconstruction always and everywhere. But the point of a CLT is NOT to have perfect reconstruction—it’s to teach us how the original model solved the task (see the toy sketch below).
(b) I think that’s the point of our post! That CLTs miss middle-layer features. And that this is not a one-off error but that the sparsity loss function incentivizes solutions that miss middle-layer features. We are working on adding more case studies to make this claim more robust.
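To make the reconstruction-vs-faithfulness distinction concrete, here is a toy sketch (my own construction in numpy, not our actual model or trained CLT): a model that genuinely computes in two steps, and a “one-layer” lookup with cross-layer writes that reconstructs every activation exactly while never representing the middle-layer feature.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10                                  # number of distinct inputs in the toy task
X = np.eye(K)                           # one-hot "tokens"

W1 = rng.normal(size=(K, 6))
W2 = rng.normal(size=(6, 3))

def two_step_model(x):
    h = np.maximum(x @ W1, 0.0)         # step 1: the middle-layer feature
    y = h @ W2                          # step 2: readout from that feature
    return h, y

H, Y = two_step_model(X)

# "One-layer CLT" analogue: one feature per input token, each memorizing
# the activations it should write to *both* layers (cross-layer writes).
decoder_to_mlp1 = H.copy()              # feature k writes H[k] to MLP 1's output
decoder_to_mlp2 = Y.copy()              # ...and Y[k] to MLP 2's output

def one_layer_reconstruction(x):
    acts = x                            # feature k fires iff the input is token k
    return acts @ decoder_to_mlp1, acts @ decoder_to_mlp2

H_hat, Y_hat = one_layer_reconstruction(X)
print(np.allclose(H, H_hat), np.allclose(Y, Y_hat))  # True True

# Reconstruction is exact and each input activates a single feature
# (maximal sparsity), but the attribution graph this implies is one
# input -> output hop: the two-step structure of the real model never shows up.
```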