Interesting, strong-upvoted for being very relevant.
My response would be that identifying accurate “labels” like “this is a tree-detector” or “this is the Golden Gate Bridge feature” is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier.
Maybe you could do this by combining DLGNs with some SAE-like method.
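To make the suggestion concrete, here is a minimal, purely illustrative sketch of what "DLGN + SAE-like method" could mean: train a plain L1 sparse autoencoder on the activations of a tiny differentiable logic gate network. Everything here is an assumption for illustration, not anyone's actual method: the layer sizes, the reduced gate set (real DLGNs mix over all 16 binary gates), the random fixed wiring, and the sparsity coefficient.

```python
# Hypothetical sketch: fit a sparse autoencoder (SAE) on the activations of a
# toy differentiable logic gate network (DLGN). All sizes, the reduced gate
# set, and the L1 weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LogicGateLayer(nn.Module):
    """Each neuron reads two fixed inputs and learns a soft mixture over a
    small set of relaxed binary gates (AND, OR, XOR, pass-through A)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # Random, fixed wiring: which two inputs feed each gate neuron.
        self.register_buffer("idx_a", torch.randint(0, in_dim, (out_dim,)))
        self.register_buffer("idx_b", torch.randint(0, in_dim, (out_dim,)))
        # Learnable logits over the 4 candidate gates per neuron.
        self.gate_logits = nn.Parameter(torch.zeros(out_dim, 4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = x[:, self.idx_a], x[:, self.idx_b]
        # Probabilistic relaxations of boolean gates on inputs in [0, 1].
        gates = torch.stack([a * b,              # AND
                             a + b - a * b,      # OR
                             a + b - 2 * a * b,  # XOR
                             a], dim=-1)         # pass-through A
        w = F.softmax(self.gate_logits, dim=-1)  # (out_dim, 4)
        return (gates * w).sum(dim=-1)


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 penalty on its hidden code."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.enc = nn.Linear(dim, hidden)
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        z = F.relu(self.enc(x))
        return self.dec(z), z


if __name__ == "__main__":
    torch.manual_seed(0)
    dlgn = nn.Sequential(LogicGateLayer(16, 64), LogicGateLayer(64, 32))
    sae = SparseAutoencoder(dim=32, hidden=128)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    l1_coeff = 1e-3  # assumed sparsity weight

    for step in range(200):
        x = torch.rand(256, 16)      # stand-in data in [0, 1]
        with torch.no_grad():
            acts = dlgn(x)           # DLGN activations the SAE decomposes
        recon, z = sae(acts)
        loss = F.mse_loss(recon, acts) + l1_coeff * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The hoped-for payoff of a combination like this is that the SAE supplies the "labels" (interpretable directions in the activations) while the DLGN's sparse, discrete wiring makes the causal connections between those labeled features easy to trace.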