beren comments on The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

beren 2 Dec 2022 15:58 UTC
1 point
This is an interesting question! At the end of the post / in the colab we experiment with knocking out specific singular directions and show that it differentially affects tokens of roughly the same semantics. We find this to be quite a robust effect, but actually changing the network's output can be surprisingly difficult, since there seems to be a large amount of redundancy, with similar processing happening in many layers/blocks simultaneously.
Knocking out every interpretable/uninterpretable column is a cool idea and we haven’t tried it. My suspicion is that this would just do too much damage to the network and scramble things, but it might be worth a shot.
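
For concreteness, here is a minimal sketch of what knocking out singular directions could look like. It assumes a plain NumPy weight matrix rather than the actual colab code, and the function name `knock_out_singular_directions` and the random stand-in matrix are hypothetical, for illustration only.

```python
import numpy as np

def knock_out_singular_directions(W, indices):
    """Return a copy of W with the chosen singular directions removed.

    Hypothetical helper: W is any 2-D weight matrix, and `indices` lists
    the singular directions to zero out (0 = largest singular value).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S = S.copy()
    # Zeroing a singular value removes that direction's contribution to W.
    S[np.asarray(indices, dtype=int)] = 0.0
    # Rebuild the matrix from the remaining directions: U @ diag(S) @ Vt.
    return (U * S) @ Vt

# Random stand-in for a real transformer weight matrix (assumption).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))

# Knock out only the top singular direction.
W_ablated = knock_out_singular_directions(W, [0])
```

Knocking out every interpretable (or uninterpretable) direction would then just mean passing the full list of those indices, which removes correspondingly more of the matrix.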