This is an interesting question! At the end of the post (and in the Colab) we experiment with knocking out specific singular directions, and show that doing so differentially affects tokens with roughly the same semantics. We find this to be quite a robust effect. Actually changing the network's output, however, can be surprisingly difficult, because there seems to be a large amount of redundancy: similar processing happens in many layers/blocks simultaneously.
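For readers who haven't opened the Colab, here is a minimal NumPy sketch of what "knocking out a singular direction" means at the level of a single weight matrix. It assumes the knockout is done by zeroing one singular value in W = U·diag(S)·Vᵀ, which removes the rank-1 term S[i]·U[:, i]·Vᵀ[i]. The matrix is random, and the function name `knock_out_singular_directions` is hypothetical. The post's projection of singular vectors onto token embeddings is not reproduced here; the two toy inputs just stand in for "tokens that line up with the direction" versus "tokens that don't".

```python
import numpy as np

def knock_out_singular_directions(W, indices):
    """Return a copy of W with the given singular directions removed.

    With W = U @ diag(S) @ Vt, removing direction i subtracts the rank-1
    term S[i] * outer(U[:, i], Vt[i]) by setting S[i] to zero.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S_kept = S.copy()
    S_kept[list(indices)] = 0.0
    # Scale each left singular vector (column of U) by its kept singular value.
    return (U * S_kept) @ Vt

# Toy stand-in for a weight matrix (random, not taken from any real model).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
_, _, Vt = np.linalg.svd(W, full_matrices=False)

W_ablated = knock_out_singular_directions(W, [0])

# An input lying along the knocked-out right singular vector is wiped out...
x_hit = Vt[0]
# ...while an input along a different singular direction passes through unchanged.
x_spared = Vt[1]

print(np.linalg.norm(W @ x_hit), np.linalg.norm(W_ablated @ x_hit))
print(np.allclose(W @ x_spared, W_ablated @ x_spared))
```

The first line prints the removed singular value next to a number at floating-point noise level, and the second prints `True`. Inputs aligned with the ablated direction are hit hard, while orthogonal ones are untouched. In a real network, of course, other layers may compensate for the removed direction downstream.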
Knocking out every interpretable/uninterpretable column is a cool idea, and we haven’t tried it. My suspicion is that this would simply do too much damage to the network and scramble things, but it might be worth a shot.
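One cheap way to get a feel for the "too much damage" worry before running the experiment is to measure how much of a weight matrix each set of directions accounts for. The sketch below is only a weight-space proxy, not a measurement of the effect on network output. Because U and V are orthonormal, ‖W − W_ablated‖_F² equals the sum of the squared removed singular values. The matrix, the split into "interpretable" and "uninterpretable" indices, and the helper name `relative_ablation_damage` are all hypothetical.

```python
import numpy as np

def relative_ablation_damage(W, indices):
    """Return ||W - W_ablated||_F / ||W||_F for zeroing the given singular values.

    Because U and V are orthonormal, ||W - W_ablated||_F**2 equals the sum of
    the squared removed singular values, so no reconstruction is needed.
    """
    S = np.linalg.svd(W, compute_uv=False)
    removed = np.zeros(S.shape, dtype=bool)
    removed[list(indices)] = True
    return float(np.sqrt(np.sum(S[removed] ** 2) / np.sum(S ** 2)))

# Toy matrix and a made-up labelling of its 32 singular directions
# (assumption: in practice labels might come from inspecting top tokens).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
interpretable = [0, 1, 2, 5, 8, 13]  # hypothetical indices
uninterpretable = [i for i in range(32) if i not in interpretable]

d_interp = relative_ablation_damage(W, interpretable)
d_uninterp = relative_ablation_damage(W, uninterpretable)
print(f"interpretable set removed:   {d_interp:.2f} of ||W||_F")
print(f"uninterpretable set removed: {d_uninterp:.2f} of ||W||_F")
```

Since the two sets partition all the directions, the squared fractions always sum to 1. Whichever set holds most of the spectrum's energy is the one whose removal is most likely to scramble things.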