You aren’t thinking of this Chris Olah article, are you? If not, I’d love to hear if you ever remember what it was.
As for applying this theory to transformers: that is a very good question! I wish I had a good answer, but that is very much one of my next goals. I want to become more familiar with the transformer architecture, get some actual hands-on MI experience (I’ve only worked with vanilla and conv nets), and analyze their structure from a math / geometry / topology perspective.
I will say that it seems to me that every “special” architecture I’ve looked at so far can be viewed mathematically as a very big vanilla network with specific rules imposed on the weights. For example, LSTMs can be unrolled into very deep networks with the rule that every copy of the “unrolled” network shares its weights with all the others. For conv layers, it is as if we took a fully connected layer, set every weight connecting to a far-away pixel to zero, and then made each output pixel share the kernel weights with all the others (see the little sketch below).
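To make that concrete, here is a tiny NumPy sketch (my own illustration, not taken from any particular source) of the conv case: a 1-D convolution rebuilt as one big fully connected weight matrix where far-away weights are zero and every output position reuses the same kernel.

```python
# A minimal sketch: a conv layer viewed as a fully connected layer with two
# extra rules -- weights to "far away" inputs are fixed at zero, and every
# output position shares the same kernel weights.
import numpy as np

rng = np.random.default_rng(0)

kernel = rng.standard_normal(3)       # a single 1-D kernel of width 3
x = rng.standard_normal(10)           # a toy input signal of length 10
n_out = len(x) - len(kernel) + 1      # "valid" convolution output length

# Ordinary convolution (really cross-correlation, as in most DL libraries).
conv_out = np.array([np.dot(kernel, x[i:i + len(kernel)]) for i in range(n_out)])

# The same operation as a fully connected layer: an (n_out x len(x)) weight
# matrix that is zero everywhere except a sliding band of shared kernel weights.
W = np.zeros((n_out, len(x)))
for i in range(n_out):
    W[i, i:i + len(kernel)] = kernel   # every row reuses the same kernel

fc_out = W @ x

assert np.allclose(conv_out, fc_out)  # the two views agree exactly
```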
I suspect something like this is true for transformers as well, but I haven’t yet studied the architecture well enough to say so with confidence, much less to draw useful implications about transformers from the concept. Still, I’m optimistic about it and plan to pursue the idea over the coming months.