Thanks for posting this! Your description of transformations between layers, squashing & folding etc., reminds me of some old-school ML explanations of “how do multi-layer perceptrons work” (this is not meant as a bad thing, but as a potential direction to look into!), though I can’t think of references right now.
It also reminds me of Victor Veitch’s group’s work, e.g. Park et al., though pay special attention to the refutation(?) of this particular paper.
Finally, I can imagine connecting what you say to my own research agenda around “activation plateaus” / “stable regions”. I’m in the process of producing a better write-up to explain my ideas, but essentially I have the impression that NNs map discrete regions of activation space to specific activations later in the model (squashing?), and I wonder whether we can make use of these regions.
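To give a sense of the kind of probe I have in mind, here is a minimal sketch: sample small perturbations around a base input and check whether a later layer’s activations stay nearly constant, i.e. whether the region gets “squashed” to one point downstream. The two-layer net, its weights, and all sizes here are random stand-ins, not any trained model, so don’t expect strong plateaus from this toy itself:

```python
# Hypothetical "activation plateau" probe: does a small region of input
# space collapse to (nearly) one point at a later layer?
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer net, relu(W1 x) -> tanh(W2 h). Random stand-in weights.
W1 = rng.normal(size=(64, 16))
W2 = rng.normal(size=(32, 64))

def later_layer(x):
    h = np.maximum(W1 @ x, 0.0)   # first-layer (relu) activations
    return np.tanh(W2 @ h)        # "later" activations we measure

base = rng.normal(size=16)
base_act = later_layer(base)

# Perturb the input within balls of increasing radius and measure how far
# the later-layer activations move. A plateau would show downstream
# movement growing much more slowly than the perturbation size.
for eps in (0.01, 0.1, 1.0):
    moves = [np.linalg.norm(later_layer(base + eps * rng.normal(size=16)) - base_act)
             for _ in range(100)]
    print(f"eps={eps:4}: mean downstream movement = {np.mean(moves):.4f}")
```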
Hey! Thanks for the links. I’ll look into them.

Your description of “old-school ML explanations” makes me think of this Chris Olah article. It, along with the work of Mingwei Li (grand tour, umap tour) and a good deal of time spent reasoning about the math and geometry of NNs, is what I base my current POV on.
“map discrete regions of activation space to specific activations later on”
If I understand correctly, this corresponds to one of my key claims: that “position”, not “direction”, is fundamental to semantics in activation spaces. Where “direction” is relevant, it may be local and distort over distance, more like a vector field than a single vector that applies to the whole space.
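Here is a toy numpy construction of my own (not from your post) that illustrates the vector-field picture: pushing one fixed input direction v through a tanh layer gives a downstream direction J(x) v that depends on the base point x, so the “same” direction points different ways in different parts of activation space:

```python
# Toy illustration: a single global input direction v becomes a
# position-dependent direction after a nonlinearity, since the Jacobian
# of tanh(W x) varies with x.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(32, 16))
v = rng.normal(size=16)
v /= np.linalg.norm(v)

def pushforward(x, v):
    # Jacobian of tanh(W x) is diag(1 - tanh^2(W x)) @ W, so J(x) v is
    # W v scaled elementwise by the local tanh slopes.
    slopes = 1.0 - np.tanh(W @ x) ** 2
    return slopes * (W @ v)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x1, x2 = rng.normal(size=16), 3.0 * rng.normal(size=16)
d1, d2 = pushforward(x1, v), pushforward(x2, v)
print("cosine between the two downstream directions:", round(cosine(d1, d2), 3))
```

If direction were global, the two pushed-forward directions would agree; here their cosine similarity is well below 1.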
In this section and the following one of this video, I give some more description of the idea, if you are interested.
I’m planning to do self-study from next week, once I finish my final exam, through November. One of the things I want to do is a deep dive on the transformer architecture, attempting to extend my understanding of these concepts, as they apply to vanilla and conv nets, to transformer-based nets.
I look forward to seeing your future work!