Ah yes, that makes perfect sense, though I should say the reason I did not consider it significant is that the non-linearity in attention happens along the sequence dimension, not the feature dimension that the MLP blocks operate on. In my view, attention may decide which tokens get amplified, but not the underlying features within a particular token’s feature space. Again, I am a fresh graduate, so I might be wrong, but this type of topic invigorates me.
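To make that concrete, here’s a toy NumPy sketch of my own (a single head, no learned projections, random toy values rather than anything from the post), just to show which axis the softmax acts on:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))   # one row per token

# Attention scores and the softmax are taken over the SEQUENCE axis.
scores = x @ x.T / np.sqrt(d_model)                 # (seq_len, seq_len)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)                       # one scalar per (query, key) pair

out = w @ x                                         # (seq_len, d_model)

# Every feature of a given key token is scaled by the same scalar weight,
# so the softmax gates whole token vectors rather than thresholding
# individual features inside a token.
print(w.shape, out.shape)   # (4, 4) (4, 8)
```

The MLP non-linearity, by contrast, is applied feature-by-feature, which is the part I would expect to do the actual thresholding.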
Tonny M
I might be missing something here, but isn’t it obvious that the non-linearity must be responsible for the activation cliffs and plateaus? If you think of a layer as performing an origami fold in some higher dimension, then the non-linearity, whatever it is, is the permanent crease, except that in this case we are dealing with infinitely elastic paper, so it stretches the paper at the crease in order to discriminate between its inputs. Combine that behaviour over many layers and you get what you see in these results: the MLP blocks acting as compound filters. The source of these could not be attention, because the only thing attention does is rotate the inputs into a space where they can be checked for alignment with each other. It is angular, and doesn’t change the magnitude. I am open to correction and/or an explanation.
Hiya Matthew,
That’s quite an astute point about the possibility of attention producing plateau-like outputs. I would just like to explain my rationale for why I did not think the attention mechanism’s contribution was significant: the non-linearity in attention acts strictly along the sequence axis, so I surmised that the attention score derived from the softmax only gates and amplifies a token’s vector as a whole, and doesn’t threshold the individual features within that token.
I would like to reiterate that I could be wrong, since I am only a fresh graduate with minimal experience, but I find the topic itself fascinating because of its implications.
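For contrast, here is an equally rough sketch of the MLP side (again just toy random weights of my own, not taken from the post): the elementwise non-linearity gives every hidden feature its own threshold, which is where I would expect the cliff and plateau shapes to originate.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden = 8, 32
W1 = rng.normal(size=(d_hidden, d_model))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_model, d_hidden))

def mlp_block(x):
    # ReLU is applied feature-by-feature: each hidden unit has its own threshold.
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

# Sweep the input along a fixed direction and watch one output feature:
# the response is piecewise linear, with a kink every time some hidden
# unit switches on or off. Stacked over many layers, that per-feature
# gating is the behaviour I had in mind.
direction = rng.normal(size=d_model)
for t in np.linspace(-2.0, 2.0, 9):
    y = mlp_block(t * direction)
    active = int((W1 @ (t * direction) + b1 > 0).sum())
    print(f"t={t:+.2f}  active_units={active:2d}  feature_0={y[0]:+8.3f}")
```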