Hiya Matthew,
That’s an astute point about the possibility of attention producing plateau-like outputs. I’d just like to provide my rationale for why I didn’t think the attention mechanism’s contribution was significant: the non-linearity in attention acts strictly along the sequence axis, so I surmised that the softmax-derived attention scores only gate and amplify a token’s vector as a whole, without thresholding the individual features within that token.
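To make that concrete, here’s a minimal NumPy sketch of single-head scaled dot-product attention (toy shapes and random data are my own assumptions, just for illustration). It shows that each output token is a convex combination of value vectors: every feature of a given value vector is scaled by the same attention weight, in contrast to an element-wise non-linearity like ReLU that thresholds features individually.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 4, 3  # toy sizes, chosen arbitrarily
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

# Softmax is applied along the sequence axis: each row of A sums to 1
A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # shape (seq_len, seq_len)
out = A @ V

# out[i] = sum_j A[i, j] * V[j]: the scalar weight A[i, j] multiplies
# the ENTIRE vector V[j], so attention gates/amplifies whole token
# vectors; it never applies a per-feature threshold the way ReLU does.
```

So for a fixed set of attention weights, the output is linear in the value vectors feature by feature, which is why I didn’t expect attention on its own to carve out plateau-like regions within a token’s features.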
That said, I could well be wrong; I’m only a fresh graduate with minimal experience. Still, I find the topic itself fascinating because of its implications.