Ah yes, that makes perfect sense, though I should say the reason I did not consider it significant is that the non-linearity in attention happens along the sequence dimension, not the feature dimension that the MLP blocks operate on. In my view, attention may decide which tokens get amplified, but not the underlying features within a particular token’s feature space. Again, I am a fresh graduate, so I might be wrong, but this type of topic invigorates me.
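To make that concrete, here’s a toy NumPy sketch of my own (a single head, no learned projections, random toy values rather than anything from the post), just to show which axis the softmax acts on:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))   # one row per token

# Attention scores and the softmax are taken over the SEQUENCE axis.
scores = x @ x.T / np.sqrt(d_model)                 # (seq_len, seq_len)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)                       # one scalar per (query, key) pair

out = w @ x                                         # (seq_len, d_model)

# Every feature of a given key token is scaled by the same scalar weight,
# so the softmax gates whole token vectors rather than thresholding
# individual features inside a token.
print(w.shape, out.shape)   # (4, 4) (4, 8)
```

The MLP non-linearity, by contrast, is applied feature-by-feature, which is the part I would expect to do the actual thresholding.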
Tonny M
I might be missing something here, but isn’t it obvious that the non-linearity must be responsible for the activation cliffs and plateaus? If you think of a layer as performing an origami fold in some higher dimension, then the non-linearity, whatever it is, is the permanent crease, except that in this case we are dealing with infinitely elastic paper, so it stretches the paper at the crease in order to discriminate between its inputs. Combine that behaviour over many layers and you get what you see in these results: the MLP blocks acting as compound filters. The source of these could not be attention, because the only thing attention does is rotate the inputs into a space where they can be checked for alignment with each other. It is angular, and doesn’t change the magnitude. I am open to correction and/or an explanation.
Hiya Matthew,
That’s quite an astute point about the possibility of attention producing plateau-like outputs. I would just like to explain my rationale for why I did not think the attention mechanism’s contribution was significant: the non-linearity in attention acts strictly along the sequence axis, so I surmised that the attention score derived from the softmax only gates and amplifies a token’s vector as a whole, and doesn’t threshold the individual features within that token.
I would like to reiterate that I could be wrong, since I am only a fresh graduate with minimal experience, but I find the topic itself fascinating because of its implications.
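For contrast, here is an equally rough sketch of the MLP side (again just toy random weights of my own, not taken from the post): the elementwise non-linearity gives every hidden feature its own threshold, which is where I would expect the cliff and plateau shapes to originate.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden = 8, 32
W1 = rng.normal(size=(d_hidden, d_model))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_model, d_hidden))

def mlp_block(x):
    # ReLU is applied feature-by-feature: each hidden unit has its own threshold.
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

# Sweep the input along a fixed direction and watch one output feature:
# the response is piecewise linear, with a kink every time some hidden
# unit switches on or off. Stacked over many layers, that per-feature
# gating is the behaviour I had in mind.
direction = rng.normal(size=d_model)
for t in np.linspace(-2.0, 2.0, 9):
    y = mlp_block(t * direction)
    active = int((W1 @ (t * direction) + b1 > 0).sum())
    print(f"t={t:+.2f}  active_units={active:2d}  feature_0={y[0]:+8.3f}")
```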