The formula is assuming a linear compute cost in number of parameters, not in network width. Fully-connected layers have a number of parameters quadratic in network width, one for each connection between neuron pairs (and this is true for non-transformers as much as transformers).
The formula is assuming a linear compute cost in number of parameters, not in network width. Fully-connected layers have a number of parameters quadratic in network width, one for each connection between neuron pairs (and this is true for non-transformers as much as transformers).
Ah right. Thank you!