dsj comments on [missing post]

dsj 18 Jul 2022 15:50 UTC
9 points
The formula assumes that compute cost is linear in the number of parameters, not in network width. A fully-connected layer has a number of parameters quadratic in network width, one weight for each connection between a pair of neurons in adjacent layers (and this is true for non-transformers as much as transformers).
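As a rough illustration of the quadratic scaling, here is a minimal sketch that counts the weights in a single square fully-connected layer. The helper name `dense_params` and the assumption that the layer maps `width` inputs to `width` outputs are mine, chosen only for the example; real architectures mix layer shapes, so treat the numbers as indicative.

```python
def dense_params(width: int, bias: bool = False) -> int:
    """Count parameters in one fully-connected layer mapping width -> width units.

    Hypothetical helper for illustration: assumes a square layer.
    """
    weights = width * width  # one weight per (input neuron, output neuron) pair
    biases = width if bias else 0  # optional bias term, one per output neuron
    return weights + biases


# Doubling the width quadruples the weight count.
for width in (256, 512, 1024):
    print(width, dense_params(width))
# 256 65536
# 512 262144
# 1024 1048576
```

So a cost that is linear in parameter count is, under this assumption, roughly quadratic in width.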
- aogara 18 Jul 2022 16:43 UTC
  2 points
  Ah right. Thank you!