You should probably also be tracking the kind of parameter. I see you have Switch and GShard in there, but, as you can see from how visibly they are outliers, MoEs (and embeddings) use much weaker ‘parameters’, as it were, than dense models like GPT-3 or Turing-NLG. Plotting by FLOPS would help correct for this; perhaps we need graphs like training-FLOPS per parameter? That would also help correct for comparisons across methods, such as to older architectures like SVMs. (Unfortunately, this still obscures that the key thing about Transformers is better scaling laws than RNNs or n-grams etc, where the high FLOPS-per-parameter translates into better curves...)
Thank you for the feedback, I think what you say makes sense.
I’d be interested in seeing whether we can pin down exactly in what sense Switch parameters are “weaker”. Is it because of the lower precision? Or model sparsity (is Switch sparse in its parameters, or just sparsely activated)?
What do you think: what typology of parameters would make sense / be useful to include?
It’s not the numerical precision but the model architecture being sparse, such that you only activate a few experts at runtime, and only a small fraction of the model runs for each input. It may be 1.3t parameters or whatever, but then at runtime, only, I dunno, 20b parameters actually compute anything. This cheapness of forward passes/inferencing is the big selling point of MoE for training and deployment: that you don’t actually ever run 1.3t parameters. But it’s hard for parameters which don’t run to contribute anything to the final result, whereas in GPT-3, pretty much all of those 175b parameters can participate in each input. It’s much clearer if you think about comparing them in terms of FLOPS at runtime, rather than static parameter counts. GShard/Switch is just doing a lot less.
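To make the static-vs-runtime distinction concrete, here is a rough sketch that counts total versus per-token active parameters for a single MoE feed-forward layer. Everything in it is an illustrative assumption rather than the actual Switch or GShard configuration: the function name `moe_param_counts`, the layer sizes, the two-matrix expert with biases ignored, and the usual rule of thumb of about 2 FLOPs per active parameter per token for a forward pass.

```python
def moe_param_counts(d_model, d_ff, num_experts, top_k):
    """Return (total, active_per_token) parameter counts for one MoE FFN layer.

    Assumes each expert is a plain two-matrix feed-forward block
    (d_model x d_ff and d_ff x d_model, biases ignored) and that the
    router is a single d_model x num_experts matrix.
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * num_experts
    total = num_experts * per_expert + router
    # Only the top_k selected experts (plus the router) run for a given token.
    active = top_k * per_expert + router
    return total, active


# Hypothetical sizes, chosen only for illustration.
total, active = moe_param_counts(d_model=1024, d_ff=4096, num_experts=64, top_k=1)

# Rule-of-thumb forward cost: ~2 FLOPs (one multiply-add) per active parameter.
forward_flops_per_token = 2 * active

print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
print(f"fraction active:   {active / total:.4f}")
print(f"approx FLOPs/token: {forward_flops_per_token:,}")
```

With these made-up sizes, only a small fraction of the layer's parameters runs for any one token, whereas a dense layer of the same total size would have every parameter participate in every token. That gap is what a FLOPS-based comparison would surface.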
(I also think that the scaling curves and comparisons hint at Switch learning qualitatively worse things, and the modularity encouraging more redundancy and memorization-heavy approaches, which impedes any deeper abstractions or meta-learning-like capabilities that a deep dense model might learn. But this point is much more speculative, and not necessarily something that, say, translation researchers would care too much about.)
This point about runtime also holds for those chonky embeddings people sometimes bring up as examples of ‘models with billions of parameters’: sure, you may have a text or category embedding which has billions of ‘parameters’, but for any specific input, only a handful of those parameters actually do anything.
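The same accounting can be sketched for an embedding table. The helper name `embedding_params_touched` and the sizes below are hypothetical, and the sketch assumes a plain lookup table where each distinct token id reads exactly one row.

```python
def embedding_params_touched(vocab_size, dim, token_ids):
    """Return (table_params, touched_params) for a plain embedding lookup.

    Assumes one row of `dim` values per vocabulary entry; a given input
    reads only the rows belonging to the distinct ids it contains.
    """
    table_params = vocab_size * dim
    touched_params = len(set(token_ids)) * dim
    return table_params, touched_params


# Hypothetical table: one million entries, 128 dimensions each.
table_params, touched_params = embedding_params_touched(
    vocab_size=1_000_000, dim=128, token_ids=[3, 17, 17, 42]
)

print(f"parameters in table:   {table_params:,}")
print(f"parameters this input: {touched_params:,}")
```

So a table can be counted as having a very large number of ‘parameters’ while any single input reads only a few rows of it.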