I’d hesitate to make predictions based on the slowdown of GPT-3 to Megatron-Turing, for two reasons.
First, GPT-3 represents the fastest, largest increase in model size in this whole chart. If you only look at the models before GPT-3, the drawn trend line tracks well. Note how far off the trend GPT-3 itself is.
Second, GPT-3 was released almost exactly when COVID became a serious concern in the world beyond China. I must imagine that this slowed down model development, but it will be less of a factor going forward.
If you just look at models before GPT-3, the trend line you’d draw is still noticeably steeper than the actual line on the graph. (ELMO and BERT large are below trend while T5 and Megatron 8.3B are above.) The new Megatron would represent the biggest trend line undershoot.
Also, I think any post COVID speedup will be more than drown out by the recent slow down in the rate at which compute prices fall. They were dropping by an OOM every 4 years, but now it’s every 10-16 years.
We miss GPT-4 data point for this chart. New Megatron is more like one more replication of GPT-3, and another one is new Chinese 245 Billion parameters model. But Google had trillion parameter model in fall 2020.
Better citation for Chinese 245B model here: Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning.
Google had trillion parameter model in fall 2020.
That’s new to me. Any citation for this?
It was not google, but Microsoft: in September 2020 they wrote: “The trillion-parameter model has 298 layers of Transformers with a hidden dimension of 17,408 and is trained with sequence length 2,048 and batch size 2,048.” https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
That was just the tech demo. It obviously wasn’t actually trained, just demoed for a few steps to show that the code works. If they had trained it back then, they’d, well, have announced it like OP! OP is what they’ve trained for-reals after further fiddling with that code.
I’m not sure how relevant the slowdown in compute price decrease is to this chart, since it starts in 2018 and the slowdown started 6-8 years ago; likewise, AlexNet, the breakout moment for deep learning, was 9 years ago. So if compute price is the primary rate-limiter, I’d think it would have a more gradual, consistent effect as models get bigger and bigger. The slowdown may mean that models cost quite a lot to train, but clearly huge companies like Nvidia and Microsoft haven’t started shying away yet from spending absurd amounts of money to keep growing their models.