To what extent are the scaling properties of Transformer networks exceptional?

Part of the point of GPT-3 is that bigger continues to be better. (Computerphile discussion.) A recent question asked whether this would turn out to be true for other architectures as well. But that question seemed to take for granted that we haven’t yet seen this phenomenon in other cases. To what extent is this scaling phenomenon special to GPT? To what extent is it special to Transformer networks? To what extent is it special to unsupervised NLP?

My impression:

By 2011, the “bigger is better” trend was already well-established in deep learning. (See “Big Data” on Google Trends.) Major breakthroughs in what neural networks can do (in terms of performance on tasks such as image recognition) have generally been facilitated by bigger models, more data, and more training time, even in cases where there are also technical breakthroughs (such as convolutional neural networks). So, to an extent, there is nothing special about Transformers or GPT.
However, the data-hungry nature of deep learning has meant that labelled datasets are a major bottleneck to scaling. GPT, like other unsupervised learning methods, does not face this problem. In this sense, it does have a special scaling advantage.
Furthermore, for the particular task of NLP, we continue to see quantitative and qualitative improvements that we care about (at least intellectually) as we pour more money into it. In other words, NLP has a looooong and gradual learning curve (at least if you look at it a certain way). This means the task is difficult enough that throwing more resources at it keeps paying off, while easy enough to feel like you’re getting something out of doing so.

abramdemski 28 Jul 2020 20:06 UTC

LW: 30 AF: 14

GPT AI

Gordon Seidoh Worley 29 Jul 2020 2:23 UTC
LW: 2 AF: 1
Most systems eventually face scaling bottlenecks. In fact, unless your system is completely free of coordination, it definitely has bottlenecks, even if you haven’t scaled far enough to hit them. Transformers definitely require some coordination: no matter how large the models are or how much parallelism their hardware supports, they still produce a single reduced output. So we should expect some scaling limits on Transformers that, at some size, will prevent them from effectively taking advantage of a larger network.
Further, and you point at this a bit, most systems also experience diminishing returns on performance for additional resources because of these constraints.
Transformers may just be special in that they have yet to start hitting diminishing returns, because we haven’t yet run up against their coordination bottlenecks. That doesn’t make them too special, though: we should expect those bottlenecks to be lying in wait somewhere, just as they are in every other system that is not coordination-free.
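
To make the diminishing-returns intuition concrete, here is a minimal sketch using Amdahl's law. The comment above does not name this law; it is used here only as an assumed stand-in for "coordination cost". It models compute speedup from adding parallel workers, not model quality as a function of network size. The 95% parallel fraction is a made-up figure for illustration, not a measurement of any Transformer.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup over a single worker when only part of the work parallelizes.

    The remaining (1 - parallel_fraction) is treated as serial
    "coordination" work that extra workers cannot shorten.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical assumption: 95% of the work parallelizes perfectly.
PARALLEL_FRACTION = 0.95

if __name__ == "__main__":
    for workers in (1, 10, 100, 1000, 10000):
        speedup = amdahl_speedup(PARALLEL_FRACTION, workers)
        print(f"{workers:>6} workers -> {speedup:.2f}x speedup")
```

Under this toy model the speedup can never exceed 1 / (1 - 0.95) = 20x, however many workers are added, and each tenfold increase in workers buys less than the one before. A system far from that ceiling would look as if it scales almost freely, which is one way to read the suggestion that Transformers simply have not reached their bottlenecks yet.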
