[Question] To what extent are the scaling properties of Transformer networks exceptional?

Part of the point of GPT-3 is that bigger continues to be better. (Computerphile discussion.) A recent question asked whether this would turn out to be true for other architectures as well, but it seemed to take for granted that we haven’t already seen this phenomenon in other cases. To what extent is this scaling phenomenon special to GPT? To what extent is it special to Transformer networks? To what extent is it special to unsupervised NLP?

My impression:

  • By 2011, the “bigger is better” trend was already well established in deep learning. (See “Big Data” on Google Trends.) Major breakthroughs in what neural networks can do (in terms of performance on tasks such as image recognition) have generally been facilitated by bigger models, more data, and more training time, even in cases where there are also technical breakthroughs (such as convolutional neural networks). So, to an extent, there is nothing special about Transformers or GPT.

  • However, the data-hungry nature of deep learning has meant that labelled datasets are a major bottleneck to scaling. GPT, like other unsupervised learning methods, does not face this problem. In this sense, it does have a special scaling advantage.

  • Furthermore, for NLP in particular, we continue to see quantitative and qualitative improvements that we care about (at least intellectually) as we pour more money into scaling. In other words, NLP has a looooong and gradual learning curve (at least if you look at it a certain way). This means the task is difficult enough that we keep seeing benefits from throwing more resources at it, while easy enough that it feels like we’re getting something out of doing so.