[Question] To what extent are the scaling properties of Transformer networks exceptional?

Part of the point of GPT3 is that bigger continues to be better. (Computerphile discussion.) A recent question asked whether this would turn out to be true for other architectures as well. But the question seemed to take for granted that we haven't seen this phenomenon in other cases yet. To what extent is this scaling phenomenon special to GPT? To what extent is it special to Transformer networks? To what extent is it special to unsupervised NLP?

My impression:

  • By 2011, the “bigger is better” trend was already well-established in deep learning. (See “Big Data” on Google Trends.) Major breakthroughs in what neural networks can do (in terms of performance on tasks such as image recognition) have generally been facilitated by bigger models, more data, and more training time, even in cases where there were also technical breakthroughs (such as convolutional neural networks). So, to an extent, there is nothing special about Transformers or GPT.

  • However, the data-hungry nature of deep learning has meant that labelled datasets are a major bottleneck to scaling. GPT, like other unsupervised learning methods, does not face this problem: the training signal comes from the raw text itself (a toy illustration of this is sketched after this list). In this sense, it does have a special scaling advantage.

  • Furthermore, for the particular task of NLP, we continue to see quantitative and qualitative improvements that we care about (at least intellectually) as we pour more money into it. In other words, NLP has a looooong and gradual learning curve, at least if you look at it a certain way (a rough sketch of that kind of curve is also below). This means the task is difficult enough to show the benefits of throwing more at it, while easy enough to feel like you’re getting something out of doing so.
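
To make the labelled-data point concrete, here is a minimal sketch of the next-token-prediction setup that GPT-style models train on: the “labels” are just the text shifted by one token, so any raw corpus becomes training data without human annotation. The toy corpus and whitespace tokenizer below are stand-ins for illustration, not GPT’s actual tokenizer.

```python
# Toy illustration: next-token prediction turns raw text into
# (context, target) training pairs with no human labelling.
# The corpus and whitespace "tokenizer" here are made up for illustration.

raw_corpus = "the cat sat on the mat"   # pretend this is scraped web text
tokens = raw_corpus.split()             # stand-in tokenizer

# Pair every prefix of the text with the token that follows it.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples:
    print(f"context = {' '.join(context)!r:30} -> target = {target!r}")
```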

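And to make the “long, gradual learning curve” point concrete, here is a minimal sketch of what a power-law relationship between compute and loss looks like, which is roughly the shape reported in the scaling-law literature. The coefficients below are invented for illustration, not numbers from any actual study; the point is only that each extra order of magnitude of compute buys a modest but non-zero improvement, so the curve never quite hits a wall.

```python
import numpy as np

# Hypothetical power law L(C) = a * C**(-alpha); the coefficients are
# invented for illustration, not taken from any scaling-law paper.
a, alpha = 5.0, 0.05

# Compute budgets spanning eight orders of magnitude.
compute = np.logspace(0, 8, 9)
loss = a * compute ** (-alpha)

for c, l in zip(compute, loss):
    print(f"compute = 1e{int(round(np.log10(c)))}  ->  loss ~ {l:.3f}")
```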