In addition to the original Brown et al 2020 examples, text style transfer, meta-learning instructability*, RL-finetuning of summarization, self-critique of math word problems, and maybe the improving zero-shot translation & program writing/dialogue (I’d have to double-check those), have been shown with GPT-3 and LaMDA to ‘kick in’ at certain sizes going from the O(1b) models to 10–1000b. Nobody seems very surprised these days to see something work on GPT-3-175b but then not on ~1b.
* Should we count all of the examples of meta-learning / generalization which require diverse environments to get abruptly better performance out of sample, like XLand or the MuZero meta-learning paper I mention over in EfficientZero? That’s definitely a stark jump in performance: the single-environment agents, no matter how good in the primary environment, typically perform extremely poorly or even near floor in the new environment.
Thanks.. I was looking for more graphs with discontinuous jumps and “# of parameters” on the x-axis… but I think “totally new and unexpected capabilities after going from GPT-2 to GPT-3” is a reasonable thing to point at, also. The scaling laws bibliography is super, super useful. I am just embarking on making my way through it now..
You can dig those ‘money shot’ capability jump graphs out of the papers, usually, I think. I try to add them to annotations when I make them because that’s a very critical stylized fact about DL’s blessings of scale. I’m not going to look now, but Brown has the graphs, and I’m pretty sure the text style transfer & RL finetuning do have the money shot graphs, and probably the others. XLand and MuZero might have them if you squint (not necessarily in parameter # - parameters aren’t the only thing that scales, remember!).
Also I just realized that the “grokking” phenomenon is relevant here. The “grokking” paper shows jumps during training, but it’s similar. From the lens of the lottery ticket hypothesis, it’s not surprising that grokking may be easier / more likely in larger models.
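(For reference, the grokking setup really is tiny: train on a fraction of a binary-operation table such as a + b mod p and hold out the rest, then keep training long past memorization. A minimal numpy sketch of generating that data — assuming p = 97 and a 50% train split, which I believe matches the paper’s modular-addition experiments, but treat those numbers as illustrative:)

```python
import numpy as np

def modular_addition_data(p: int = 97, train_frac: float = 0.5, seed: int = 0):
    """All (a, b) -> (a + b) mod p examples, split into train/validation sets."""
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    pairs = np.stack([a.ravel(), b.ravel()], axis=1)   # (p*p, 2) input pairs
    labels = pairs.sum(axis=1) % p                     # (p*p,) targets
    # Random split: the held-out cells of the table are what the model
    # eventually "groks" long after it has memorized the training cells.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(pairs))
    cut = int(train_frac * len(pairs))
    train, val = order[:cut], order[cut:]
    return (pairs[train], labels[train]), (pairs[val], labels[val])

(train_x, train_y), (val_x, val_y) = modular_addition_data()
```

The whole dataset is only p² = 9409 examples, which is part of why it’s hard to read grokking as a blessings-of-scale story.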
I wonder how much “grokking” is new to transformers. I happened to stumble across an example in the literature where a CNN model “fails to grok” the Game of Life: https://arxiv.org/abs/2009.01398 .. I wonder what would happen if you used a transformer model instead..
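(For concreteness: the task in that paper is learning the Game of Life update rule end-to-end from examples. A minimal numpy sketch of the ground-truth step function such an experiment would use to generate training targets — assuming toroidal wrap-around boundaries, which is one common convention, not necessarily the paper’s:)

```python
import numpy as np

def life_step(board: np.ndarray) -> np.ndarray:
    """One Game of Life step on a 2D 0/1 array, with wrap-around edges."""
    # Sum the 8 neighbors by rolling the board in every direction.
    neighbors = sum(
        np.roll(np.roll(board, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Next state: exactly 3 neighbors -> alive; alive with exactly 2 -> stays alive.
    return ((neighbors == 3) | ((board == 1) & (neighbors == 2))).astype(board.dtype)

# Sanity check: a "blinker" (three cells in a row) oscillates with period 2.
blinker = np.zeros((5, 5), dtype=np.int64)
blinker[2, 1:4] = 1
assert np.array_equal(life_step(life_step(blinker)), blinker)
```

The rule itself is a tiny local function, which is what makes the CNN’s failure to learn it interesting.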
I hesitate to call grokking an example of blessings of scale because it’s still not clear what is going on there with grokking or patient teacher. They are, after all, tiny models, and patient teacher is all about distilling to small models. And the need for regularization is strange if it’s a scaling thing where larger=better: what, the regularization by tininess isn’t enough, it needs more regularization from weight decay?
I doubt grokking is unique to Transformers. The research I see as most related to grokking, the shallow-minima-finding paradigm with its wide basins & cyclic learning rates, is well-established for CNNs. Not finding it for one particular CNN is pretty weak evidence, given that the grokking paper shows you can land anywhere from ~0% to something like 90% depending on the details of the setup and how long you run.
Great..
Also, please check out my comment on your Scaling Laws bibliography page when you get a chance.