I hesitate to call grokking an example of blessings of scale because it’s still not clear what is going on there with grokking or patient teacher. They are, after all, tiny models, and patient teacher is all about distilling to small models. And the need for regularization is strange if it’s a scaling thing where larger=better: what, the regularization by tininess isn’t enough, it needs more regularization from weight decay?
I doubt grokking is unique to Transformers. The research I see as most related to grokking, the finding shallow minima paradigm with the wide basins & cyclic learning rates, are well-established for CNNs. Not finding it for some CNN is pretty weak evidence, given the grokking paper showing that you can go anywhere from like 0% to what was it 90%? depending on the details of the setup and how long you run.
I hesitate to call grokking an example of blessings of scale because it’s still not clear what is going on there with grokking or patient teacher. They are, after all, tiny models, and patient teacher is all about distilling to small models. And the need for regularization is strange if it’s a scaling thing where larger=better: what, the regularization by tininess isn’t enough, it needs more regularization from weight decay?
I doubt grokking is unique to Transformers. The research I see as most related to grokking, the finding shallow minima paradigm with the wide basins & cyclic learning rates, are well-established for CNNs. Not finding it for some CNN is pretty weak evidence, given the grokking paper showing that you can go anywhere from like 0% to what was it 90%? depending on the details of the setup and how long you run.