I like this article. I think it’s well-thought-out reasoning about possible futures, and I think it largely matches my own views.
I especially appreciate that it goes into possible explanations for why scaling happens, rather than just taking it for granted. A bunch of the major points in my own mental models of this are hit in the article (double descent, loss landscapes, grokking).
The biggest point of disagreement I have is with grokking. I think I agree this is important, but I think the linked example (video?) isn’t correct.
First: Grokking and Metrics
It’s not surprising (to me) that we see phase changes in correctness/exact substring match scores, because they’re pretty fragile metrics—get just a part of the string wrong, and your whole answer is incorrect. For long sequences you can see these as a kind of N-chain.
(N-chain is a simple RL task where an agent must cross a bridge, and if it misses a step at any point it falls off. If you evaluate by “did it get to the end?”, learning exhibits a phase-change-like effect; if you evaluate by “how far did it go?”, progress is smoother.)
I weakly predict that many of these ‘phase changes in results’ are similar effects to this (though not all of them, e.g. double descent).
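To make the fragile-metrics point concrete, here’s a toy numerical sketch (mine, not from the article or the video): an idealized N-chain with N steps, where the agent’s per-step success probability p improves smoothly. Scored all-or-nothing, the curve looks like a phase change; scored by expected steps completed, it’s gradual.

```python
# Toy illustration (hypothetical numbers, not from the article): an N-chain
# with N steps, where each step succeeds independently with probability p.
# The same smoothly-improving competence is scored two different ways.
N = 20
for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0]:
    reach_end = p ** N  # all-or-nothing: every one of the N steps must succeed
    # Partial credit: expected steps completed before the first failure,
    # capped at N.  E[min(G, N)] = sum_{k=1..N} p^k = p(1 - p^N)/(1 - p).
    mean_steps = N if p == 1.0 else p * (1 - p ** N) / (1 - p)
    print(f"p={p:.2f}  P(reach end)={reach_end:.4f}  E[steps]={mean_steps:.2f}")
```

P(reach end) sits near zero until p gets close to 1 and then shoots up, while E[steps] climbs steadily the whole way.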
Second: Axes Matter!
Grokking is a phenomenon that happens during training, so it shows up as a phase-change-like effect on training curves (performance vs. steps). The plots in the video, by contrast, show the results of many different models, each with its final score, plotted as scaling laws (performance vs. model size).
I think it’s important to look for these sorts of nonlinear progressions in research benchmarks, but performance-vs-model-size is a very different axis from performance-vs-steps.
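To illustrate the axes distinction (nothing here is real data, just made-up sigmoids): the left panel below is the space where grokking plots live (one model’s performance over training steps); the right panel is where scaling-law plots live (many separately-trained models’ final scores against parameter count).

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sigmoid curves purely to illustrate the two different x-axes;
# none of these numbers come from the article or the video.
steps = np.arange(0, 10_000)
one_run = 1 / (1 + np.exp(-(steps - 6_000) / 300))        # one model vs steps

sizes = np.logspace(6, 10, 9)                              # parameter counts
final = 1 / (1 + np.exp(-3 * (np.log10(sizes) - 8.5)))     # final score vs size

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(steps, one_run)
ax1.set(xlabel="training steps", ylabel="performance",
        title="one model, over training")
ax2.semilogx(sizes, final, "o-")
ax2.set(xlabel="model size (parameters)", ylabel="final performance",
        title="many models, final scores")
fig.tight_layout()
plt.show()
```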
Third: Grokking is about Train/Test differences
An important part of what’s going on when we see grokking is that train loss goes to zero first; then, much later (after many steps of training at a very small training loss), validation performance suddenly improves (validation loss goes down).
With benchmark evaluations like BIG-Bench, the entire evaluation is effectively a validation set, though I think we can treat the training set as having similar contents (assuming we train on a wide natural-language distribution).
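If you wanted to watch a single training run for this signature, a crude heuristic could look like the sketch below (the function name and thresholds are my own, purely hypothetical): “memorization” is train loss pinned near zero, and “grokking” is a large validation improvement arriving only after that point.

```python
def find_grokking(history, train_eps=1e-3, val_drop=0.5):
    """Crude heuristic for the grokking signature in one training run.

    history: list of (step, train_loss, val_loss) tuples, in step order.
    Returns the step where train loss first hit ~zero, and the step (if any)
    where validation loss later fell by `val_drop` relative to that point.
    """
    memorized_at = next(
        (step for step, train, _ in history if train < train_eps), None
    )
    if memorized_at is None:
        return None  # the run never fit the training set

    after = [(step, val) for step, _, val in history if step >= memorized_at]
    baseline_val = after[0][1]  # validation loss at memorization time
    grokked_at = next(
        (step for step, val in after if val < baseline_val * (1 - val_drop)),
        None,
    )
    return {"memorized_at": memorized_at, "grokked_at": grokked_at}
```

The point is just that grokking is defined relative to the gap between train and validation performance; a benchmark-only evaluation never sees that gap.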
Fourth: Relating this to the article
I think the high-level points in the article stand—it’s important to look for, and be wary of, sudden or unexpected improvements in performance. Even if scaling laws are nice predictive functions, they aren’t gears-level models, and we should be on the lookout for them to change. Phase-change-like behavior in evaluation benchmarks is exactly the kind of thing to watch for.
I think that’s enough for this comment. Elsewhere I should probably write up my mental models for what’s going on with grokking and why it happens.
Hi, I just wanted to say thanks for the comment / feedback. Yeah, I probably should have separated out the analysis of Grokking from the analysis of emergent behaviour during scaling. They are potentially related—at least for many tasks it seems Grokking becomes more likely as the model gets bigger. I’m guilty of actually conflating the two phenomena in some of my thinking, admittedly.
Your point about “fragile metrics” being more likely to show Grokking is great. I had a similar thought, too.