Tricks that work on smaller scales often don’t generalize to larger scales.
Tricks that work on larger scales often don’t work on smaller scales (due to bigger ML models having various novel emergent properties).
My understanding is that these two claims are mostly false in practice. In particular, there have been a few studies (like e.g. this) which try to run yesterday’s algorithms with today’s scale, and today’s algorithms with yesterday’s scale, in order to attribute progress to scale vs algorithmic improvements. I haven’t gone through those studies in very careful detail, but my understanding is that they pretty consistently find today’s algorithms outperform yesterday’s algorithms even when scaled down, and yesterday’s algorithms underperform today’s even when scaled up. So unless I’ve badly misunderstood those studies, the mental model in which different tricks work best on different scales is basically just false, at least at the range of different scales the field has gone through in the past ~decade.
That said, there are cases where I could imagine Ilya’s claim making sense, e.g. if the “experiments” he’s talking about are experiments in using the net rather than training the net. Certainly one can do qualitatively different things with GPT-4 than with GPT-2, so if one is testing e.g. a scaffolding setup or a net’s ability to play a particular game, then one needs to use the larger net. Perhaps that’s what Ilya had in mind?
I could imagine Ilya’s claim making sense, e.g. if the “experiments” he’s talking about are experiments in using the net rather than training the net
What I had in mind is something along these lines. More capable models[1] have various emergent properties. Specific tricks can rely on those properties being present, and so work better or worse depending on whether they are.
For example, the o-series training loop probably can’t actually “get off the ground” if the base model is only as smart as GPT-2: the model would ~never find its way to correct answers, so it’d never get reinforcement signals. You can still force it to work by sampling a billion guesses or by starting it on very easy problems (e.g., basic arithmetic?), but it’d probably deliver much less impressive results than it does when applied to GPT-4.
Scaling further down: I don’t recall whether GPT-2 can make productive use of CoTs, but presumably e.g. GPT-1 can’t. At that point, the whole “do RL on CoTs” approach ceases to be a meaningful thing to try.
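To make the cold-start point above concrete, here’s a minimal toy sketch (all numbers are hypothetical, chosen purely for illustration) of why a correctness-only reward gives a too-weak base model almost nothing to learn from:

```python
import random

def rollout_succeeds(per_sample_success_rate: float) -> bool:
    """Toy stand-in for 'sample a CoT and check whether the final answer is correct'.
    per_sample_success_rate is a hypothetical probability that the base model
    reaches a correct answer on its own."""
    return random.random() < per_sample_success_rate

def rewarded_rollouts(per_sample_success_rate: float, n_rollouts: int) -> int:
    """Count rollouts that earn a nonzero reward under a correctness-only reward,
    i.e. the only rollouts that carry any reinforcement signal."""
    return sum(rollout_succeeds(per_sample_success_rate) for _ in range(n_rollouts))

# Hypothetical per-sample success rates: a GPT-2-class base model essentially
# never stumbles onto a correct answer to a hard problem, so nearly every
# rollout gets zero reward and there's nothing to reinforce; a GPT-4-class
# base model succeeds often enough that the same budget yields a usable signal.
for label, p in [("weak base model", 1e-6), ("strong base model", 0.2)]:
    hits = rewarded_rollouts(p, n_rollouts=10_000)
    print(f"{label}: {hits} rewarded rollouts out of 10,000")
```

(Sampling a billion guesses or starting from trivially easy problems is exactly the workaround mentioned above: it artificially raises the per-sample success rate so that some reward signal exists at all.)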
Generalizing: At a lower level of capability, there’s presumably a ton of different tricks that each deliver a small bump to performance. Some of those tricks would have an effect size comparable to RL-on-CoTs-if-applied-at-that-scale. But out of that sea of tricks, only a few would be such that their effectiveness rises dramatically with scale.
So, a more refined way to make my points would be:
If a trick shows promise at a small capability level, e.g. improving performance by 10%, that doesn’t mean it’d show a similar 10% improvement if applied at a higher capability level.
(Say, because it addresses a deficiency that a big-enough model just doesn’t have, or one that a big-enough pretraining run solves by default.)
If a trick shows marginal/no improvement at a small capability level, that doesn’t mean it won’t show a dramatic improvement at a higher capability level.
a few studies (like e.g. this) which try to run yesterday’s algorithms with today’s scale, and today’s algorithms with yesterday’s scale
My guess, based on the above, would be that even if today’s algorithms perform better than yesterday’s algorithms at smaller scales, the difference between their small-scale capabilities is smaller than the difference between yesterday’s and today’s algorithms at bigger scales. I.e., some algorithms make nonlinearly better use of compute, such that figuring out which tricks are best is easier at larger scales. (Telling a 5% capability improvement apart from an 80% one.)
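As a toy illustration of that last point (the curve shapes and the noise level below are made up, not measurements from any study): if a trick improves how an algorithm scales with compute rather than adding a constant bump, the gap it opens up is easy to miss at small compute and hard to miss at large compute.

```python
def score_baseline(compute: float) -> float:
    """Hypothetical benchmark score of a baseline algorithm at a given compute budget."""
    return 10 * compute ** 0.30

def score_with_trick(compute: float) -> float:
    """Hypothetical score with a trick whose benefit compounds with scale
    (modeled here as a slightly better scaling exponent)."""
    return 10 * compute ** 0.35

RUN_TO_RUN_NOISE = 2.0  # hypothetical seed-to-seed spread in benchmark scores

for compute in [1e1, 1e3, 1e6]:
    gap = score_with_trick(compute) - score_baseline(compute)
    verdict = "clearly better" if gap > 2 * RUN_TO_RUN_NOISE else "lost in the noise"
    print(f"compute={compute:.0e}: gap={gap:8.2f} vs noise~{RUN_TO_RUN_NOISE} -> {verdict}")
```

On this picture, comparing candidate tricks at small scale mostly measures noise, while at large scale the trick that scales better stands out unambiguously.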
Whether they’re more capable by dint of being bigger (GPT-4), or being trained on better data (Sonnet 3.5.1), or having a better training loop + architecture (DeepSeek V3), etc.