I believe this article would benefit from some investigation of the NanoGPT speedrun: a challenge, running since May 2024, to train GPT-2 Small (124M) on a fixed dataset to a fixed target loss as fast as possible. As a starting point, you could check my comment on the topic from last month and reproduce the findings by T. Besiroglu and yours truly.
So as not to duplicate that comment while still adding something to what I have written on the topic, here is a three-paragraph summary of the trend-line analysis, followed by a sketch of how the break-point fit could be reproduced; note that progress in calendar time (as opposed to by record number) is very uneven:
Gemini’s summary of the QLR (Quandt Likelihood Ratio) analysis of the speedrun progression (written a month ago)
The “Flex” Point: The QLR test points to Record 12 as the most significant structural break. This coincides with the transition from dense causal attention to FlexAttention, which enabled much faster training by optimizing the attention mechanism.
Diminishing Returns: The slope is steeper in the first phase (-0.80) than in the second (-0.53). This indicates that early “low-hanging fruit” optimizations (like introducing the Muon optimizer and standardizing architecture) provided a faster rate of improvement per record than the later, more incremental system and hyperparameter tweaks.
Stability: After the “saturation” phase begins around Record 12–15, the progress remains remarkably consistent on a log-log scale, following a new, slightly shallower power law as contributors fought for smaller second-by-second gains.
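For anyone who wants to reproduce the break-point analysis, here is a minimal sketch of a QLR-style two-segment power-law fit. The record times below are synthetic placeholders shaped like the trend described above (slope -0.80 before a kink at record 12, -0.53 after), not the actual leaderboard values; swap in the real record list to rerun it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic placeholder data shaped like the described trend:
# a piecewise power law with a kink at record 12 (NOT the real leaderboard).
record_idx = np.arange(1, 25)
log_t = np.where(
    record_idx <= 12,
    np.log(45.0) - 0.80 * np.log(record_idx),
    np.log(45.0) - 0.80 * np.log(12) - 0.53 * (np.log(record_idx) - np.log(12)),
)
record_time = np.exp(log_t + rng.normal(0, 0.03, record_idx.size))  # minutes

x, y = np.log(record_idx), np.log(record_time)  # log-log space

def rss(x, y):
    """Residual sum of squares of an OLS line y ~ a + b*x."""
    b, a = np.polyfit(x, y, 1)
    return np.sum((y - (a + b * x)) ** 2)

def qlr_break(x, y, trim=3):
    """Quandt Likelihood Ratio: candidate break with the largest Chow F-statistic."""
    n, rss_pooled, best = len(x), rss(x, y), None
    for k in range(trim, n - trim):  # keep at least `trim` points per segment
        rss_split = rss(x[:k], y[:k]) + rss(x[k:], y[k:])
        f = ((rss_pooled - rss_split) / 2) / (rss_split / (n - 4))  # 2 extra params
        if best is None or f > best[1]:
            best = (k, f)
    return best

k, f_stat = qlr_break(x, y)
slope_early = np.polyfit(x[:k], y[:k], 1)[0]
slope_late = np.polyfit(x[k:], y[k:], 1)[0]
print(f"break before record {record_idx[k]}: F = {f_stat:.1f}, "
      f"slopes {slope_early:.2f} -> {slope_late:.2f}")
```

On the actual record list, if the description above holds, the maximal F-statistic should land near the FlexAttention record, with the fitted slopes close to the -0.80 and -0.53 figures quoted in the summary.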