It took people about 8 months to accelerate Andrej Karpathy’s PyTorch GPT-2 trainer from llm.c by 14x on a 124M parameter GPT-2.
The baseline is weak; the 8 months is largely just catching up to the present. They update the architecture (giving maybe a 4x compute multiplier) and shift to a more compute-optimal tokens/parameter ratio (a 1.5x multiplier). Maybe there is another 2x from the more obscure changes (which are still in the literature, so the big labs have the opportunity to measure how useful they are and select what works).
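A quick back-of-the-envelope check on how those rough multipliers compose (the 4x / 1.5x / 2x figures are the estimates above, not measured ablations):

```python
# Composing the rough compute multipliers guessed above.
arch_multiplier = 4.0    # architecture updates (estimate from the comment, not measured)
ratio_multiplier = 1.5   # more compute-optimal tokens/parameter ratio (estimate)
misc_multiplier = 2.0    # remaining, more obscure tweaks from the literature (estimate)

compound = arch_multiplier * ratio_multiplier * misc_multiplier
print(f"compound compute multiplier ~ {compound:.0f}x")  # ~12x, in the ballpark of the reported 14x
```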
It’s much harder to improve on GPT-4 or Llama-3 by that much.
What’s even more remarkable is that almost all of that acceleration is due to better sample efficiency, with the required training data dropping from 10 billion tokens to 0.73 billion tokens on the same training set with a fixed order of training tokens.
That’s just a consequence of the rules of the game: the number of model parameters isn’t allowed to change, so the only way to reduce training FLOPs (while preserving perplexity) is to reduce the amount of data. This also incidentally moves the tokens/parameter ratio toward optimal, though at 0.73B tokens it already overshoots, turning the initially overtrained 10B-token setup into a slightly undertrained one.
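The arithmetic behind the "overshoots" remark, assuming the commonly cited ~20 tokens/parameter rule of thumb for compute-optimal training (an assumption of this sketch, not something stated above):

```python
# Tokens-per-parameter ratios for the two data budgets.
params = 124e6           # GPT-2 small, fixed by the rules of the speedrun
chinchilla_ratio = 20    # commonly cited compute-optimal tokens/parameter (assumption)

for tokens in (10e9, 0.73e9):
    ratio = tokens / params
    regime = "overtrained" if ratio > chinchilla_ratio else "undertrained"
    print(f"{tokens / 1e9:.2f}B tokens -> {ratio:.1f} tokens/param ({regime})")

# 10.00B tokens -> ~80.6 tokens/param (well above ~20: overtrained)
# 0.73B tokens  -> ~5.9 tokens/param (below ~20: slightly undertrained)
```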