The alternative hypothesis does need to be said, especially after someone at a party outright claimed it was obviously true, and given the general consensus that the previous export controls were not all that tight. That alternative hypothesis is that DeepSeek is lying and actually used a lot more compute, and chips it isn’t supposed to have. I can’t rule it out.
Re DeepSeek cost-efficiency, we are seeing more claims pointing in that direction.
But all this does seem to be well within what’s possible. Consider the famous ongoing competition at https://github.com/KellerJordan/modded-nanogpt: it took people about 8 months to accelerate Andrej Karpathy’s PyTorch GPT-2 trainer from llm.c by 14x on a 124M parameter GPT-2 (what’s even more remarkable is that almost all that acceleration is due to better sample efficiency with the required training data dropping from 10 billion tokens to 0.73 billion tokens on the same training set with the fixed order of training tokens).
Some of the techniques used by the community pursuing this might not scale to really large models, but most of them probably would (as we see in their mid-Oct experiment, which demonstrated that what was then a 3-4x acceleration carried over to the 1.5B version).
So when an org claims a 10x-20x efficiency jump compared to what it presumably took a year or more ago, I am inclined to say, “why not, and probably the leaders are also in possession of similar techniques now, even if they are less pressed by compute shortage”.
The real question is how fast these numbers will continue to go down for similar levels of performance… It has been very expensive to be the very first org achieving a given new level, but the cost seems to be dropping rapidly for the followers...
it took people about 8 months to accelerate Andrej Karpathy’s PyTorch GPT-2 trainer from llm.c by 14x on a 124M parameter GPT-2
The baseline is weak; the 8 months is just catching up to the present. They update the architecture (giving maybe a 4x compute multiplier) and shift to a more compute-optimal tokens/parameter ratio (a 1.5x multiplier). Maybe there is another 2x from the more obscure changes (which are still in the literature, so the big labs have the opportunity to measure how useful they are and select what works).
It’s much harder to improve on GPT-4 or Llama-3 that much.
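As a rough sanity check of the multipliers above, here is a minimal Python sketch. It assumes the three estimates are independent and simply multiply, which is an assumption of this sketch rather than something the competition reports. The numbers are the rough guesses from the paragraph above, not measurements, and the dictionary labels are only illustrative.

```python
from math import prod

# Rough multipliers quoted above; all are estimates, not measured values.
multipliers = {
    "architecture update": 4.0,
    "more compute-optimal tokens/parameter ratio": 1.5,
    "more obscure changes": 2.0,
}

# Assumption of this sketch: the gains are independent, so they multiply.
combined = prod(multipliers.values())

reported_speedup = 14.0  # the "14x" figure quoted from the competition

print(f"combined estimate: {combined:.0f}x, reported: {reported_speedup:.0f}x")
```

Under that assumption the guesses compose to about 12x, the same ballpark as the reported 14x.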
what’s even more remarkable is that almost all that acceleration is due to better sample efficiency with the required training data dropping from 10 billion tokens to 0.73 billion tokens on the same training set with the fixed order of training tokens
That’s just the rules of the game: the number of model parameters isn’t allowed to change, so in order to reduce training FLOPs (while preserving perplexity) they reduce the amount of data. This also incidentally improves the optimality of the tokens/parameter ratio, though at 0.73B tokens it already overshoots, turning the initially overtrained 10B-token model into a slightly undertrained one.
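The tokens/parameter arithmetic behind this can be sketched in a few lines of Python. Two things here are assumptions not stated in the thread: the common rule of thumb of roughly 20 tokens per parameter for compute-optimal training, and the usual approximation of training FLOPs as 6 × parameters × tokens. The true optimum depends on the data and architecture, so treat the comparison as indicative only.

```python
N_PARAMS = 124e6         # 124M parameter GPT-2 (fixed by the rules)
TOKENS_BEFORE = 10e9     # 10 billion tokens in the original run
TOKENS_AFTER = 0.73e9    # 0.73 billion tokens after the speedups
ASSUMED_OPTIMAL_RATIO = 20.0  # assumption: rule-of-thumb tokens per parameter


def tokens_per_param(tokens: float, params: float = N_PARAMS) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def train_flops(tokens: float, params: float = N_PARAMS) -> float:
    """Approximate training compute, assuming FLOPs ~ 6 * params * tokens."""
    return 6.0 * params * tokens


ratio_before = tokens_per_param(TOKENS_BEFORE)
ratio_after = tokens_per_param(TOKENS_AFTER)
flops_reduction = train_flops(TOKENS_BEFORE) / train_flops(TOKENS_AFTER)

print(f"tokens/param before: {ratio_before:.1f}")
print(f"tokens/param after:  {ratio_after:.1f}")
print(f"training FLOPs reduction: {flops_reduction:.1f}x")
```

With the parameter count fixed, the FLOPs reduction equals the token reduction (about 13.7x), which accounts for almost all of the 14x. Relative to the assumed ratio of 20, the run moves from well above it to below it.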
In a similarly unverified claim, Kai-Fu Lee, the founder of 01.ai (and well known in the US, according to https://en.wikipedia.org/wiki/Kai-Fu_Lee), seems to be saying that the training cost of their Yi-Lightning model was only 3 million dollars or so. Yi-Lightning is a very strong model released in mid-Oct-2024; when comparing it to DeepSeek-V3, one might want to check the “math” and “coding” subcategories on https://lmarena.ai/?leaderboard. The sources for the cost claim are https://x.com/tsarnick/status/1856446610974355632 and https://www.tomshardware.com/tech-industry/artificial-intelligence/chinese-company-trained-gpt-4-rival-with-just-2-000-gpus-01-ai-spent-usd3m-compared-to-openais-usd80m-to-usd100m, and we should probably take this with a similar grain of salt.