These are somewhat awkward benchmarks because they don’t actually measure downstream usefulness at software engineering or AI research. In particular, these tasks might not measure improvements in RL, which can have huge effects on usefulness and which has seen fast algorithmic progress.
Can we instead use SWE-bench or METR’s task suite?
For instance, here is a proposed bet:
GPT-4 was released in March 2023 (2 years ago). So, we’d expect a model which used 10x less FLOP to perform similarly well (or better) on agentic tasks (like SWE-bench or METR’s task suite).
Oh wait, there already is such a model! DeepSeek-V3 / R1 is IMO clearly better than GPT-4 on these tasks (and other tasks) while using <1/10 of GPT-4’s FLOP and being released within 2 years. So bet resolved?
Edit: more like 6x less FLOP actually, so this is a bit messy and would need to lean on better performance. People don’t seem to bother training compute-optimal models with ~10x less FLOP than GPT-4 these days...
Actually, I think DeepSeek-V3 also does better than GPT-4 on MMLU, though we can’t compare perplexity. So, seems ~resolved either way, at least for progress in the last 2 years and if you’re fine with assuming that DeepSeek-V3 isn’t rigged or using distillation.
The delta is much more extreme if instead of looking at software engineering you look at competitive programming or math.
Do you have a source? Everything I can find implies approximately equal FLOPs in end-to-end training costs for GPT-4 and R1 / DeepSeek-V3.
Sure. Epoch estimates 2e25 FLOP for GPT-4 and 3.4e24 for DeepSeek-V3. So a bit under 6x rather than 10x, but in the ballpark. (And V3 is substantially better.) R1 is around 1/6 of DeepSeek-V3 cost.
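To make the arithmetic explicit, here is a quick sketch using the Epoch estimates quoted above; the 1.75-year gap (March 2023 GPT-4 release to the late-2024 V3 release) is my approximation of the window used later in this thread.

```python
# Quick check of the compute ratio implied by the Epoch estimates quoted above.
gpt4_flop = 2e25    # Epoch's estimate for GPT-4 training compute
v3_flop = 3.4e24    # Epoch's estimate for DeepSeek-V3 training compute

ratio = gpt4_flop / v3_flop
print(f"GPT-4 / V3 training-compute ratio: {ratio:.1f}x")  # ~5.9x

# Annualized efficiency gain if V3 matched GPT-4 roughly 1.75 years later
# (assumption: GPT-4 released March 2023, V3 released December 2024).
years = 1.75
print(f"Implied gain: {ratio ** (1 / years):.2f}x per year")  # ~2.75x per year
```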
R1 can’t possibly be below V3’s cost, because R1’s cost is inclusive of V3’s? If I’m not mistaken, R1 is not trained from scratch, but I could be wrong.
Second, GPT-4 is not a compute-efficient model, afaik, which is why Chinchilla was my choice rather than a random big model. Furthermore, V3 does not come all that close to meeting the requirement for how many fewer FLOPs a model needs to hit the 3.5x-per-year reduction (it is <6x smaller, when 3.5x per year over 1.75 years implies about 9x smaller), let alone the 4.6x-per-year reduction the forecaster models imply.
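For reference, a minimal sketch of the reduction factors those per-year rates imply over the same 1.75-year window:

```python
# Reduction factors implied by the quoted per-year rates over a 1.75-year window.
years = 1.75

for per_year in (3.5, 4.6):
    required = per_year ** years
    print(f"{per_year}x/year over {years} years -> ~{required:.0f}x less compute")
# 3.5x/year -> ~9x, 4.6x/year -> ~14x; the observed GPT-4 -> V3 gap is <6x.
```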
So even if you try to find a best-case scenario for a jump in compute efficiency that breaks the general failure to hit a consistent trend over timelines longer than 1.5 years (which is the crux), it does not meet the standard. And it is commonly believed that V3 itself took advantage of undisclosed distillation, although I won’t press that.
So we have GPT-4, a non-efficiency-frontier model, vs. a questionably independent V3 that does not even hit the claimed target and that also represents the biggest compute-efficiency upgrade of the last two years (or at least R1 might).
I don’t see how my bet as stated is likely to fall anytime soon.
Yes, I meant 1/6 additional cost, which is ~negligible.
Importantly, it is much better than GPT-4 on the relevant downstream tasks.
Yes, and also GPT-4 is nowhere close to compute-efficient?
Edit: the entire point is that we have never seen compute-efficiency gains that are reliable over the sorts of timelines assumed in these models. I have offered a bet to prove that, and finding counterexample model pairs where it may fail is nothing like finding a substantive reason I am wrong to propose the bet.
Edit 2: with regard to your [?], I sincerely do not think the burden of proof is on me to demonstrate that a model is not compute-efficient when I have already committed to monetary bets on models I believe are. If it is compute-efficient according to even Kaplan or Chinchilla scaling laws, please demonstrate that for me. I did not bring it up as a compute-efficient model; you did!
If my belief were that we never cross that threshold, I would not be citing a paper that includes a figure explicitly showing that threshold being crossed repeatedly. My point is that counting it as a long-term trend is indefensible.
We only have leaked numbers, but those suggest reasonably efficient training, and GPT-4 is widely believed to have been quite an efficient model for its time; notably, it wasn’t matched by competitors for a while.
Your source specifically says it is far overtrained relative to compute-optimal scaling laws?
“This is why it makes sense to train well past Chinchilla optimal for any model that will be deployed.”
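For context on “well past Chinchilla optimal”: a minimal sketch of how such a check goes, using the common ~20-tokens-per-parameter heuristic; the parameter and token counts below are placeholders for illustration, not claims about GPT-4 or any other model.

```python
# Rough Chinchilla-optimality check, using the ~20-tokens-per-parameter heuristic.
# The counts below are hypothetical placeholders, not estimates for any real model.
params = 70e9    # hypothetical parameter count
tokens = 5e12    # hypothetical training-token count

chinchilla_optimal_tokens = 20 * params
overtraining_ratio = tokens / chinchilla_optimal_tokens
print(f"Trained on ~{overtraining_ratio:.1f}x the Chinchilla-optimal token budget")
# A ratio well above 1 is what "trained well past Chinchilla optimal" refers to in
# the quote above: extra training tokens buy a smaller, cheaper-to-serve model at
# the cost of a less compute-efficient training run.
```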