R1 can’t possibly be below V3’s cost, because its cost is inclusive of V3’s? If I’m not mistaken, R1 is not trained from scratch but on top of V3, though I could be wrong.
Second, GPT-4 is not a compute-efficient model, afaik, which is why Chinchilla was my choice rather than a random big model. Furthermore, V3 does not come all that close to the FLOP reduction required to hit a 3.5x-per-year efficiency gain (it is <6x smaller, when 3.5x per year over 1.75 years implies roughly 9x smaller), let alone the 4.6x-per-year gain the forecasters’ models imply.
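Spelling out that arithmetic as a quick sketch (the 1.75-year gap and the <6x figure are the rough numbers I’m working from above, so treat them as approximate):

```python
# Required compute-efficiency reduction over the gap between the two training runs,
# versus the observed reduction cited above.
years = 1.75  # rough gap between the two models' training runs

for annual_gain in (3.5, 4.6):
    required = annual_gain ** years
    print(f"{annual_gain}x/year over {years} years implies ~{required:.1f}x fewer FLOPs")

observed_upper_bound = 6  # V3 is reportedly <6x smaller in training FLOPs
print(f"Observed reduction: <{observed_upper_bound}x, short of either target")
```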
So even if you reach for the best-case jump in compute efficiency to break the general failure to sustain a consistent trend over timelines longer than 1.5 years (which is the crux), it still does not meet the standard. And it is commonly believed that V3 itself took advantage of undisclosed distillation, although I won’t press that point.
So we have GPT-4, a non-efficiency-frontier model, vs. a questionably independent V3 that does not even hit the claimed target, and which also represents the biggest compute-efficiency upgrade of the last two years (or at least R1 might).
I don’t see how my bet as stated is likely to fall anytime soon.
Yes, I meant 1/6 additional cost, which is ~negligible.
Importantly, it is much better than GPT-4 on the relevant downstream tasks.
Yes, and also GPT-4 is nowhere close to compute-efficient?
Edit: the entire point is that we have never seen compute-efficiency gains that hold reliably over the timescales assumed in these models. I have offered a bet on exactly that, and finding counterexample model pairs where it might fail is nothing like giving a substantive reason the bet is wrong to propose.
Edit 2: with regard to your [?], I sincerely do not think the burden of proof is on me to demonstrate that a model is not compute-efficient when I have already committed to monetary bets on models I believe are. If it is compute-efficient according to even Kaplan or Chinchilla scaling laws, please demonstrate that for me. I did not bring it up as a compute-efficient model; you did!
If my belief were that we never cross that threshold, I would not be citing a paper that includes a figure explicitly showing that threshold being crossed repeatedly. My point is that counting it as a long-term trend is indefensible.
We only have leaked numbers confirming reasonably efficient training, but GPT-4 is widely believed to have been quite an efficient model for its time, and notably wasn’t matched by competitors for a while.
Your source specifically says it is far overtrained relative to compute-optimal scaling laws?
“This is why it makes sense to train well past Chinchilla optimal for any model that will be deployed.”
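To make concrete what “well past Chinchilla optimal” means, here is a rough sketch using the common ~20-tokens-per-parameter heuristic and the C ≈ 6ND approximation (both simplifications from the Chinchilla literature; the compute budget below is purely hypothetical):

```python
# Rough Chinchilla-style accounting: training FLOPs C ~= 6 * N * D
# (N = parameters, D = tokens), with compute-optimal training at roughly
# D ~= 20 * N tokens per parameter.

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly spend `compute_flops` at the optimal ratio."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

C = 1e24  # hypothetical training budget in FLOPs
n_opt, d_opt = chinchilla_optimal(C)
print(f"Compute-optimal: ~{n_opt:.2e} params on ~{d_opt:.2e} tokens")

# An "overtrained" deployment-oriented run: fewer parameters, many more tokens,
# same total training compute, cheaper inference per query.
n_small = n_opt / 4
d_over = C / (6 * n_small)
print(f"Overtrained:     ~{n_small:.2e} params on ~{d_over:.2e} tokens "
      f"(~{d_over / n_small:.0f} tokens/param)")
```

The point of the quoted advice is the second configuration: for a model that will be deployed at scale, spending the same training compute on a smaller, longer-trained model trades a bit of training-time optimality for much cheaper inference.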