Yes, and also GPT-4 is nowhere close to compute-efficient?
Edit: The entire point is that we have never seen computational efficiency gains that hold reliably over the timescales assumed in these models. I have offered a bet on exactly that claim, and finding counterexamples of model pairs where the trend may fail is nothing like finding a substantive reason I am wrong to propose the bet.
Edit 2: With regard to your [?], I sincerely do not think the burden of proof is on me to demonstrate that a model is not compute-efficient when I have already committed to monetary bets on models I believe are. If it is compute-efficient according to even the Kaplan or Chinchilla scaling laws, please demonstrate that for me. I did not bring it up as a compute-efficient model; you did!
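To be concrete about what such a demonstration would look like: the usual Chinchilla rule of thumb is roughly 20 training tokens per parameter at compute-optimality, so a crude first check is just the tokens-to-parameters ratio. A minimal sketch, with entirely hypothetical numbers rather than figures for any real model:

```python
# A crude Chinchilla-style compute-optimality check. The ~20 tokens-per-parameter
# ratio is the commonly cited Chinchilla rule of thumb; the example numbers below
# are hypothetical placeholders, not training figures for any particular model.

CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal tokens-to-params ratio


def training_flops(params: float, tokens: float) -> float:
    """Standard ~6ND estimate of dense-transformer training FLOPs."""
    return 6 * params * tokens


def overtraining_factor(params: float, tokens: float) -> float:
    """How far past (or short of) the Chinchilla-optimal token budget a run is."""
    return tokens / (CHINCHILLA_TOKENS_PER_PARAM * params)


if __name__ == "__main__":
    # Hypothetical run: 70e9 parameters trained on 5e12 tokens.
    params, tokens = 70e9, 5e12
    print(f"training FLOPs ~ {training_flops(params, tokens):.2e}")
    print(f"overtraining factor ~ {overtraining_factor(params, tokens):.1f}x the Chinchilla-optimal token budget")
```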
If my belief were that we never cross that threshold, I would not be citing a paper that includes a figure explicitly showing that threshold being crossed repeatedly. My point is that treating it as a long-term trend is indefensible.
Importantly, it is much better than GPT-4 on the relevant downstream tasks.
We only have leaked numbers confirming reasonably efficient training, but GPT-4 is widely believed to have been quite an efficient model for its time, and notably wasn't matched by competitors for a while.
Your source specifically says it is far overtrained relative to compute-optimal scaling laws?
“This is why it makes sense to train well past Chinchilla optimal for any model that will be deployed.”
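For what it's worth, the standard reasoning behind that line is amortizing inference cost: a smaller model trained well past compute-optimal can be cheaper over its deployed lifetime than a larger compute-optimal model of similar quality. A minimal sketch of that accounting, with made-up numbers and assuming (as the quoted claim does) that the two configurations reach comparable quality:

```python
# Rough illustration of the amortization argument for training past Chinchilla
# optimal: when enough tokens will be served at inference time, a smaller model
# trained on more tokens can cost fewer lifetime FLOPs than a larger
# compute-optimal model. Assumes the two configurations reach comparable
# quality; all numbers are hypothetical.

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens          # ~6ND training-cost estimate


def inference_flops(params: float, served_tokens: float) -> float:
    return 2 * params * served_tokens   # ~2N FLOPs per generated token


def lifetime_flops(params: float, train_tokens: float, served_tokens: float) -> float:
    return training_flops(params, train_tokens) + inference_flops(params, served_tokens)


if __name__ == "__main__":
    served = 1e13  # hypothetical total tokens served over the deployment

    big_optimal = lifetime_flops(70e9, 1.4e12, served)       # ~Chinchilla-optimal config
    small_overtrained = lifetime_flops(13e9, 5e12, served)   # trained well past optimal

    print(f"larger compute-optimal model: {big_optimal:.2e} lifetime FLOPs")
    print(f"smaller overtrained model:    {small_overtrained:.2e} lifetime FLOPs")
```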