If it is compute-efficient according to even the Kaplan or Chinchilla scaling laws, please demonstrate that for me.
We only have leaked numbers, which point to reasonably efficient training, but GPT-4 is widely believed to have been quite an efficient model for its time, and it notably wasn't matched by competitors for a while.
Your source specifically says it is far overtrained relative to compute-optimal scaling laws:
“This is why it makes sense to train well past Chinchilla optimal for any model that will be deployed.”
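For concreteness, here is a minimal sketch of the Chinchilla arithmetic, assuming the widely circulated leaked GPT-4 figures (roughly 280B active parameters per forward pass and roughly 13T training tokens). These numbers are rumors, not anything OpenAI has confirmed, and using *active* parameters for a mixture-of-experts model is itself an assumption. It uses the standard rules of thumb that compute-optimal token count is about 20 tokens per parameter and training compute is about C ≈ 6ND FLOPs:

```python
# Hedged sketch: Chinchilla overtraining check under RUMORED GPT-4 numbers.
# ~280e9 active params and ~13e12 tokens are leaked/unconfirmed figures.

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb: D_opt ~= 20 * N

def chinchilla_report(n_params: float, n_tokens: float) -> None:
    d_opt = TOKENS_PER_PARAM * n_params  # compute-optimal token count
    flops = 6 * n_params * n_tokens      # standard C ~= 6 * N * D estimate
    ratio = n_tokens / d_opt             # > 1 means trained past Chinchilla optimal
    print(f"Chinchilla-optimal tokens: {d_opt:.2e}")
    print(f"Actual tokens:             {n_tokens:.2e} ({ratio:.1f}x optimal)")
    print(f"Estimated training FLOPs:  {flops:.2e}")

chinchilla_report(n_params=280e9, n_tokens=13e12)
# Chinchilla-optimal tokens: 5.60e+12
# Actual tokens:             1.30e+13 (2.3x optimal)
# Estimated training FLOPs:  2.18e+25
```

If those leaked figures are even roughly right, GPT-4 was trained on about 2x the Chinchilla-optimal token count, which is consistent with the quote: overtraining is deliberately compute-suboptimal for training but pays off at inference time, since a smaller, longer-trained model is cheaper to deploy.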