Your source specifically says it is far overtrained relative to compute-optimal scaling laws?
“This is why it makes sense to train well past Chinchilla optimal for any model that will be deployed.”