The question is why existing models are undertrained (too little data relative to too many parameters). Either it is because the old scaling-laws paper overestimated the importance of parameters, or it is because scaling up data is actually harder than scaling up parameters.
It was the former. All of those models were trained in the <1 epoch regime, so they didn’t even use all of the data they already had, much less the data they could have collected before reaching the point where the marginal gain from another unit of data equals the marginal gain from another unit of compute.
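To make that marginal-gain trade-off concrete, here is a rough sketch. It assumes the standard approximation for training cost, C ≈ 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio of roughly 20 tokens per parameter reported in the Chinchilla paper (Hoffmann et al., 2022); both numbers are outside assumptions, not claims from this post.

```python
# Sketch: split a fixed FLOP budget between parameters (N) and tokens (D),
# assuming C ~ 6 * N * D and a compute-optimal ratio of D/N ~ 20
# (Chinchilla-style figures; treat both as rough approximations).

def optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) such that tokens / params == tokens_per_param
    and 6 * params * tokens == compute_flops."""
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a GPT-3-scale budget of ~3.1e23 FLOPs (roughly 175B params x 300B
# tokens x 6). The same compute, reallocated, favors far fewer parameters
# trained on far more data.
n, d = optimal_allocation(3.1e23)
print(f"params = {n / 1e9:.0f}B, tokens = {d / 1e9:.0f}B")
# -> roughly 51B params on roughly 1,000B tokens
```

Under these assumptions, the pre-Chinchilla models sit well to the parameter-heavy side of that split, which is consistent with them not having exhausted even the data already on hand.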