Regarding the “Training Compute-Optimal Large Language Models” paper. Whether you should lengthen or shorten timelines depends on which is easier to scale up: number of parameters or data. The paper says that the number of parameters is less important than previously thought. The question is why existing models are undertrained (use too little data/too many parameters). Either it is because the old scaling laws paper overestimated the importance of parameters, or it is because scaling up data is actually harder than scaling up parameters. If the latter is the case, this would lengthen timelines, not shorten them.
It was the former. All of those models were trained in the <1 epoch regime, so they didn’t even use all of the data they already had (much less the data they could have collected before reaching marginal gain parity, the point where spending resources on another unit of compute and on another unit of data yield the same loss reduction).
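The “marginal gain parity” point can be made concrete with the parametric loss the paper fits: L(N, D) = E + A/N^a + B/D^b, with compute C ≈ 6·N·D FLOPs. The sketch below uses the approximate Approach 3 constants from the paper; it is an illustration of the optimality condition, not the paper’s exact fitting procedure, and the numeric values should be treated as rough.

```python
# Approximate constants from the Chinchilla paper's parametric fit (Approach 3):
# L(N, D) = E + A / N**a + B / D**b, with training compute C ~ 6 * N * D FLOPs.
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Predicted training loss for N parameters and D training tokens."""
    return E + A / N**a + B / D**b

def optimal_N(C):
    """Closed-form compute-optimal parameter count for a FLOP budget C.

    Substituting D = C / (6 * N) into the loss and setting the derivative
    with respect to N to zero gives N_opt proportional to C**(b / (a + b)).
    This is exactly the marginal-gain-parity point: the last FLOP spent on
    parameters buys the same loss reduction as the last FLOP spent on data.
    """
    k = (a * A / (b * B * 6**b)) ** (1 / (a + b))
    return k * C ** (b / (a + b))

C = 5.76e23                 # roughly Chinchilla's training budget in FLOPs
N = optimal_N(C)
D = C / (6 * N)             # tokens implied by the budget and optimal N
# At the optimum, shifting the budget toward either more parameters or
# more data (holding C fixed) makes the predicted loss worse:
assert loss(N, D) < loss(1.1 * N, C / (6 * 1.1 * N))
assert loss(N, D) < loss(0.9 * N, C / (6 * 0.9 * N))
```

Because b/(a+b) ≈ 0.45 under these constants, the optimal parameter count grows roughly as the square root of compute, which is why the earlier models, sized under the old scaling laws, come out overparameterized and undertrained relative to this fit.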