[Question] Updates on scaling laws for foundation models from 'Transcending Scaling Laws with 0.1% Extra Compute'

I'm not sure if this paper is flying under the radar, but has anyone read 'Transcending Scaling Laws with 0.1% Extra Compute'? If so, how do you think it compares to the scaling laws presented in DeepMind's 'An empirical analysis of compute-optimal large language model training'? Does it make you rethink the importance of dataset size (again)?
