Section 13 (page 47) discusses data/compute scaling and the comparison to Chinchilla.
Some findings:
PaLM 540B uses ~4.3× more training compute than Chinchilla, and outperforms Chinchilla on downstream tasks.
PaLM 540B is massively undertrained relative to the data-scaling laws discovered in the Chinchilla paper (unsurprisingly: training a 540B-parameter model on a compute-optimal number of tokens would be very expensive). A rough back-of-the-envelope check follows the list.
within the set of (Gopher, Chinchilla, and the three sizes of PaLM), the total amount of training compute predicts performance on downstream tasks pretty well (a log-linear relationship; a sketch of such a fit appears after this list). Gopher underperforms a bit.
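As a sanity check on the ~4.3× figure, here is a minimal sketch using the standard C ≈ 6ND FLOP approximation. The parameter/token counts (PaLM 540B on ~780B tokens, Chinchilla 70B on ~1.4T tokens) and the ~20 tokens-per-parameter rule of thumb are assumptions drawn from the respective papers, not from this section.

```python
# Back-of-the-envelope check, assuming the standard C ~= 6 * N * D
# approximation (C: training FLOPs, N: parameters, D: training tokens).

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs via C ~= 6*N*D."""
    return 6.0 * n_params * n_tokens

palm_flops = train_flops(540e9, 780e9)        # PaLM 540B, ~780B tokens
chinchilla_flops = train_flops(70e9, 1.4e12)  # Chinchilla 70B, ~1.4T tokens

print(f"PaLM / Chinchilla compute ratio: {palm_flops / chinchilla_flops:.1f}x")
# -> ~4.3x, matching the figure quoted above.

# How undertrained is PaLM 540B in Chinchilla terms? Using the
# ~20 tokens-per-parameter rule of thumb from the Chinchilla paper:
optimal_tokens = 20 * 540e9  # ~10.8T tokens
print(f"Chinchilla-optimal tokens for 540B params: {optimal_tokens:.2e}")
print(f"actual / optimal: {780e9 / optimal_tokens:.2f}")  # ~0.07, i.e. ~14x short
```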
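To illustrate what the log-linear relationship means concretely, here is a minimal fitting sketch. The compute and accuracy values below are made-up placeholders (the section doesn't reproduce the underlying numbers), so only the procedure is meaningful: regress downstream accuracy on log10 of training compute and inspect residuals.

```python
import numpy as np

# Hypothetical (training FLOPs, downstream accuracy) points standing in
# for Gopher, Chinchilla, and the three PaLM sizes; placeholder values.
compute = np.array([6.3e23, 5.9e23, 2.6e22, 3.0e23, 2.5e24])
accuracy = np.array([0.56, 0.61, 0.42, 0.58, 0.69])

# "Log-linear" here means: accuracy ~= a * log10(compute) + b.
a, b = np.polyfit(np.log10(compute), accuracy, deg=1)
print(f"fit: accuracy ~= {a:.3f} * log10(C) + {b:.3f}")

# Residuals show which models sit above or below the compute trend
# (in the paper's comparison, Gopher sits a bit below the line).
residuals = accuracy - (a * np.log10(compute) + b)
names = ["Gopher", "Chinchilla", "PaLM-8B", "PaLM-62B", "PaLM-540B"]
for name, r in zip(names, residuals):
    print(f"{name:>11s}: residual {r:+.3f}")
```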