Thanks for your feedback; I incorporated some of it in my rewrite (it’s now version 2). In particular, I appreciate the data showing FLOP utilization staying (roughly) constant, and the idea that there’s a red-queen race against communication overhead etc. And I added some of those examples from DeepSeek & Kimi in the appropriate sections. Thanks!
…But I do want to push back on your suggestion that your HellaSwag plot implies what you think it implies.
The hypothesis that Gopher is better than the other two mainly because of better training data seems totally viable to me. For example, Gopher trained on 20× more books, presumably due to Google’s mountain of proprietary scanned book data. (Gopher trained on MassiveText, which has 4M books adding up to 2.1 TB of text (27% sampling proportion), while the other two used The Pile, which has 100.96 GiB of book text.) Books are probably very important, but the datasets differ in other ways too. MassiveText has 2.7 TB of news (10% sampling proportion), presumably from decades of Google News, whereas The Pile seems to have only whatever news articles showed up in the general web scrape. Etc.
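(For what it’s worth, the “20×” is just the ratio of those two reported corpus sizes; a quick back-of-the-envelope check, assuming the 2.1 TB figure is decimal TB and 100.96 GiB is the right Pile number to compare against:)

```python
# Rough sanity check on the "20x more books" claim, comparing the reported
# book-corpus sizes: MassiveText books (2.1 TB) vs. The Pile's books (100.96 GiB).
massivetext_books_gib = 2.1e12 / 2**30   # 2.1 TB (decimal) expressed in GiB, ~1956 GiB
pile_books_gib = 100.96                  # The Pile's book text, as reported

print(f"ratio ~ {massivetext_books_gib / pile_books_gib:.1f}x")  # ~19.4x, i.e. roughly 20x
```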
The hypothesis that Gopher is better mainly because DeepMind has more secret sauce or better-tuned hyperparameters or whatever also seems totally viable, as far as I know.
So I don’t think this is very strong evidence either way, and indeed if anything I would suggest that it’s pushing a bit in the direction of data over algorithms, especially given that Gopher was earlier. Right? Sorry if I’m misunderstanding.