Vladimir_Nesov comments on Daniel Tan’s Shortform

Vladimir_Nesov 11 Feb 2025 2:07 UTC
6 points

almost no difference between 180b vs 800b model, when r=1(table 4)

It’s a 3B parameter model, so training it on 180B tokens already overtrains it by maybe 3x, and training it on 800B tokens overtrains it by about 13x. The loss of compute efficiency from the latter is about 1.6x greater than from the former. With 4.4x more raw compute, the 800B-token run should therefore have about 2.7x more effective compute, that is, it should act like a compute optimal model that’s 1.6x larger and trained on 1.6x more tokens. So the distinction is smaller than 180 vs. 800 suggests.
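The arithmetic above can be retraced with a rough sketch. It assumes a compute optimal ratio of about 20 tokens per parameter (the usual Chinchilla rule of thumb, which the comment does not state explicitly), training compute proportional to parameters times tokens, and a compute optimal frontier where parameters and tokens each grow as the square root of compute. The 1.6x efficiency-loss ratio is taken from the comment as given, not derived here, and all names are hypothetical.

```python
import math

PARAMS = 3e9          # model size from the comment
TOKENS_SHORT = 180e9  # shorter training run
TOKENS_LONG = 800e9   # longer training run

# Assumption: ~20 tokens per parameter is compute optimal (rule of thumb).
OPTIMAL_TOKENS_PER_PARAM = 20.0

# Taken from the comment as given: the longer run loses about 1.6x more
# compute efficiency to overtraining than the shorter run does.
EFFICIENCY_LOSS_RATIO = 1.6


def overtraining_factor(tokens, params, optimal_ratio=OPTIMAL_TOKENS_PER_PARAM):
    """How many times more tokens were used than the assumed optimum."""
    return tokens / (optimal_ratio * params)


over_short = overtraining_factor(TOKENS_SHORT, PARAMS)
over_long = overtraining_factor(TOKENS_LONG, PARAMS)

# With the model size fixed, raw compute scales with the token count.
raw_compute_ratio = TOKENS_LONG / TOKENS_SHORT

# Discount the raw compute advantage by the extra efficiency loss.
effective_compute_ratio = raw_compute_ratio / EFFICIENCY_LOSS_RATIO

# Assumption: on the compute optimal frontier, parameters and tokens each
# scale as sqrt(compute), so both grow by the same factor.
scale_ratio = math.sqrt(effective_compute_ratio)

print(f"overtraining: {over_short:.1f}x vs. {over_long:.1f}x")
print(f"raw compute ratio: {raw_compute_ratio:.2f}x")
print(f"effective compute ratio: {effective_compute_ratio:.2f}x")
print(f"equivalent size and token scale-up: {scale_ratio:.2f}x each")
```

Under these assumptions the results land close to the rounded figures in the comment, with the effective gap between the two runs well below the 4.4x gap in token counts.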