Almost no difference between the 180B and 800B models when r=1 (Table 4).
It’s a 3B-parameter model, so training it on 180B tokens already overtrains it by maybe 3x, and training it on 800B tokens overtrains it by about 13x. The loss of compute efficiency from the latter is about 1.6x greater than from the former. With 4.4x more raw compute (800 / 180), the 800B run should therefore have about 2.7x more effective compute (4.4 / 1.6), so it should act like a compute-optimal model that’s 1.6x larger and trained on 1.6x more tokens (1.6 × 1.6 ≈ 2.7). So the real distinction is smaller than 180 vs. 800 suggests.
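The arithmetic above can be checked with a short sketch. It rests on assumptions not stated in the comment: a Chinchilla-style rule of thumb of roughly 20 tokens per parameter for compute-optimal training, and compute-optimal scaling in which parameters and tokens each grow as the square root of compute. The 1.6x relative efficiency penalty is taken as given from the text rather than derived, and the function and variable names are illustrative only.

```python
import math

# Assumption: ~20 tokens per parameter is compute optimal (Chinchilla-style rule of thumb).
TOKENS_PER_PARAM_OPTIMAL = 20


def overtraining_factor(params: float, tokens: float) -> float:
    """How many times more tokens were used than the compute-optimal amount."""
    return tokens / (params * TOKENS_PER_PARAM_OPTIMAL)


params = 3e9          # 3B-parameter model
short_run = 180e9     # 180B training tokens
long_run = 800e9      # 800B training tokens

over_short = overtraining_factor(params, short_run)  # about 3x
over_long = overtraining_factor(params, long_run)    # about 13x

# Same model size, so raw training compute scales with the token count.
raw_compute_ratio = long_run / short_run              # about 4.4x

# Taken from the text, not derived here: the longer run loses about 1.6x
# more compute efficiency than the shorter one.
relative_efficiency_penalty = 1.6
effective_compute_ratio = raw_compute_ratio / relative_efficiency_penalty

# Assumption: at the compute-optimal frontier, parameters and tokens each
# scale as the square root of compute.
equivalent_scale = math.sqrt(effective_compute_ratio)

print(f"overtraining: {over_short:.1f}x vs {over_long:.1f}x")
print(f"raw compute ratio: {raw_compute_ratio:.2f}x")
print(f"effective compute ratio: {effective_compute_ratio:.2f}x")
print(f"equivalent compute-optimal scale-up: {equivalent_scale:.2f}x")
```

With unrounded inputs the effective compute ratio comes out slightly above the 2.7x quoted in the text, which uses the rounded figure 4.4 / 1.6; the conclusion is the same either way.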