The recurrent paper is actually scary, but some of the stuff in it is questionable. Is 8 layers enough for a 3.5B model? Qwen 0.5B has 24 layers. There is also almost no difference between the models trained on 180B vs. 800B tokens when r=1 (Table 4). Is this just a case of the recurrence overcoming an insufficient number of layers?
almost no difference between the models trained on 180B vs. 800B tokens when r=1 (Table 4)
It’s a 3B parameter model, so training it for 180B tokens already overtrains it maybe 3x, and training for 800B tokens overtrains it about 13x (relative to a Chinchilla-optimal ~20 tokens per parameter). The loss of compute efficiency from the latter is about 1.6x more than from the former. With 4.4x more raw compute, it should then have about 2.7x more effective compute, and since compute-optimal scaling splits compute evenly between parameters and data, that makes it act like a compute-optimal model that’s 1.6x larger, trained on 1.6x more tokens. So the distinction is smaller than 180 vs. 800 suggests.
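For concreteness, here is the back-of-envelope arithmetic as a quick Python sketch. It assumes Chinchilla-optimal training at ~20 tokens per parameter; the ~1.6x relative compute-efficiency penalty for the more overtrained run is taken from the comment above, not derived here.

```python
# Back-of-envelope check of the numbers in the comment above.

params = 3e9                   # the comment rounds the model to 3B parameters
tokens_per_param = 20          # assumed Chinchilla-optimal tokens per parameter
optimal_tokens = params * tokens_per_param        # 60B tokens

print(180e9 / optimal_tokens)  # ~3x overtrained at 180B tokens
print(800e9 / optimal_tokens)  # ~13x overtrained at 800B tokens

raw_compute_ratio = 800 / 180  # ~4.4x more raw compute for the 800B-token run
penalty_ratio = 1.6            # extra efficiency loss at 800B vs. 180B (from the comment)
effective_ratio = raw_compute_ratio / penalty_ratio
print(effective_ratio)         # ~2.7x more effective compute

# Compute-optimal scaling splits extra compute evenly between parameters and
# tokens, so each grows as the square root of the effective-compute ratio.
scale = effective_ratio ** 0.5
print(scale)                   # ~1.6x larger model, trained on ~1.6x more tokens
```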