In principle, sufficiently granular MoEs keep the weight matrices at a manageable size, and the critical minibatch size scales quickly enough over the first several trillion tokens of pretraining that relatively small scale-up world sizes (a consequence of poor inter-chip networking and weaker individual chips) are not a barrier. So unconscionable numbers of weaker chips should still be usable (at good compute utilization) in frontier training going forward. It's still a major hurdle, though, and one that is even more expensive and complicated.
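To make the shape of this argument concrete, here is a rough back-of-envelope sketch of my own; every specific number in it (hidden size, expert width, per-chip microbatch, and the square-root growth of critical batch size with tokens seen) is a hypothetical assumption, not something from the above. The point is only the structure: per-expert matrices stay small, and a growing critical batch size means ever more data-parallel replicas of weak chips can be kept busy.

```python
# Hypothetical illustration only: all numbers are assumptions chosen to show
# why granular MoE matrices stay small and why a growing critical batch size
# lets many weak chips contribute at decent utilization.

d_model = 4096            # assumed hidden size
expert_ffn_dim = 1024     # granular experts: small per-expert FFN width
bytes_per_param = 2       # bf16 weights

# One expert's up/down projection matrices; small enough for a weak chip.
per_expert_params = 2 * d_model * expert_ffn_dim
print(f"per-expert weights: {per_expert_params * bytes_per_param / 1e6:.1f} MB")

# Assume (purely for illustration) the critical batch size grows roughly with
# the square root of tokens seen, reaching millions of tokens per step.
def critical_batch_tokens(tokens_seen, base=2**16, ref_tokens=1e9):
    return int(base * (tokens_seen / ref_tokens) ** 0.5)

per_chip_microbatch_tokens = 2048   # assumed per-step capacity of one weak chip

for tokens_seen in [1e9, 1e11, 1e12, 5e12]:
    crit = critical_batch_tokens(tokens_seen)
    # Data-parallel replicas that stay busy without exceeding the critical batch.
    usable_chips = crit // per_chip_microbatch_tokens
    print(f"{tokens_seen:.0e} tokens in: critical batch ~{crit:,} tokens "
          f"-> ~{usable_chips:,} data-parallel chips usable")
```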