In principle, sufficiently granular MoEs keep the weight matrices at a manageable size, and the critical minibatch size scales quickly enough over the first several trillion tokens of pretraining that relatively small scale-up world sizes (a consequence of poor inter-chip networking and weaker individual chips) are not a barrier. So unconscionable numbers of weaker chips should still be usable (at good compute utilization) in frontier training going forward. It's still a major hurdle, though, and one that is even more expensive and complicated.
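To make the shape of this argument concrete, here is a rough back-of-envelope sketch of my own; every specific number in it (hidden size, expert width, per-chip microbatch, and the square-root growth of critical batch size with tokens seen) is a hypothetical assumption, not something from the above. The point is only the structure: per-expert matrices stay small, and a growing critical batch size means ever more data-parallel replicas of weak chips can be kept busy.

```python
# Hypothetical illustration only: all numbers are assumptions chosen to show
# why granular MoE matrices stay small and why a growing critical batch size
# lets many weak chips contribute at decent utilization.

d_model = 4096            # assumed hidden size
expert_ffn_dim = 1024     # granular experts: small per-expert FFN width
bytes_per_param = 2       # bf16 weights

# One expert's up/down projection matrices; small enough for a weak chip.
per_expert_params = 2 * d_model * expert_ffn_dim
print(f"per-expert weights: {per_expert_params * bytes_per_param / 1e6:.1f} MB")

# Assume (purely for illustration) the critical batch size grows roughly with
# the square root of tokens seen, reaching millions of tokens per step.
def critical_batch_tokens(tokens_seen, base=2**16, ref_tokens=1e9):
    return int(base * (tokens_seen / ref_tokens) ** 0.5)

per_chip_microbatch_tokens = 2048   # assumed per-step capacity of one weak chip

for tokens_seen in [1e9, 1e11, 1e12, 5e12]:
    crit = critical_batch_tokens(tokens_seen)
    # Data-parallel replicas that stay busy without exceeding the critical batch.
    usable_chips = crit // per_chip_microbatch_tokens
    print(f"{tokens_seen:.0e} tokens in: critical batch ~{crit:,} tokens "
          f"-> ~{usable_chips:,} data-parallel chips usable")
```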