Oh yeah, I forgot about Meta. As for DeepSeek: Will they not get a ton more compute in the next year or so? I imagine they'll have an easy time raising money and getting the government to cut red tape for them now that they've made international news and become the bestselling app.
In principle, sufficiently granular MoEs keep matrices at a manageable size, and the critical minibatch size grows quickly enough in the first several trillion tokens of pretraining that relatively small scale-up world sizes (from poor inter-chip networking and weaker individual chips) are not a barrier. So unconscionable numbers of weaker chips should still be usable (at good compute utilization) in frontier training going forward. It's still a major hurdle, though, one that is even more expensive and complicated.
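To make that scaling argument concrete, here's a rough back-of-envelope sketch in Python (all numbers and function names are hypothetical illustrations, not taken from the thread): splitting the FFN width across many experts shrinks each individual weight matrix, and a critical batch size in the tens of millions of tokens leaves room for thousands of data-parallel replicas built from weaker chips.

```python
# Back-of-envelope sketch (hypothetical numbers): why fine-grained MoEs plus a
# growing critical batch size let huge counts of weaker chips stay useful.

def moe_expert_matrix_params(d_model: int, d_ff_total: int, n_experts: int) -> int:
    """Parameters in one expert's up-projection matrix when the total FFN width
    d_ff_total is split across n_experts experts (more granular => smaller matrices)."""
    d_ff_per_expert = d_ff_total // n_experts
    return d_model * d_ff_per_expert

def max_data_parallel_replicas(critical_batch_tokens: int, per_replica_batch_tokens: int) -> int:
    """How many data-parallel replicas can contribute before the global batch
    exceeds the critical batch size (past which extra batch stops buying convergence)."""
    return critical_batch_tokens // per_replica_batch_tokens

# Hypothetical config: a dense-equivalent FFN width split across 64 experts.
# Each expert matrix is 64x smaller than the dense equivalent would be.
print(moe_expert_matrix_params(d_model=8192, d_ff_total=262144, n_experts=64))

# Hypothetical: if the critical batch grows to ~60M tokens a few trillion tokens
# into pretraining, and each weak-chip replica only fits ~8k tokens per step,
# thousands of replicas can still be kept busy at good utilization.
print(max_data_parallel_replicas(critical_batch_tokens=60_000_000,
                                 per_replica_batch_tokens=8_192))
```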