The "Scaling Laws for Neural Language Models" paper says that the optimal model size scales roughly 5x with 10x more compute. To put numbers on it, using GPT-3 as a reference point (about 4,000 PetaFLOP/s-days of training compute for a roughly 200-billion-parameter model), a 100-trillion-parameter model would require on the order of 4,000 ExaFLOP/s-days (assuming the GPT-3 architecture, so no sparse or linear transformer improvements). To be fair, the scaling-laws paper also predicts that the scaling laws break down somewhere around 1 trillion parameters.
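To make the rule concrete, here is a minimal sketch, assuming the rule can be read as a power law N_opt ∝ C^ALPHA, where ALPHA ≈ 0.7 is an exponent I'm choosing so that 10x compute gives roughly 5x parameters (my reading, not a value quoted from the paper):

```python
# Sketch of the "optimal model size scales ~5x per 10x compute" rule,
# modelled as a power law N_opt ∝ C**ALPHA.
# ALPHA = 0.7 is an assumption chosen so that 10x compute -> ~5x parameters.

ALPHA = 0.7

def optimal_size_multiplier(compute_multiplier: float) -> float:
    """Factor by which the compute-optimal model grows for a given compute factor."""
    return compute_multiplier ** ALPHA

print(f"10x compute  -> ~{optimal_size_multiplier(10):.1f}x parameters")   # ~5.0x
print(f"100x compute -> ~{optimal_size_multiplier(100):.1f}x parameters")  # ~25x
```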
Fugaku's peak FP16 performance seems to be about 2 ExaFLOP/s. If we are generous and assume 30% of peak hardware utilization when training a transformer model (about the same efficiency as a well-optimized large GPU cluster), training would take around 6,700 days, or close to 20 years.
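The time estimate is just required compute divided by sustained throughput. A quick sketch plugging in the rough figures from this post (4,000 ExaFLOP/s-days of required compute, 2 ExaFLOP/s peak FP16, 30% utilization), none of which are measured values:

```python
# Back-of-envelope training time on Fugaku, using the rough estimates above.

SECONDS_PER_DAY = 86_400

required_compute_flops = 4_000 * 1e18 * SECONDS_PER_DAY  # ~4,000 ExaFLOP/s-days
peak_fp16_flops = 2e18                                    # ~2 ExaFLOP/s peak FP16
utilization = 0.30                                        # generous hardware utilization

sustained_flops = peak_fp16_flops * utilization
training_days = required_compute_flops / sustained_flops / SECONDS_PER_DAY

print(f"~{training_days:,.0f} days (~{training_days / 365:.0f} years)")
# -> roughly 6,700 days, i.e. close to 20 years
```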
Fugaku seems to have cost about $1B, which leads me to believe that GPUs deliver much better FP16 FLOPS per dollar than the ARM SVE architecture it uses. In any case, even with GPUs, it is clear we are still some years away unless we find a more efficient neural language model architecture.
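As a rough sanity check on that cost intuition, here is a back-of-envelope peak-FP16-FLOPS-per-dollar comparison. The GPU price is my assumption (roughly what a single NVIDIA A100 went for at the time), and this compares peak throughput only, ignoring interconnect, memory, power, and real-world utilization:

```python
# Very rough peak-FP16-FLOPS-per-dollar comparison.
# Fugaku figures are the estimates above; the A100 price is an assumption.

fugaku_cost_usd = 1e9       # ~$1B (estimate)
fugaku_peak_fp16 = 2e18     # ~2 ExaFLOP/s peak FP16

a100_cost_usd = 15_000      # assumed street price of one NVIDIA A100
a100_peak_fp16 = 312e12     # 312 TFLOP/s FP16 tensor-core peak (dense)

fugaku_flops_per_dollar = fugaku_peak_fp16 / fugaku_cost_usd
a100_flops_per_dollar = a100_peak_fp16 / a100_cost_usd

print(f"Fugaku: {fugaku_flops_per_dollar / 1e9:.1f} GFLOP/s per $")
print(f"A100:   {a100_flops_per_dollar / 1e9:.1f} GFLOP/s per $")
print(f"GPU advantage: ~{a100_flops_per_dollar / fugaku_flops_per_dollar:.0f}x")
```

Under these assumptions the GPU comes out roughly an order of magnitude ahead on peak FP16 per dollar, which is consistent with the intuition above.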