Wow, if those stats are correct, the training of CodeGeeX used up to 1e24 nominal FLOPs (2.56e14 FLOP/s per chip * 1536 chips * 2.6e6 seconds), which would put it a bit ahead of Chinchilla, although it's seemingly light on parameter count. But it is somewhat easier to tile a chip with fp16 units than it is to utilize them effectively, so the true useful FLOPs may be lower.
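For concreteness, here's the back-of-the-envelope version of that arithmetic as a quick sketch; the per-chip throughput and wall-clock figures are the ones quoted above, and the Chinchilla comparison uses the usual ~6 * params * tokens approximation with its reported 70B parameters and 1.4T tokens:

```python
# Rough nominal-FLOP estimate for the CodeGeeX run, using the figures quoted above.
flops_per_chip = 2.56e14       # assumed peak fp16 FLOP/s per chip
num_chips = 1536
wallclock_seconds = 2.6e6      # roughly a month of training

nominal_flops = flops_per_chip * num_chips * wallclock_seconds
print(f"CodeGeeX nominal FLOPs: {nominal_flops:.2e}")   # ~1.0e24

# Chinchilla for comparison, via the standard 6 * N * D approximation
# (70B parameters, 1.4T tokens).
chinchilla_flops = 6 * 70e9 * 1.4e12
print(f"Chinchilla approx FLOPs: {chinchilla_flops:.2e}")  # ~5.9e23
```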
Nonetheless, that’s quite surprising, impressive, and perhaps concerning.
Distributed training runs never manage to fully utilize the hardware's nominal FLOPs, and are easy to stuff up in other ways too, but I'd expect the chips themselves to be pretty well balanced: it's obvious early in the design stage if you're going to be bottlenecked on something other than the compute units.