Distributed training runs never manage to fully utilize the hardware's nominal FLOPs, and are easy to stuff up in other ways too, but I'd expect the chips themselves to be pretty well laid out; it's obvious early in the design stage if you're going to be bottlenecked on something else.
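To make the "nominal vs. achieved FLOPs" gap concrete, here is a minimal sketch of the usual model-FLOPs-utilization (MFU) bookkeeping, assuming the common ~6·N FLOPs-per-token approximation for transformer training; every number in it is an illustrative assumption, not a measurement from any real run.

```python
# Rough MFU estimate: achieved training FLOPs vs. the spec-sheet nominal FLOPs.
# All figures below are hypothetical, for illustration only.

def model_flops_utilization(params: float, tokens_per_sec: float,
                            num_chips: int, peak_flops_per_chip: float) -> float:
    """Estimate MFU using the ~6 * params FLOPs-per-token rule of thumb."""
    achieved_flops = 6 * params * tokens_per_sec      # FLOPs the training run actually performs per second
    nominal_flops = num_chips * peak_flops_per_chip   # what the hardware spec sheet promises
    return achieved_flops / nominal_flops

# Hypothetical run: 70B-parameter model on 1,000 chips rated at 1e15 FLOP/s each.
print(model_flops_utilization(params=70e9, tokens_per_sec=1.0e6,
                              num_chips=1000, peak_flops_per_chip=1e15))
# -> ~0.42, i.e. well short of the nominal figure, which is typical for large distributed runs.
```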