1e26 FLOP would have had a significant opportunity cost.
At the end of 2023 Microsoft had 150K+ H100s, so reserving 30K doesn’t seem like too much (especially as they can use non-H100 and possibly non-Microsoft compute for research experiments). It’s difficult to get a lot of a new chip when it just comes out, or to get a lot in a single training system, or to suddenly get much more if demand surges. But for a frontier training run, there would’ve been months of notice. And the opportunity cost of not doing this is being left with an inferior model (or a less overtrained model that is more expensive at inference time, and so requires more GPUs to serve).
I don’t think it’s a good idea to reason backwards from an alleged compute budget that OpenAI might have had at a given date to infer the training FLOP of a model trained then.
The main anchors are 32K H100s in a single training system, and frontier training compute scaling 4x per year. Currently, a year later, 3e26-6e26 FLOP models are getting released (based on 100K H100s in Colossus and numbers in the Grok 3 announcement, 100K H100s at the Goodyear site, 100K TPUv6e datacenters, Meta’s 128K H100s). The $3bn figure was just to point out that $140m following from such anchors is not a very large number.
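For concreteness, here’s a rough back-of-the-envelope sketch of where these anchors land. The H100 peak of ~1e15 dense BF16 FLOP/s is public; the ~40% utilization, ~3-month run length, and ~$2 per H100-hour are my assumptions, not figures from the above.

```python
# Back-of-the-envelope training compute and cost from GPU-count anchors.
# Assumptions (mine, not from the comment above): ~1e15 FLOP/s peak dense BF16
# per H100, ~40% utilization, a ~3-month run, ~$2 per H100-hour.

H100_PEAK_FLOPS = 1e15      # dense BF16 peak, roughly
MFU = 0.4                   # assumed utilization
SECONDS = 90 * 24 * 3600    # ~3 months
DOLLARS_PER_GPU_HOUR = 2.0  # assumed rental-equivalent price

def training_flop(num_gpus: int) -> float:
    """Total training FLOP for num_gpus H100s over the assumed run."""
    return num_gpus * H100_PEAK_FLOPS * MFU * SECONDS

def run_cost(num_gpus: int) -> float:
    """Rough dollar cost of the same run at the assumed GPU-hour price."""
    return num_gpus * (SECONDS / 3600) * DOLLARS_PER_GPU_HOUR

print(f"32K H100s:  {training_flop(32_000):.1e} FLOP, ~${run_cost(32_000)/1e6:.0f}M")
print(f"100K H100s: {training_flop(100_000):.1e} FLOP, ~${run_cost(100_000)/1e6:.0f}M")
# 32K H100s:  1.0e+26 FLOP, ~$138M  -- roughly the 1e26 FLOP / $140m figures
# 100K H100s: 3.1e+26 FLOP, ~$432M  -- the low end of 3e26-6e26; a ~6-month run gives the high end
```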