For Claude 3.5, Amodei says the training run cost “a few $10M’s”, which translates to somewhere between 1e25 FLOPs (H100, $40M, $4/hour, 30% utilization, BF16) and 1e26 FLOPs (H100, $80M, $2/hour, 50% utilization, FP8); my point estimate is 4e25 FLOPs.
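For concreteness, a minimal sketch of that cost-to-compute conversion, using the same assumptions as above (the H100 peak throughputs are the published dense specs; the budget, $/GPU-hour, and utilization figures are guesses):

```python
# Back-of-the-envelope: convert a training budget into FLOPs.
# H100 dense peaks: ~989e12 FLOP/s in BF16, ~1.98e15 FLOP/s in FP8.
def flops_from_budget(budget_usd, usd_per_gpu_hour, peak_flops, utilization):
    gpu_seconds = (budget_usd / usd_per_gpu_hour) * 3600
    return gpu_seconds * peak_flops * utilization

low = flops_from_budget(40e6, 4, 989e12, 0.30)    # ≈ 1.1e25 FLOPs
high = flops_from_budget(80e6, 2, 1.98e15, 0.50)  # ≈ 1.4e26 FLOPs
print(f"Claude 3.5 training compute: {low:.1e} to {high:.1e} FLOPs")
```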
GPT-4o was trained around the same time (late 2023 to very early 2024), and given that OpenAI’s current training system seems to take the form of three buildings totaling 100K H100s (the Goodyear, Arizona site), they probably had one of those buildings, about 32K H100s, at the time, which over 3 months at 40% utilization in BF16 gives 1e26 FLOPs.
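The same arithmetic from the hardware side (32K H100s, 3 months, and 40% utilization are my guesses, not anything OpenAI has confirmed):

```python
# ~32K H100s for ~3 months at 40% utilization, BF16 dense peak ~989e12 FLOP/s per GPU.
gpt4o_flops = 32_768 * 989e12 * 0.40 * 90 * 86400  # ≈ 1.0e26 FLOPs
```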
Gemini 2.0 was released concurrently with the announcement of general availability of 100K TPUv6e clusters (the instances you can actually book are much smaller), so Google probably has several of them, and Jeff Dean’s remarks suggest they might’ve been able to connect some of them for pretraining. Each one can contribute 3e26 FLOPs (conservatively assuming BF16). Hassabis noted on a podcast a few months back that scaling compute 10x each generation seems like a good target when fighting through the engineering challenges of each scale-up. Gemini 1.0 Ultra was trained on either 77K TPUv4 (according to The Information) or 14 pods of 4096 TPUv4 each (according to EpochAI’s quote from SemiAnalysis), so my point estimate for Gemini 1.0 Ultra is 8e25 FLOPs.
This gives 6e26-9e26 FLOPs for Gemini 2.0 (from 2-3 of the 100K TPUv6e clusters). But it’s unclear whether this is what went into Gemini 2.0 Pro, or whether there is also an unannounced Gemini 2.0 Ultra down the line.
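And the TPU side of the same back-of-the-envelope (the chip peaks are the published specs, while the ~3 month durations and 40% utilization are my assumptions chosen to match the per-cluster figure above):

```python
# TPUv6e (Trillium) BF16 peak ~9.2e14 FLOP/s; TPUv4 BF16 peak ~2.75e14 FLOP/s.
per_cluster = 100_000 * 9.2e14 * 0.40 * 90 * 86400      # ≈ 2.9e26 FLOPs per 100K TPUv6e cluster
gemini_2_range = (2 * per_cluster, 3 * per_cluster)      # ≈ 5.7e26 to 8.6e26, i.e. 6e26-9e26 FLOPs
gemini_1_ultra = 77_000 * 2.75e14 * 0.40 * 100 * 86400   # ≈ 7.3e25, near the 8e25 point estimate
```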
Thank you. Given the extreme uncertainty about the timing and impact of AGI, it’s nice to have at least something definite.
It seems we are already at the GPT-4.5 level then? Except that reasoning models have muddied everything: as I understand it, adding an OOM of output (inference) compute can have roughly the same effect as an extra ~OOM of training compute.
By the way, you’ve analyzed pretraining scaling a lot, but what about inference scaling? It seems that o3 has already used thousands of GPUs to solve the ARC-AGI tasks.