GPT-5 is being evaluated as if it scaled up compute in a way that it didn’t. In various ways, people are assuming it ‘cost’ far more than it did.
Even if it’s a “small” model (as the balance of evidence suggests), it doesn’t follow that it didn’t cost a lot. Suppose gpt-5-thinking is a 1-2T total param, 250B active param model, a shape that would’ve been compute optimal for some 2023 training systems, but it’s overtrained 10x using 2024 compute, and then RLVRed for the same amount of GPU-time as pretraining. Then it could well have cost about $1bn (at $2-3 per H100-hour). That would take an unlikely ~300T tokens, but gpt-oss-120b apparently already needed 100T-200T tokens, and 300T is still within a forgiving 5x repetition of a plausible amount of natural data.
I’m assuming 120 tokens/param compute optimal, anchoring to Llama 3 405B’s dense 40 tokens/param, increased 3x to account for 1:8 sparsity. At 5e25 FLOPs (2023 compute) this asks for 260B active params and 2T total, trained for 31T tokens. Overtrained 10x, this would need 5e26 FLOPs and 310T tokens, without changing the model shape. At 40% compute utilization, this is about 175 million H100-hours (in FP8), or 2.3 months on a 100K-H100 training system. If the same amount of time was used for RLVR, that’s another 175 million H100-hours (with fewer useful FLOPs), for a total of 350 million H100-hours.
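To make the arithmetic explicit, here is a minimal sketch of the calculation, assuming the standard 6ND FLOP approximation and roughly 2e15 dense FP8 FLOP/s per H100 (both of these are my assumptions, and the figures round the same way as the text):

```python
# Back-of-the-envelope sketch of the pretraining + RLVR GPU-time estimate.
# Assumptions: compute C ~ 6*N*D, ~2e15 dense FP8 FLOP/s per H100, 40% utilization,
# 120 tokens/param compute optimal, 1:8 sparsity. All figures are rough.

H100_FP8_FLOPS = 2e15        # approximate peak dense FP8 throughput per H100, FLOP/s
UTILIZATION = 0.40           # assumed compute utilization of the training run
TOKENS_PER_PARAM = 120       # assumed compute-optimal ratio at 1:8 sparsity

# Compute-optimal shape at 2023-scale compute (5e25 FLOPs):
# C = 6 * N * D with D = 120 * N  =>  N = sqrt(C / 720)
compute_2023 = 5e25
active_params = (compute_2023 / (6 * TOKENS_PER_PARAM)) ** 0.5    # ~2.6e11 (260B)
tokens_optimal = TOKENS_PER_PARAM * active_params                 # ~31T
total_params = 8 * active_params                                  # ~2.1T at 1:8 sparsity

# Overtrain 10x with 2024-scale compute, keeping the same model shape.
pretrain_flops = 10 * compute_2023                                # 5e26 FLOPs
pretrain_tokens = 10 * tokens_optimal                             # ~310T tokens

# GPU-time for pretraining.
pretrain_seconds = pretrain_flops / (H100_FP8_FLOPS * UTILIZATION)
pretrain_h100_hours = pretrain_seconds / 3600                     # ~175 million
months_on_100k_h100s = pretrain_h100_hours / 100_000 / 24 / 30.4  # ~2.3-2.4 months

# RLVR assumed to take the same GPU-time as pretraining.
total_h100_hours = 2 * pretrain_h100_hours                        # ~350 million

print(f"active params:        {active_params:.3g}")
print(f"total params:         {total_params:.3g}")
print(f"pretraining tokens:   {pretrain_tokens:.3g}")
print(f"pretraining H100-hrs: {pretrain_h100_hours:.3g}")
print(f"months on 100K H100s: {months_on_100k_h100s:.2f}")
print(f"total H100-hours:     {total_h100_hours:.3g}")
```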
At $2-3 per H100-hour, this is $700M to $1bn, in the same sense that DeepSeek-V3/R1 is $5-7M. That is, the various surrounding activities probably cost notably more than the final training runs that produce the models, though for the $1bn model the surrounding costs might only be comparable, while for the $6M model they would be much larger.
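Converting GPU-time into dollars is then direct (continuing the same sketch, with $2-3 per H100-hour as the assumed price):

```python
# Convert ~350M H100-hours into a cost range at an assumed $2-3 per H100-hour.
total_h100_hours = 350e6
low_cost = 2 * total_h100_hours    # ~$700M
high_cost = 3 * total_h100_hours   # ~$1.05bn
print(f"estimated training run cost: ${low_cost/1e6:.0f}M to ${high_cost/1e9:.2f}bn")
```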
From a different angle, OpenAI spent something like $8 billion on training compute over the period when GPT-5 was being trained, so if GPT-5 itself was cheap to train, where did the billions go?