The $6 million is disputed by a video arguing that DeepSeek used far more compute than they admit to.
The prior reference is a Dylan Patel tweet from Nov 2024, in the wake of R1-Lite-Preview release:
Deepseek has over 50k Hopper GPUs to be clear.
People need to stop acting like they only have that 10k A100 cluster.
They are omega cracked on ML research and infra management but they aren’t doing it with that many fewer GPUs
DeepSeek explicitly states that:

DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
This seems unlikely to be a lie; if it were, the reputational damage would’ve motivated not mentioning the amount of compute at all. But the most interesting thing about DeepSeek-V3 is precisely this claim, that its quality is possible with so little compute.
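For a quick sanity check on the arithmetic behind the headline number (a minimal sketch: the ~$2 per GPU-hour rental rate is the assumption DeepSeek’s own report uses, and the 2048-GPU cluster size is purely illustrative):

```python
# Sanity check: how the stated 2.788M H800 GPU-hours maps to dollars.
# Assumption: ~$2 per GPU-hour H800 rental rate (the rate the DeepSeek-V3
# report itself assumes); the cluster size below is illustrative only.

gpu_hours = 2.788e6      # H800 GPU-hours for the full training run (stated)
rate_usd = 2.0           # assumed rental price per GPU-hour

cost = gpu_hours * rate_usd
print(f"Implied training cost: ${cost / 1e6:.3f}M")   # -> $5.576M

# Wall-clock time on a hypothetical 2048-GPU cluster:
cluster_gpus = 2048
days = gpu_hours / cluster_gpus / 24
print(f"~{days:.0f} days on {cluster_gpus} GPUs")     # -> ~57 days
```

So the GPU-hours figure and the ~$6 million figure are consistent with each other under an ordinary rental-rate assumption.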
Certainly designing the architecture, the data mix, and the training process that made this possible required much more compute than the final training run, so in total it cost much more than $6 million to develop. And the 50K H100/H800 system is one way to go about that, though renting a bunch of 512-GPU instances from various clouds probably would’ve sufficed as well.
I see, thank you for the info!

I don’t actually know much about DeepSeek-V3; I just felt that if I pointed out the $6 million claim in my argument, I shouldn’t hide the fact that I had watched a video that made me doubt it. I wanted to include the video as a caveat just in case the $6 million figure was wrong.

Your explanation suggests the $6 million is still in the ballpark (for the final training run), so the concerns about a “software-only singularity” are still very realistic.