A 100K H100s training system is a datacenter campus that costs about $5bn to build. You can use it to train a 3e26 FLOPs model in ~3 months, and that time costs about $500M. So the “training cost” is $500M, not $5bn, but in order to do the training you need exclusive access to a giant 100K H100s datacenter campus for 3 months, which probably means you need to build it yourself, which means you still need to raise the $5bn. Outside these 3 months, it can be used for inference or training experiments, so the $5bn is not wasted; it’s just a bit suboptimal to build that much compute in a single place if your goal is primarily to serve inference around the world, because it will be quite far from most places in the world. (The 1e27 estimate is the borderline implausible upper bound, and it would take more than $500M in GPU-time to reach; 3e26 BF16 FLOPs or 6e26 FP8 FLOPs are more likely with just the Goodyear campus.)
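A rough way to see where the 3e26 FLOPs and $500M figures can come from (a minimal sketch; the per-chip throughput, ~40% utilization, and ~$2.3/GPU-hour rate are my assumptions, not taken from the comment):

```python
# Back-of-the-envelope check of the 3e26 FLOPs / $500M figures.
# Assumptions (mine): ~1e15 dense BF16 FLOP/s per H100, ~40% utilization (MFU),
# ~3 months of training, ~$2.3 per H100-hour as the value of the GPU-time.
gpus = 100_000
flops_per_gpu = 1e15          # approximate dense BF16 FLOP/s per H100
mfu = 0.4                     # assumed model FLOPs utilization
seconds = 90 * 24 * 3600      # ~3 months

training_flops = gpus * flops_per_gpu * mfu * seconds
print(f"{training_flops:.1e} FLOPs")            # ~3.1e+26

gpu_hours = gpus * 90 * 24
print(f"${gpu_hours * 2.3 / 1e9:.2f}bn of GPU-time")  # ~$0.50bn
```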
The Abilene site of Stargate is only building about 100K chips (2 buildings, ~1500 Blackwell NVL72 racks, ~250 MW, ~$6bn) by summer 2025; most of the rest of the 1.2 GW buildout happens in 2026. The 2025 system is sufficient to train a 1e27 BF16 FLOPs model (or 2e27 FP8 FLOPs).
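The same kind of arithmetic for the 2025 Abilene system (again a sketch; the rack count is from the comment, while the per-chip throughput, utilization, and training duration are my assumptions):

```python
# Sanity check of the ~1e27 BF16 FLOPs figure for the 2025 Abilene system.
# Assumptions (mine): ~2.25e15 dense BF16 FLOP/s per Blackwell chip,
# ~40% utilization, ~4 months of training time.
chips = 1500 * 72                       # ~108K Blackwell chips in NVL72 racks
bf16_per_chip = 2.25e15                 # assumed dense BF16 FLOP/s per chip
mfu = 0.4
seconds = 120 * 24 * 3600               # ~4 months

print(f"{chips * bf16_per_chip * mfu * seconds:.1e} BF16 FLOPs")  # ~1.0e+27
# FP8 throughput is roughly 2x BF16 on this hardware, hence the ~2e27 FP8 figure.
```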
Rubin arriving 1.5 years after Blackwell means you have 1.5 years of revenue growth to use as an argument about valuation to raise money for Rubin, not 1 year. The recent round raised money for a $30bn datacenter campus, so if revenue actually keeps growing at 3x per year, then it’ll grow 5x in 1.5 years. As the current expectation is $12bn, in 1.5 years the expectation would be $60bn (counting from an arbitrary month, without sticking to calendar years). And 5x of $30bn is $150bn, anchoring to revenue growth, though actually raising this kind of absurd amount of money is a separate matter that also needs to happen.
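Spelled out, the extrapolation is just compounding the growth rate over 1.5 years (a sketch of the arithmetic as I read the argument; the 3x/year growth and the $30bn:$12bn anchor are the comment's numbers, the rest is arithmetic):

```python
# Revenue-anchored raise, extrapolated 1.5 years forward.
growth_per_year = 3
revenue_now = 12      # $bn, current annualized revenue expectation
raise_now = 30        # $bn, recent round's datacenter campus

growth_1p5y = growth_per_year ** 1.5          # ~5.2x over 1.5 years
revenue_later = revenue_now * growth_1p5y     # ~$62bn, i.e. the ~$60bn figure
raise_later = raise_now * growth_1p5y         # ~$156bn, i.e. the ~$150bn figure
print(round(growth_1p5y, 1), round(revenue_later), round(raise_later))
```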
If miraculously Nvidia actually ships 30K Rubin racks in early 2027 (to a single customer), training will only happen a bit later, that is, you’ll only have an actual 5e28 BF16 FLOPs model by mid-2027, not in 2026. Building the training system costs $150bn, but the minimum necessary cost of 3-4 months of the training system’s time is only about $15bn.
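One way those two numbers can be reproduced (a rough sketch; the per-rack throughput, utilization, and the ~3-year amortization period are my assumptions, not the comment's):

```python
# Where ~5e28 BF16 FLOPs and ~$15bn could come from, under assumed numbers.
racks = 30_000
bf16_per_rack = 0.6e18        # assumed dense BF16 FLOP/s per Rubin NVL144 rack
mfu = 0.4
seconds = 90 * 24 * 3600      # ~3 months

print(f"{racks * bf16_per_rack * mfu * seconds:.1e} BF16 FLOPs")  # ~5.6e+28

# 3-4 months of the system's time out of an assumed ~3-year useful life is
# roughly a tenth of the $150bn build cost, i.e. about $15bn.
print(f"~${150 * 3.5 / 36:.0f}bn")
```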
More likely this only happens several months later, in 2028, and at that point there’s the better Rubin Ultra NVL576 (Kyber) coming out, so that’s a reason to avoid tying up the $150bn in capital in the inferior non-Ultra Rubin NVL144 racks and instead wait for Rubin Ultra, expending somewhat less than $150bn on non-Ultra Rubin NVL144 in 2027, which means only a ~2e28 BF16 FLOPs model in 2027 (and at this lower level of buildout it’s more likely to actually happen in 2027). Of course the AI 2027 timeline assumes all-encompassing capability progress in 2027, which means AI companies won’t be saving money for next year, but hardware production still needs to ramp, and money won’t be able to speed it up that much on the timescale of months.
Thank you very much, this is so helpful! I want to know if I am understanding things correctly again, so please correct me if I am wrong on any of the following:
By “used for inference,” this just means basically letting people use the model? Like when I go to the chatgpt website, I am using the datacenter campus computers that were previously used for training? (Again, please forgive my noobie questions.)
For 2025, Abilene is building a 100,000-chip campus. This is plausibly around the same number of chips that were used to train the ~3e26 FLOPs GPT-4.5 at the Goodyear campus. However, the Goodyear campus was using H100 chips, while Abilene will be using Blackwell chips (in NVL72 racks). These improved chips mean that for the same number of chips we can now train a 1e27 FLOPs model instead of just a 3e26 one. The chips can be installed by summer 2025, and a new model trained by around end of year 2025.
1.5 years after the Blackwell chips, the new Rubin chips will arrive. The time is now ~2027.5.
Now a few things need to happen:
1. The revenue growth rate from 2024 to 2025 of 3x/year continues to hold. In that case, after 1.5 years, we can expect $60bn in revenue by 2027.5.
2. The ‘raised money’ : ‘revenue’ ratio of $30bn : $12bn in 2025 holds again. In that case we have $60bn x 2.5 = $150bn.
3. The decision would need to be made to purchase the $150bn worth of Rubin chips (and Nvidia would need to be able to supply this).
More realistically, assuming (1) and (2) hold, it makes more sense to wait until the Rubin Ultra comes out before spending the $150bn.
Or, some type of mixed buildout would occur: some of that $150bn in 2027.5 would go to non-Ultra Rubin racks to train a 2e28 FLOPs model, and the remainder would be used to build an even bigger Rubin Ultra system that trains a larger model in 2028.
“Revenue by 2027.5” needs to mean “revenue between summer 2026 and summer 2027”. And the time when the $150bn is raised needs to be late 2026, not “2027.5”, in order to actually build the thing by early 2027 and have it completed for several months already by mid to late 2027 to get that 5e28 BF16 FLOPs model. Also, Nvidia would need to have been expecting this (or similar sentiment elsewhere) months to years in advance, since everyone in the supply chain can be skeptical that this kind of money actually materializes by 2027, and yet they need to build additional factories in 2025-2026 to meet the hypothetical demand of 2027.
By “used for inference,” this just means basically letting people use the model?
It means using the compute to let people use various models, not necessarily this one, while the model itself might end up getting inferenced elsewhere. Numerous training experiments can also occupy a lot of GPU-time, but they will be smaller than the largest training run, and so the rest of the training system can be left to do other things. In principle some AI companies might offer cloud provider services and sell the time piecemeal on the older training systems that are no longer suited for training frontier models, but very likely they have a use for all that compute themselves.