I have a question that I didn’t see anyone ask, but I don’t frequent this site enough to know if it was mentioned somewhere.
Are we sure there will be a 2-OOMs-bigger training run at all?
After the disappointment that was GPT-4.5, will investors give them the $100B (according to TFA) they need for that? In general I’d like to see more discussion about the financial side of the AGI race. How will OpenBrain get the funding to train Agent-4?
I’ve been looking for markets on Manifold to bet on this and I couldn’t find a good one. I would bet we don’t get a 2-OOMs-bigger model until at least 2027, by which point chip costs will have come down enough. My prediction would be that OpenBrain etc. focus on fine-tuning / wrappers / products / UX/UI for the next 2-3 years.
GPT-4.5 might’ve been trained on 100K H100s of the Goodyear Microsoft site ($4-5bn, same as first phase of Colossus), about 3e26 FLOPs (though there are hints in the announcement video it could’ve been trained in FP8 and on compute from more than one location, which makes up to 1e27 FLOPs possible in principle).
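As a rough sanity check of that 3e26 figure (the per-GPU throughput, utilization, and run length below are my assumptions, not numbers from the thread):

```python
# Back-of-envelope check of ~3e26 FLOPs for a 100K-H100 training run.
# Assumptions (mine): ~1e15 dense BF16 FLOP/s per H100, ~40% utilization (MFU),
# and a ~3-month run.
gpus = 100_000
peak_flops_per_gpu = 1e15      # dense BF16 per H100, approx.
mfu = 0.40                     # assumed utilization
seconds = 90 * 24 * 3600       # ~3 months

total_flops = gpus * peak_flops_per_gpu * mfu * seconds
print(f"{total_flops:.1e}")    # prints ~3.1e+26
```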
Abilene site of Crusoe/Stargate/OpenAI will have 1 GW of Blackwell servers in 2026, about 6K-7K racks, possibly at $4M per rack all-in, for a total of $25-30bn, which they’ve already raised money for (mostly from SoftBank). They are projecting about $12bn in revenue for 2025. If used as a single training system, it’s enough to train models for 5e27 BF16 FLOPs (or 1e28 FP8 FLOPs).
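The same kind of arithmetic recovers both the ~$25-30bn price tag and the 5e27 figure (the rack count, per-GPU throughput, utilization, and run length are again my assumptions):

```python
# Back-of-envelope check of the Abilene 2026 numbers.
# Assumptions (mine): 6,500 NVL72 racks, 72 GPUs per rack, ~2.5e15 dense BF16
# FLOP/s per Blackwell GPU, ~40% utilization, ~4-month training run.
racks = 6_500
gpus = racks * 72                           # ~468K GPUs
cost = racks * 4e6                          # $4M per rack all-in -> ~$26bn
flops = gpus * 2.5e15 * 0.40 * 120 * 86400  # ~5e27 BF16 FLOPs
print(f"~${cost/1e9:.0f}bn, ~{flops:.1e} FLOPs")
```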
The AI 2027 timeline assumes reliable agentic models work out, so revenue continues scaling, with the baseline guess of 3x per year. If Rubin NVL144 arrives 1.5 years after Blackwell NVL72, that’s about 5x increase in expected revenue. If that somehow translates into proportional investment in datacenter construction, that might be enough to buy $150bn worth of Rubin NVL144 racks, say at $5M per rack all-in, which is 30K racks and 5 GW. Compared to Blackwell NVL72, that’s 2x more BF16 compute per rack (and 3.3x more FP8 compute). This makes the Rubin datacenter of early 2027 sufficient to train a 5e28 BF16 FLOPs model (or 1.5e29 FP8 FLOPs) later in 2027. Which is a bit more than 100x the estimate for GPT-4.5.
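To see where the ~100x over GPT-4.5 comes from, the scaling factors multiply out roughly like this (same utilization and run-length assumptions as the Blackwell estimate above):

```python
# How the ~5e28 figure follows from the 5e27 Blackwell baseline.
rubin_racks = 150e9 / 5e6                     # $150bn at $5M/rack -> 30K racks
blackwell_racks = 6_500
scale = (rubin_racks / blackwell_racks) * 2   # ~4.6x the racks, 2x BF16 per rack
print(f"{scale:.1f}x")                        # ~9x the Blackwell training compute
print(f"{5e27 * scale:.1e}")                  # ~4.6e+28, i.e. roughly 5e28 BF16 FLOPs
```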
(I think this is borderline implausible technologically if only the AI company believes in the aggressive timeline in advance, and ramping Rubin to 30K racks for a single company will take more time. Getting 0.5-2 GW of Rubin racks by early 2027 seems more likely. Using Blackwell at that time means ~2x lower performance for the same money, undercutting the amount of compute that will be available in 2027-2028 in the absence of an intelligence explosion, but at least it’s something money will be able to buy. And of course this still hinges on the revenue actually continuing to grow, and translating into capital for the new datacenter.)
Do I have the high level takeaways here correct? Forgive my use of the phrase “Training size,” but I know very little about different chips, so I am trying to distill it down to simple numbers.
2024:
a) OpenAI revenue: $3.7 billion.
b) Training size: 3e26 to 1e27 FLOPs.
c) Training cost: $4-5 billion.
2025 Projections:
a) OpenAI revenue: $12 billion.
b) Training size: 5e27 FLOPs.
c) Training cost: $25-30 billion.
2026 Projections:
a) OpenAI revenue: ~$36 billion to $60 billion.
At this point I am confused: why are you saying that Rubin arriving after Blackwell would make the revenue more like $60 billion? Again, I know very little about chips. Wouldn’t the arrival of a different chip also change OpenAI’s costs?
b) Training size: 5e28 FLOPs.
c) Training cost: $150 billion.
Assuming investors are willing to take the same ratio of revenue : training cost as before, this would predict $70 billion to $150 billion. In other words, getting to the $150 billion mark requires that Rubin arrives after Blackwell, OpenAI makes $60 billion in revenue, and investors apply a 2.5x multiplier: $60 billion x 2.5 = $150 billion.
Is there anything else that I missed?
A 100K H100s training system is a datacenter campus that costs about $5bn to build. You can use it to train a 3e26 FLOPs model in ~3 months, and that time costs about $500M. So the “training cost” is $500M, not $5bn, but in order to do the training you need exclusive access to a giant 100K H100s datacenter campus for 3 months, which probably means you need to build it yourself, which means you still need to raise the $5bn. Outside these 3 months, it can be used for inference or training experiments, so the $5bn is not wasted; it’s just a bit suboptimal to build that much compute in a single place if your goal is primarily to serve inference around the world, because it will be quite far from most places in the world. (The 1e27 estimate is the borderline implausible upper bound, and it would take more than $500M in GPU-time to reach; 3e26 BF16 FLOPs or 6e26 FP8 FLOPs are more likely with just the Goodyear campus.)
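The $500M figure is just amortization of the campus over its useful life (the ~30-month life below is my assumption; the comment only gives the $5bn and $500M endpoints):

```python
# Why ~3 months of a $5bn campus costs ~$500M rather than $5bn.
# Assumption (mine): hardware amortized over a ~30-month useful life.
campus_cost = 5e9
run_cost = campus_cost * 3 / 30    # 3 of ~30 months
print(f"${run_cost/1e6:.0f}M")     # prints $500M
```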
Abilene site of Stargate is only building about 100K chips (2 buildings, ~1500 Blackwell NVL72 racks, ~250 MW, ~$6bn) by summer 2025, most of the rest of the 1.2 GW buildout happens in 2026. The 2025 system is sufficient to train a 1e27 BF16 FLOPs model (or 2e27 FP8 FLOPs).
Rubin arriving 1.5 years after Blackwell means you have 1.5 years of revenue growth to use as an argument about valuation to raise money for Rubin, not 1 year. The recent round raised money for a $30bn datacenter campus, so if revenue actually keeps growing at 3x per year, then it’ll grow 5x in 1.5 years. As the current expectation is $12bn, in 1.5 years the expectation would be $60bn (counting from an arbitrary month, without sticking to calendar years). And 5x of $30bn is $150bn, anchoring to revenue growth, though actually raising this kind of absurd amount of money is a separate matter that also needs to happen.
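Numerically (just restating the comment’s growth assumption):

```python
# 3x/year revenue growth compounded over 1.5 years, applied to revenue and capex.
growth = 3 ** 1.5                                 # ~5.2x
print(f"{growth:.1f}x")
print(f"~${12e9 * growth / 1e9:.0f}bn revenue")   # ~$62bn
print(f"~${30e9 * growth / 1e9:.0f}bn capex")     # ~$156bn, i.e. roughly $150bn
```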
If miraculously Nvidia actually ships 30K Rubin racks in early 2027 (to a single customer), training will only happen a bit later; that is, you’ll only have an actual 5e28 BF16 FLOPs model by mid-2027, not in 2026. Building the training system costs $150bn, but the minimum necessary cost of 3-4 months of the training system’s time is only about $15bn.
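The ~$15bn follows from the same amortization logic as the $500M figure earlier (again assuming a ~30-month useful life, which is my assumption, not a number from the thread):

```python
# ~3 months of exclusive time on a $150bn training system.
print(f"${150e9 * 3 / 30 / 1e9:.0f}bn")   # prints $15bn
```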
More likely this only happens several months later, in 2028, and at that point the better Rubin Ultra NVL576 (Kyber) is coming out. That’s a reason to avoid tying up the full $150bn in capital in the inferior non-Ultra Rubin NVL144 racks and to instead wait for Rubin Ultra, expending somewhat less than $150bn on non-Ultra Rubin NVL144 in 2027, which means only a ~2e28 BF16 FLOPs model in 2027 (and at this lower level of buildout it’s more likely to actually happen in 2027). Of course the AI 2027 timeline assumes all-encompassing capability progress in 2027, which means AI companies won’t be saving money for next year, but hardware production still needs to ramp, and money won’t be able to speed it up that much on the timescale of months.
Thank you very much, this is so helpful! I want to know if I am understanding things correctly again, so please correct me if I am wrong on any of the following:
By “used for inference,” this just means basically letting people use the model? Like when I go to the chatgpt website, I am using the datacenter campus computers that were previously used for training? (Again, please forgive my noobie questions.)
For 2025, Abilene is building a 100,000-chip campus. This is plausibly around the same number of chips as were used to train the ~3e26 FLOPs GPT-4.5 at the Goodyear campus. However, the Goodyear campus was using H100 chips, while Abilene will be using Blackwell NVL72 chips. These improved chips mean that for the same number of chips we can now train a 1e27 FLOPs model instead of just a 3e26 one. The chips can be built by summer 2025, and a new model trained by around the end of 2025.
1.5 years after the Blackwell chips, the new Rubin chip will arrive. The time is now ~2027.5.
Now a few things need to happen:
1. The revenue growth rate from 2024 to 2025 of 3x/year continues to hold. In that case, after 1.5 years, we can expect $60bn in revenue by 2027.5.
2. The ‘raised money’ : ‘revenue’ ratio of $30bn : $12bn in 2025 holds again. In that case we have $60bn x 2.5 = $150bn.
3. The decision would need to be made to purchase the $150bn worth of Rubin chips (and Nvidia would need to be able to supply this).
More realistically, assuming (1) and (2) hold, it makes more sense to wait until Rubin Ultra comes out before spending the $150bn.
Or, some type of mixed buildout would occur: some of that $150bn in 2027.5 would go to non-Ultra Rubin to train a 2e28 FLOPs model, and the remainder would be used to build an even bigger model in 2028 using Rubin Ultra.
“Revenue by 2027.5” needs to mean “revenue between summer 2026 and summer 2027”. And the time when the $150bn is raised needs to be late 2026, not “2027.5”, in order to actually build the thing by early 2027 and have it completed for several months already by mid to late 2027, to get that 5e28 BF16 FLOPs model. Also, Nvidia would need to have been expecting this (or similar sentiment elsewhere) months to years in advance, since everyone in the supply chain can be skeptical that this kind of money actually materializes by 2027, and they would need to build additional factories in 2025-2026 to meet the hypothetical demand of 2027.
By “used for inference,” this just means basically letting people use the model?
It means using the compute to let people use various models, not necessarily this one, while the model itself might end up getting inferenced elsewhere. Numerous training experiments can also occupy a lot of GPU-time, but they will be smaller than the largest training run, and so the rest of the training system can be left to do other things. In principle some AI companies might offer cloud provider services and sell the time piecemeal on the older training systems that are no longer suited for training frontier models, but very likely they have a use for all that compute themselves.
I think he was saying:
By the time the new chip is ready, that will be 1.5 years, which implies 5x growth if we assume 3x per year. So, by the time OpenBrain is ready to build the next datacenter, we’re in middle/late 2026 instead of the beginning of ’26.
Aside from that, the idea that investment will scale proportionally seems like a huge leap of faith. If the next training run doesn’t deliver the goods, there’s no way SoftBank et al. pour in the $100B.