Unlike every other company in the space, they aren't dependent on Nvidia's chips.
AWS is credibly becoming independent with their Trainium 2 Ultra and Project Rainier (250K H100s worth of compute in a single system). The world size isn’t on the level of GB200 NVL72 though, so it’s not nearly as good for reasoning models, but it should be fine for pretraining giant models.
There's also Huawei's CloudMatrix 384, which does rival GB200 NVL72 in world size but is built on a 7nm process with optical scale-up networking. The currently known dies were manufactured by TSMC via intermediaries, though enough were made to in principle match the Crusoe/Stargate/OpenAI Abilene datacenter campus as it will stand in 2026, and in theory Huawei's domestic manufacturing might catch up.
The thing about 7nm chips is that they are only one major step (each taking about two years) behind the current 4nm Blackwell, and only two major steps behind the future 3nm Rubin. That puts them at merely a ~4x price-performance disadvantage (compared to Rubin, in the critical window of 2027-2028), while functionally the large world size of CloudMatrix 384 should in principle allow anything that Nvidia hardware allows (in practice it might be much less reliable, and programming it might be more difficult). And that's not counting the Nvidia tax, which might cut the disparity in half, turning the ~4x cheaper (per FLOP/s) hypothetical zero-margin Rubin systems of 2027-2028 into merely ~2x cheaper (than CloudMatrix 384).
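The ~4x and ~2x figures above come from simple compounding; a minimal sketch of that arithmetic, assuming ~2x price-performance per major process step and a ~50% Nvidia gross margin (both round assumptions for illustration, not measured numbers):

```python
# Assumed round numbers, not measured data.
STEP_GAIN = 2.0      # assumed price-performance gain per major process step
STEPS_BEHIND = 2     # 7nm is two major steps behind 3nm Rubin

# Zero-margin Rubin vs CloudMatrix 384: gains compound per step.
raw_gap = STEP_GAIN ** STEPS_BEHIND
print(raw_gap)  # 4.0, the ~4x zero-margin disparity

# With an assumed ~50% gross margin ("Nvidia tax"), the effective
# price doubles, so the price-performance advantage halves.
NVIDIA_MARGIN = 0.5
effective_gap = raw_gap * (1 - NVIDIA_MARGIN)
print(effective_gap)  # 2.0, the ~2x disparity after the tax
```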
Trainium is mostly a joke, but I do agree about the Chinese firms moving away from Nvidia dependence. They will also likely have sufficient capital, but will ultimately lack data (though they may be able to make up for it with the insane talent they have? If timelines end up long, I can easily see China pulling ahead simply because their prior focus on education and talent pays off long-term).
I think it can help AWS with price-performance for the narrow goal of giant pretraining runs, where the capex on training systems might soon be the primary constraint on scaling. For reasoning training (if it does scale), building a single training system is less relevant; the usual geographically distributed inference buildout that hyperscalers are doing anyway would be about as suitable. And the 400K-chip Rainier system indicates that Trainium works well enough to ramp (serving as a datapoint in addition to the on-paper specification).
I don't think there is a meaningful distinction for data; all natural text data is running out anyway around 2027-2029 due to the data inefficiency of MoE. No secret stashes at Google or Meta are going to substantially help, since even 10T-100T tokens won't change the game.
You're right about text, but Google has privileged access to YouTube (a significant fraction of all video ever recorded by humans).