I don’t understand how energy is still an appropriate unit for measuring compute capacity when there are two different chip paradigms. Do Nvidia cards and Ironwood TPUs give the exact same performance for the same energy input? What exactly are the differences in capacity to train/deploy models between the 1 GW capacity Anthropic will have and the 1 GW OpenAI will have? I looked into this a bit and it seems like TPUs are explicitly designed for inference only; is that accurate? I feel like compiling this kind of information somewhere would be a good idea, since it’s all rather opaque, technical, and obfuscated by press releases that seek to push a “look at our awesome 11-figure chip deal” narrative rather than provide actual transparency about capacity.
The Anthropic announcement says “up to one million TPUs”, and the Ironwood announcement claims 4.6e15 FP8 FLOP/s per chip. A 2-die GB200 chip produces 5e15 dense FP8 FLOP/s, and there are about 400K chips in the 1 GW phase of the Abilene system.
Thus if the Anthropic contract is for TPUv7 Ironwood, their 1 GW system will have about 2x the FLOP/s of the Abilene 1 GW system (probably because Ironwood is 3nm, while Blackwell is 4nm, which is a minor refinement of 5nm). Though it’s not clear that the Anthropic contract is for one system in the way Abilene is, that is, for datacenters with sufficient bandwidth between them. But Google has had a lot of time to set up inter-datacenter networking, so this is plausible even for collections of somewhat distant datacenter buildings. If this isn’t the case, then it’s only good for RLVR and inference, not for the largest pretraining runs.
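A rough back-of-the-envelope check of that ratio, using only the figures quoted above; treating the full “up to one million” TPUs as the 1 GW fleet (and ~400K GB200s as the 1 GW Abilene phase) is an assumption, not something the announcements pin down:

```python
# Aggregate dense FP8 throughput implied by the quoted figures.
# Assumption: the full one million Ironwood TPUs correspond to the
# 1 GW Anthropic build, and ~400K GB200 chips to 1 GW of Abilene.

IRONWOOD_FP8_FLOPS = 4.6e15   # dense FP8 FLOP/s per Ironwood (TPUv7) chip
GB200_FP8_FLOPS = 5.0e15      # dense FP8 FLOP/s per 2-die GB200 chip

ironwood_chips = 1_000_000    # "up to one million TPUs"
gb200_chips = 400_000         # ~400K chips in the 1 GW Abilene phase

anthropic_total = ironwood_chips * IRONWOOD_FP8_FLOPS   # ~4.6e21 FLOP/s
abilene_total = gb200_chips * GB200_FP8_FLOPS           # ~2.0e21 FLOP/s

print(f"Anthropic 1 GW: {anthropic_total:.1e} FLOP/s")
print(f"Abilene 1 GW:   {abilene_total:.1e} FLOP/s")
print(f"Ratio:          {anthropic_total / abilene_total:.1f}x")  # ~2.3x
```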
The reason things like this could happen is that OpenAI needed to give the go-ahead for the Abilene system in 2024, when securing a 1 GW Ironwood system from Google plausibly wasn’t in the cards, and in any case they wouldn’t want to depend on Google too much, because GDM is a competitor (and the Microsoft relationship was already souring). On the other hand, Anthropic still has enough AWS backing to make some dependence on Google less crucial, and they only needed to learn recently about the feasibility of a 1 GW system from Google. Perhaps OpenAI will be getting a 1-2 GW system from Google as well at some point, but then Nvidia Rubin (not to mention Rubin Ultra) is not necessarily worse than Google’s next thing.
Do Nvidia cards and Ironwood TPUs give the exact same performance for the same energy input?

I think it’s a fair assumption that they are close enough. If they weren’t, why on Earth would someone still be using whichever happened to be the vastly more inefficient option?
Because it’s what they can get. A factor of two or more in compute is plausibly less important than a delay of a year.
This may or may not be the case, but the argument for why it can’t be very different fails.
Well, within reason that can happen—I am not saying the metric is going to be perfect. But it’s probably a decent first-order approximation, because that logic can’t stretch forever. If instead of a factor of 2 it were a factor of 10, the trade-off would probably not be worth it.
Data. Find out the answer.
https://www.wevolver.com/article/tpu-vs-gpu-a-comprehensive-technical-comparison
Looks like they are within 2x of the H200s, albeit with some complexity in the details.
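For a per-watt sanity check along the same lines, here is a minimal sketch using only the figures from upthread rather than the H200 numbers in the link. It assumes each 1 GW figure is spread evenly across the stated chip count, so facility overhead (cooling, networking) is folded into the per-chip power, which is not a measured TDP:

```python
# Implied dense FP8 FLOP/s per watt, spreading each 1 GW facility's
# power evenly over its chips (assumption: overhead folded in).

SITE_POWER_W = 1e9            # 1 GW per facility

ironwood_chips, ironwood_flops = 1_000_000, 4.6e15
gb200_chips, gb200_flops = 400_000, 5.0e15

ironwood_per_watt = ironwood_flops / (SITE_POWER_W / ironwood_chips)  # ~4.6e12
gb200_per_watt = gb200_flops / (SITE_POWER_W / gb200_chips)           # ~2.0e12

print(f"Ironwood: {ironwood_per_watt:.1e} FLOP/s per W (implied)")
print(f"GB200:    {gb200_per_watt:.1e} FLOP/s per W (implied)")
print(f"Ratio:    {ironwood_per_watt / gb200_per_watt:.1f}x")  # ~2.3x
```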
Thanks! I guess my original statement came off a bit too strong, but what I meant is that while there is a frontier for trade-offs (maybe the GPUs’ greater flexibility is worth the 2x energy cost?), I didn’t expect the gap to be orders of magnitude. That’s good enough for me, with the understanding that any such estimates will never be particularly accurate anyway and just give us a rough idea of how much compute these companies are actually fielding. What they squeeze out of that will depend on a bunch of other details anyway, so scale is the best we can guess.