TPUs are already effectively leaping above the GPU trend in price-performance. It is difficult to find an exact cost for a TPU because they are not sold retail, but my own low-confidence estimates of the price of a TPU v5e place its price-performance significantly above the GPU given in the plot. I would expect that the front-runner in price-performance will cease to be what we think of as a GPU, and thus that the intrinsic architectural limitations of GPUs will cease to be the critical bottleneck.
Expecting price-performance to improve doesn’t mean we necessarily expect hardware to improve, just that we become more efficient at making hardware. Economies of scale and refinements in manufacturing technology can dramatically improve price-performance by reducing manufacturing costs, without any improvement in the underlying hardware. Of course, in reality we expect both the hardware to become faster and the price of manufacturing it to fall. This is even more true as the sheer quantity of money being poured into compute manufacturing goes parabolic.
Nvidia’s stock price and domination of the AI compute market are evidence against your strong claim that “TPUs are already effectively leaping above the GPU trend”. As is the fact that Google Cloud is, from what I can tell, still more successful renting out Nvidia GPUs than TPUs, and is still trying to buy H100s in bulk.
There isn’t a lot of info yet on the TPU v5e, and zero independent benchmarks to justify such a strong claim (Nvidia dominates MLPerf benchmarks).
Google’s own statements on TPU v5e also contradict the claim:
Google has often compared its TPUs to Nvidia’s GPUs but was cautious with the TPU v5e announcement. Google stressed it was focused on offering a variety of AI chips to its customers, with Nvidia’s H100 GPUs in the A3 supercomputer and TPU v5e for inferencing and training.
The performance numbers point to the TPU v5e being adapted for inferencing instead of training. The chip offers a peak performance of 393 teraflops of INT8 performance per chip, which is better than the 275 teraflops on TPU v4.
But the TPU v5e scores poorly on BF16 performance, with its 197 teraflops falling short of the 275 teraflops on the TPU v4.
It apparently doesn’t have FP8, and its INT8 perf is less than the peak FP8 throughput of an RTX 4090, which costs only ~$1,500 (and is the current champion in flops per dollar). The H100 has petaflops of FP8/INT8 perf per chip.
From my notes: your statement about the RTX 4090 leading the pack in flops per dollar does not seem correct based on these sources; perhaps you have a better source for your numbers than I do.
I did not realize that H100 had >3.9 PFLOPS at 8-bit precision until you prompted me to look, so I appreciate that nudge. That does put the H100 above the TPU v5e in terms of FLOPS/$. Prior to that addition, you can see why I said TPU v5e was taking the lead. Note that the sticker price for TPU v5e is estimated, partly from a variety of sources, partly from my own estimate calculated from the lock-in hourly usage rates.
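As a rough sketch of that estimation method (the hourly rate and the H100 price below are placeholder assumptions for illustration, not quoted figures; the peak numbers are the ones discussed in this thread):

```python
# Back-of-envelope sketch: infer an effective chip price from committed-use
# hourly rates, then compare peak 8-bit throughput per dollar.
# All prices here are hypothetical assumptions, not quoted figures.

def implied_chip_price(hourly_rate_usd, commit_years=3):
    """Treat total committed rental spend as a rough ceiling on the chip's
    'sticker price' (it also bundles power, hosting, and margin)."""
    return hourly_rate_usd * 24 * 365 * commit_years

def tops_per_dollar(peak_tops, price_usd):
    return peak_tops / price_usd

tpu_v5e_price = implied_chip_price(1.00)  # assumed $1.00/chip-hour, 3-yr lock-in
h100_price = 30_000                       # assumed street price

print(tops_per_dollar(393, tpu_v5e_price))  # TPU v5e peak INT8 TOPS per chip
print(tops_per_dollar(3958, h100_price))    # H100 peak FP8 TOPS (2:1 sparse)
```

Under these assumptions the H100 comes out ahead per dollar, consistent with the conclusion above; different rental rates or street prices shift the comparison.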
Note that FP8 and INT8 are both 8-bit computations and are in a certain sense comparable if not necessarily equivalent.
There are many different types of “TFLOPS” that are not directly comparable, independent of precision. The TPU v5e does not have anything remotely close to 393 TFLOPS of general-purpose ALU performance. The number you are quoting is the max perf of its dedicated matmul ASIC units, which are most comparable to Nvidia tensorcores, but worse as they are less flexible (much larger block volumes).
The RTX 4090 has ~82 TFLOPS of general-purpose SIMD 32/16-bit flops, considerably more than the 51 or 67 TFLOPS of even the H100. I’m not sure what the general ALU flops of the TPU are, but they are almost certainly much less than the H100’s and therefore less than the 4090’s.
The 4090’s theoretical tensorcore perf is 330/661 TFLOPS for fp16[1] and 661/1321[2][3] for fp8 dense/sparse (sparse using Nvidia’s 2:1 local block sparsity encoding), plus 661 INT8 TOPS (which isn’t as useful as fp8, of course). You seem to be using the sparse 2:1 fp8 tensorcore perf, or possibly even the 4-bit pathway perf, for the H100, so that is most comparable. So if you are going to use INT8 precision for the TPU, the 4090 has nearly double that, with 661 8-bit integer TOPS for about a quarter of the price. The 4090 has roughly an OOM lead in low-precision flops/$ (in theory).
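To make that flops/$ arithmetic explicit (dense INT8 TOPS are the spec-sheet figures discussed here; the street prices are assumptions and move the ratio around a lot):

```python
# Rough TOPS-per-dollar comparison. Peak dense INT8 TOPS from spec sheets;
# prices are assumed street prices, not authoritative.

cards = {
    #            dense INT8 TOPS, assumed price (USD)
    "RTX 4090": (661,             1_600),
    "H100":     (1_979,           30_000),
}

for name, (tops, price) in cards.items():
    print(f"{name}: {tops / price:.3f} TOPS/$")

# How big is the 4090's lead under these price assumptions?
ratio = (661 / 1_600) / (1_979 / 30_000)
print(f"4090 lead: ~{ratio:.1f}x")  # order of magnitude depends on assumed prices
```

With these particular prices the dense-INT8 lead comes out around 6x; cheaper 4090s or pricier H100s push it closer to a full order of magnitude.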
Of course, what actually matters is practical real-world benchmark perf, due to the complex interactions between RAM and cache quantity, the various types of bandwidth (on-chip across the various caches, off-chip to RAM, between chips, etc.), and so on; and Nvidia dominates in most real-world benchmarks.
Would you say we are limited by GPU RAM instead? I don’t see that growing as fast.
Lol what? Nvidia’s stock price says otherwise (as does a deep understanding of the hardware).
Could you lay that out for me, a little bit more politely? I’m curious.
[1] Wikipedia
[2] Tom’s Hardware
[3] Nvidia Ada GPU architecture