My perspective is a bit different: my impression is that, for any algorithm whatsoever, an ASIC tailored to that algorithm will run the algorithm much better than a general-purpose chip can.
The upshot of the Chen paper is that “sparse algorithm on commodity hardware” can outperform “dense algorithm on ASIC tailored to dense algorithm”. Missing from this picture is “sparse algorithm on ASIC tailored to sparse algorithm”. My claim is that the latter setup would perform better still.
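To make that comparison concrete, here’s a minimal numpy sketch of why a sparse algorithm does far less arithmetic per forward pass than a dense one, independent of what hardware either runs on. (The layer sizes and the random active-set selection are made up for illustration; this is not the Chen et al. implementation, which uses LSH to pick the active set.)

```python
import numpy as np

n_in, n_out = 256, 16384  # hypothetical layer dimensions
k = 128                   # hypothetical active-neuron count (<1% of n_out)

rng = np.random.default_rng(0)
W = rng.standard_normal((n_out, n_in))
x = rng.standard_normal(n_in)

# Dense forward pass: touches every weight, so n_out * n_in multiply-adds.
dense_out = W @ x

# Sparse forward pass: some selection step (LSH in the Chen et al. paper)
# picks a small active set; here we just pretend it handed us k indices.
active = rng.choice(n_out, size=k, replace=False)
sparse_out = W[active] @ x  # only k * n_in multiply-adds

print(f"dense work:  {n_out * n_in:,} MACs")
print(f"sparse work: {k * n_in:,} MACs ({k / n_out:.2%} of dense)")
```

The hardware question is then just how efficiently each of those multiply-adds (and the selection step) can be executed, which is exactly where an ASIC tailored to the sparse access pattern ought to help.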
(Incidentally, I don’t think it’s the case that literally all ML ASICs, existing or under development, are tailored exclusively to dense algorithms. I vaguely recall hearing that certain upcoming chips will be good at sparse calculations.)
So anyway, on my model, the only way an AGI developer would use commodity hardware is if they both (1) can and (2) must.
(1) involves hardware overhang—we discover new awesome AGI-capable algorithms that require “so little” compute (which may still be a ton of compute in everyday terms) that it’s feasible to get enough commodity chips to do it. (Incidentally, I think this is pretty plausible—see here. Lots of people would disagree with me on that, though.)
(2) would be if the algorithm is new (or its importance is newly appreciated), such that we have AGI before the 1-2 year period required to roll an ASIC, or if there are future treaties restricting custom ASICs or whatever. For my part, I think the former is unlikely but not impossible. As for the latter, you would know better than me.
(Another possibility is that the new AGI-capable algorithm requires so little compute that it’s not worth the effort and money to roll an ASIC tailored to that algorithm, even if there’s time to do so. I don’t put much weight on that, but who knows, I guess.)
> My perspective is a bit different: my impression is that, for any algorithm whatsoever, an ASIC tailored to that algorithm will run the algorithm much better than a general-purpose chip can.
Why do you think that? ASICs seem to benefit primarily from hardwiring control flow and removing overhead. The more control flow, the less the ASIC helps. Cryptocurrencies, starting with memory-hard PoWs, have been experimenting with ASIC resistance for a long time now. As I understand it, ASIC-resistance has succeeded in the sense that despite enormous financial incentives and over half a decade, the best ASICs for PoWs designed to be ASIC-resistant typically achieve only a small constant-factor improvement like 3x, and nothing remotely like the 10,000x speedup you might get going from a video codec on a CPU to an ASIC. You can also point to lots of AI algorithms which people don’t bother putting on GPUs because they lack the intrinsic parallelism & are control-flow heavy.
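To make the memory-hard point concrete, here’s a toy pointer-chasing loop (not any real PoW; the sizes are arbitrary). Each step is one cheap arithmetic op gated behind a dependent memory load, so an ASIC with hardwired control flow but the same DRAM gains very little: the bottleneck is memory latency, not logic.

```python
import array
import random

# ~4M 32-bit entries (~16 MB): far larger than on-chip caches, so in a
# realistic version every lookup goes out to DRAM.
SIZE = 1 << 22
random.seed(0)
table = array.array("I", (random.randrange(SIZE) for _ in range(SIZE)))

def memory_hard_mix(seed: int, steps: int = 1_000_000) -> int:
    """Pointer-chase through the table; each load depends on the last."""
    i = seed % SIZE
    acc = 0
    for _ in range(steps):
        i = table[i]  # serial dependent load: bounded by memory latency
        acc ^= i      # trivial arithmetic: nothing for an ASIC to hardwire
    return acc

print(memory_hard_mix(12345))
```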
I don’t think I disagree much. When I said “much better” I was thinking to myself “as much as 10x!”, not “as much as 10,000x!”
Yes, there are lots of AI algorithms that people don’t put on GPUs. I just suspect that if people were spending many millions of dollars running those particular AI algorithms for many consecutive years, they would probably eventually find it worth their while to make an ASIC for that algorithm. (And that ASIC might or might not look anything like a GPU.)
If crypto people are specifically designing algorithms to be un-ASIC-able, I’m not sure we should draw broader lessons from that. Like, of course off-the-shelf CPUs are going to be almost perfectly optimal for some algorithm out of the space of all possible algorithms.
Anyway, even if my previous comment (“any algorithm whatsoever”) is wrong (taken literally, it certainly is, see previous sentence, sorry for being sloppy), I’m somewhat more confident about the subset of algorithms that are AGI-relevant, since those will (I suspect) have quite a bit of parallelizability. For example, the Chen et al. algorithm described in the OP sounds pretty parallelizable (IIUC), even if it can’t be parallelized by today’s GPUs.
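To gesture at the kind of parallelism I have in mind, here’s a hedged sketch (the layer shape and the active-set selection rule are invented stand-ins, not the actual SLIDE mechanism): because each input’s active set is computed independently, the forward passes shard cleanly across CPU cores, even though the irregular, data-dependent indexing is a poor fit for a GPU’s SIMD model.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16384, 256))  # made-up layer shape

def sparse_forward(x: np.ndarray) -> np.ndarray:
    # Stand-in for an LSH lookup: a data-dependent active set of 64 neurons.
    active = (np.abs(x[:64] * 1e6).astype(np.int64) * 1009) % W.shape[0]
    return W[active] @ x

if __name__ == "__main__":
    batch = [rng.standard_normal(256) for _ in range(32)]
    # Each example is independent, so the work shards across CPU cores.
    with ProcessPoolExecutor() as pool:
        outputs = list(pool.map(sparse_forward, batch))
    print(len(outputs), outputs[0].shape)
```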