gwern comments on Memory bandwidth constraints imply economies of scale in AI inference

gwern 17 Dec 2023 19:55 UTC
5 points
If you read through the podcast, which is the only material I could quickly find laying out the Etched paradigm in any kind of detail, their argument seems to be that they can improve the workflow and easily pay for a trivial $1m (which is what, a measly 20 H100 GPUs?), and that, as AI eats the global white-collar economy, inference cost is the main limit and the main obstacle to justifying the training runs for even more powerful models (it does you little good to create GPT-5 if you can’t then run inference on it at a competitive cost). So plenty of companies actually would need or buy such chips, and many would find it worthwhile to make their own by finetuning on a company-wide corpus (akin to BloombergGPT).

At current economics, it might not make sense, sure; but they are big believers in the future, and point to other ways to soak up that compute: tree search, specifically. (You may not need that many GPT-4 tokens, because of its inherent limitations, so burning it onto a chip to make it >100x cheaper doesn’t do you much good, but if you can figure out how to do MCTS to make it the equivalent of GPT-6 at the same net cost...)
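
To make the "same net cost" arithmetic concrete, here is a back-of-the-envelope sketch. The function name and every number in it are illustrative assumptions, not figures from Etched or the podcast; the only point is that a chip that is N times cheaper per token buys roughly N rollouts of search for the price of one ordinary GPU answer.

```python
def affordable_rollouts(gpu_cost_per_token: float,
                        cheapness_factor: float,
                        tokens_per_rollout: int) -> int:
    """Rollouts an ASIC could run for the price of ONE rollout on a GPU.

    All inputs are hypothetical placeholders chosen for illustration.
    """
    # Per-token cost on the burned-in chip, assuming the claimed speedup
    # translates directly into a per-token price reduction.
    asic_cost_per_token = gpu_cost_per_token / cheapness_factor

    # Budget: what a single plain answer costs on the GPU today.
    budget = gpu_cost_per_token * tokens_per_rollout

    # Cost of one equally long rollout on the ASIC.
    cost_per_asic_rollout = asic_cost_per_token * tokens_per_rollout

    # round() absorbs floating-point noise in the ratio.
    return round(budget / cost_per_asic_rollout)


if __name__ == "__main__":
    # Hypothetical: $0.00003/token on GPU, chip is 100x cheaper,
    # 500 tokens per rollout.
    print(affordable_rollouts(3e-5, 100, 500))
```

Whether that many extra rollouts actually gets you anything like a two-generation jump in capability is the open question; the sketch only shows where the search budget would come from.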

I’m not sure how much I believe their proprietary simulations claiming such speedups, and I’d definitely be concerned about models changing so fast* that this doesn’t make any sense to do for the foreseeable future given all of the latencies involved (how useful would a GPT-2 ASIC be today, even if you could run it for free at literally $0/token?), so this strikes me as a very gutsy bet but one that could pay off—there are many DL hardware startups, but I don’t know of anyone else seriously pursuing the literally-make-a-NN-ASIC idea.

* right now, the models behind the big APIs like Claude or ChatGPT change fairly regularly. Obviously, you can’t really do that with an ASIC which has burned in the weights… so you would either have to be very sure you don’t want to update the model any time soon or you have to figure out some way to improve it, like pipelining models, perhaps, or maybe leaving in unused transistors which can be WORMed to periodically add in ‘update layers’ akin to lightweight finetuning of individual layers. If you believe burned-in ASICs are the future, similar to Hinton’s ‘mortal nets’, this would be a very open and almost untouched area of research: how to best ‘work around’ an ASIC being inherently WORM.
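
One way to picture the ‘update layers’ idea is a toy layer whose base weights can never be rewritten but which accepts append-only low-rank corrections. This is purely a sketch under my own assumptions: the class and method names are hypothetical, the low-rank additive patch is borrowed from LoRA-style finetuning as a stand-in, and nothing here describes how Etched or any real chip works.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def matvec(m: Matrix, x: Sequence[float]) -> List[float]:
    """Plain matrix-vector product, one output per row of m."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


class BurnedInLayer:
    """Toy model of a write-once layer (hypothetical, for illustration)."""

    def __init__(self, weights: Matrix) -> None:
        # Base weights are copied into tuples: 'burned in', never mutated.
        self._frozen: Tuple[Tuple[float, ...], ...] = tuple(
            tuple(row) for row in weights
        )
        # Patches can only be appended, mimicking spare write-once capacity.
        self._patches: List[Tuple[Matrix, Matrix]] = []

    def burn_patch(self, down: Matrix, up: Matrix) -> None:
        """Add a low-rank correction: y += up @ (down @ x)."""
        self._patches.append(
            (tuple(tuple(r) for r in down), tuple(tuple(r) for r in up))
        )

    def forward(self, x: Sequence[float]) -> List[float]:
        y = matvec(self._frozen, x)
        for down, up in self._patches:
            delta = matvec(up, matvec(down, x))
            y = [a + b for a, b in zip(y, delta)]
        return y


if __name__ == "__main__":
    layer = BurnedInLayer([[1.0, 0.0], [0.0, 1.0]])  # identity base weights
    print(layer.forward([1.0, 2.0]))
    # Rank-1 patch: add the first input to the second output.
    layer.burn_patch(down=[[1.0, 0.0]], up=[[0.0], [1.0]])
    print(layer.forward([1.0, 2.0]))
```

The base matrix stays fixed while each patch costs only a small number of extra parameters, which is the property you would want if the spare transistors are scarce.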
- gwern 25 Jun 2024 17:52 UTC
  4 points
  They appear to have launched ‘Sohu’, for LLaMA-3-70b: https://www.etched.com/announcing-etched