Yeah, I think this could cut both ways. On the one hand, if no-MatMul models really are more efficient in the long run, you could probably build custom hardware optimized for what they actually need (e.g. lots of ternary add/subtract operations instead of multiplies). On the other hand, getting there from the ASICs currently in development would require a real pivot.
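To make "lots of ternary stuff" concrete, here's a minimal sketch (mine, not from the paper) of why ternary weights remove the multiplier: with weights constrained to {-1, 0, +1}, each multiply-accumulate degenerates into an add, a subtract, or a skip, which is exactly the kind of operation a dedicated ASIC datapath could make very cheap.

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product where W's entries are all in {-1, 0, +1}.

    No multiplications needed: each output element is just a sum of the
    inputs selected by +1 weights minus a sum of those selected by -1.
    This is the datapath a ternary-native ASIC could hardwire.
    """
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        y[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
    return y

# Sanity check against an ordinary matmul
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # random ternary weight matrix
x = rng.standard_normal(8)
assert np.allclose(ternary_matvec(W, x), W @ x)
```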
Maybe the race dynamics actually help slow things down here? Nobody wants to pivot and fall temporarily behind: funding might dry up, or someone else might get there before the investment pays off and you get the chance to leapfrog.
But yeah, even in the medium run, as hardware constraints start to bite, ASICs are probably a factor in which architectures win out.
Yeah, you’ve convinced me I was putting it a little too weakly by just saying “the scaling laws are untested”. I had the same feeling of “maybe I’m getting Eulered here, and maybe they’re Eulering themselves” with the 10^23 FLOPs thing.
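For scale, a back-of-envelope using the standard Chinchilla-style approximation C ≈ 6ND (my arithmetic, not a figure from the paper): 10^23 FLOPs is roughly a 7B-parameter model trained on ~2.4T tokens, i.e. Llama-2-7B territory. So the claimed crossover sits right at the edge of what's actually been tested, which is exactly where extrapolated scaling laws get dicey.

```python
# Back-of-envelope (my assumption, not from the paper): the standard
# approximation says training compute C ~= 6 * N * D,
# for N parameters and D training tokens.
N = 7e9      # 7B parameters (roughly Llama-2-7B scale)
D = 2.4e12   # ~2.4T training tokens
C = 6 * N * D
print(f"C ~= {C:.1e} FLOPs")   # -> C ~= 1.0e+23 FLOPs
```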
Mostly I just kept seeing suggested articles in the mainstream-ish tech press about this “wow, no MatMul” thing, assumed it was overhyped or misleading, and was pleasantly surprised it was for real (as far as it goes). But I’d give it maybe… 15% of having industrial use cases in the next few years? Which I guess is actually pretty high! Could be nice for really, really huge context windows, where attention’s quadratic scaling on input length hurts.
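On the context-window point, a rough per-layer FLOP comparison (illustrative numbers I picked, not from the paper): self-attention's score matrix costs on the order of L²·d, while a recurrent token mixer (the rough shape of the MatMul-free model's GRU-like layer) costs on the order of L·d², so the recurrent side pulls ahead once L greatly exceeds d.

```python
# Rough per-layer FLOP comparison (illustrative assumptions, not from
# the paper): attention ~ 2 * L^2 * d for the score matrix, vs. a
# recurrent mixer ~ c * L * d^2 for some small constant c.
d = 4096          # hidden size (assumed)
for L in (4_096, 131_072, 1_048_576):     # 4K, 128K, 1M tokens
    attention = 2 * L**2 * d
    recurrent = 6 * L * d**2              # c = 6, an arbitrary guess
    print(f"L={L:>9,}: attention/recurrent ~ {attention / recurrent:,.1f}x")
```

At 4K tokens the recurrent side is actually more expensive under these assumptions; at 128K it's ~10x cheaper, and at 1M it's ~85x cheaper, which is why the huge-context niche seems like the most plausible early use case.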