The pressure of decentralization at this scale will also incentivize a lot more research on how to do search/planning. Otherwise you wind up with a lot of ‘stranded’ GPU/TPU capacity: hardware that is fully usable, but isn’t needed for serving old models and can’t participate in training new scaled-up models. But if you switch to a search-and-distill-centric approach, suddenly all of that capacity comes online.
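For concreteness, here’s a minimal sketch of what such a search-and-distill loop could look like: spare inference capacity runs best-of-N search with an existing model, a verifier keeps the good outputs, and those outputs become training data for a student. All names here (`generate`, `score`, `build_distillation_set`, the score threshold) are hypothetical placeholders for illustration, not any particular lab’s method or API.

```python
import random
from typing import Callable, Tuple

# Hypothetical interfaces: in practice these would wrap a served model
# (generation) and a reward model or verifier (scoring).
Generate = Callable[[str], str]      # prompt -> candidate answer
Score = Callable[[str, str], float]  # (prompt, answer) -> quality score


def best_of_n(prompt: str, generate: Generate, score: Score,
              n: int = 8) -> Tuple[str, float]:
    """Spend spare capacity on search: sample n candidates, keep the best."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(c, score(prompt, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


def build_distillation_set(prompts, generate, score, n=8, threshold=0.5):
    """Collect (prompt, best answer) pairs good enough to train a student on."""
    dataset = []
    for p in prompts:
        answer, s = best_of_n(p, generate, score, n=n)
        if s >= threshold:  # only distill answers the verifier likes
            dataset.append((p, answer))
    return dataset


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    rng = random.Random(0)
    toy_generate = lambda prompt: f"{prompt} -> answer#{rng.randint(0, 99)}"
    toy_score = lambda prompt, ans: rng.random()

    data = build_distillation_set(["q1", "q2", "q3"], toy_generate, toy_score)
    print(f"{len(data)} distillation examples collected")
    # A student model would then be fine-tuned on `data` with ordinary
    # supervised training, amortizing the search cost into its weights.
```

The point of the sketch is just that the expensive search step is embarrassingly parallel across idle accelerators, while the distillation step is a standard training job, which is why otherwise-stranded capacity becomes useful.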
I came to the comments section to say a similar thing. Right now, the easiest way for companies to push the frontier of capabilities is to throw more hardware and electricity at the problem, along with some efficiency improvements. If the cost or unavailability of electrical power or hardware became a bottleneck, that would simply tilt the equation further in the direction of searching for more compute-efficient methods.
I believe there’s plenty of room to spend more on research there and get decent returns on investment, so I doubt the compute bottleneck would make much of a difference. I’m pretty sure we’re already well into a compute-overhang regime relative to what more efficient model architectures would cost to train.
I think the same is true for data: there’s similar room to invest more research effort in finding more data-efficient algorithms.