To make this the case, the new paradigm, once properly studied and optimized, should lead to more efficient AI systems than DL, at least below the threshold where it stops scaling. In other words, the alternative paradigm should make it possible to reach the same level of performance with less compute. For example, imagine the new paradigm used statistical models with a training procedure close to kosher Bayesian inference, and thus came with a near-guarantee of squeezing all the information out of the training data (within the capped intelligence of the model).
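As a toy illustration of what "squeezing all the information out of the training data" would mean, here is a minimal sketch (a conjugate Beta-Bernoulli model, chosen purely for illustration) of exact Bayesian inference: the posterior depends on the observations only through their sufficient statistics, so nothing in the sample is wasted.

```python
# Toy sketch: exact Bayesian inference on a Beta-Bernoulli model.
# The posterior is an exact function of the sufficient statistics
# (number of heads and tails), i.e. it uses all the information the
# data carries about the parameter.
import random

random.seed(0)

true_p = 0.7
data = [1 if random.random() < true_p else 0 for _ in range(50)]

# Beta(1, 1) prior (uniform); conjugate update gives Beta(1 + heads, 1 + tails).
alpha, beta = 1.0, 1.0
heads = sum(data)
tails = len(data) - heads
alpha_post, beta_post = alpha + heads, beta + tails

posterior_mean = alpha_post / (alpha_post + beta_post)
posterior_var = (alpha_post * beta_post) / (
    (alpha_post + beta_post) ** 2 * (alpha_post + beta_post + 1)
)

print(f"posterior mean  ~ {posterior_mean:.3f}")
print(f"posterior stdev ~ {posterior_var ** 0.5:.3f}")
```

The conjugacy is what makes the exact update a two-line computation; in any realistic model the inference has to be approximated, which is where the "close to" in the paragraph above does the work.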
Comment to myself after a few months:
I now think that LLM pre-training is probably already pretty statistically efficient, so nope, can’t do substantially better through the route in this specific example.