There are literal interpretations of these predictions that aren’t very strong:
1. I expect a new model to be released, one which does not rely on adapting pretrained transformers or distilling a larger pretrained model
2. It will be inspired by the line of research I have outlined above, or a direct continuation of one of the listed architectures
3. It will have language capabilities equal to or surpassing GPT-4
4. It will have a smaller parameter count (by 1-2+ OOMs) compared to GPT-4
GPT-4 was rumored to have 1.8T parameters, so <180B parameters would technically satisfy prediction 4. My impression is that current ~70B open-weight models (e.g. Qwen 2.5) are already roughly as good as the original GPT-4 was. (Of course that's not a fair comparison, since the 1.8T parameter rumor is for an MoE model.)
So the load-bearing part is arguably "inspired by [this] line of research," but I'm not sure what would or wouldn't count for that. E.g. under a broad interpretation, any test-time training / continual learning approach could count, even if most of the capabilities still come from pretraining similar to current approaches. (Still a non-trivial prediction, to be clear!)
My impression was that you're intending to make stronger claims than this broad interpretation. If so, you could consider picking slightly different concretizations to make the predictions more impressive if you end up being right. For example, I'd consider 2 OOMs fewer parameters than GPT-4 noticeably more impressive than 1 OOM (and my guess is that the divergence between your view and the broader community's would be even larger at 3 OOMs fewer parameters). It might be even better to tie the prediction to compute and/or training data instead of parameters. You could also try to make the "inspired by this research" claim more concrete (e.g. "<10% of training compute before model release is spent on training on offline/IID data," if you believe some claim of that form).
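To make those parameter thresholds concrete, here's a quick sketch of what 1, 2, or 3 OOMs fewer parameters would mean, taking the rumored 1.8T figure at face value (the constant name is just illustrative, not an official number):

```python
# Rough illustration of the "N OOMs fewer parameters than GPT-4" thresholds,
# assuming the unconfirmed 1.8T total-parameter rumor.
GPT4_RUMORED_PARAMS = 1.8e12  # rumored figure, not an official one

for ooms in (1, 2, 3):
    threshold = GPT4_RUMORED_PARAMS / 10**ooms
    print(f"{ooms} OOM(s) fewer: < {threshold / 1e9:g}B parameters")

# 1 OOM(s) fewer: < 180B parameters
# 2 OOM(s) fewer: < 18B parameters
# 3 OOM(s) fewer: < 1.8B parameters
```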