Just a note, I googled a bit and couldn’t find anything regarding poor OPT-175B performance.
To back up plex a bit:
It is indeed prevailing wisdom that OPT isn't very good, despite being decent on benchmarks, though the baseline comparison is generally to code-davinci-002-derived models (which do far better on benchmarks) or to smaller models like UL2 that were trained with comparable compute and significantly more data.
OpenAI noted in the original InstructGPT paper that benchmark performance can be uncorrelated with human rater preference during finetuning.
But yeah, I do think Eliezer is at most directionally correct; I suspect that LLaMA will see significant use among at least researchers and Meta AI.