Meta’s previous LLM, OPT-175B, seemed good on benchmarks but was widely agreed to be much, much worse than GPT-3 (not even necessarily better than GPT-NeoX-20B). It’s an informed guess, not a random dunk, and does leave open the possibility that they’ve turned it around and have a great model this time rather than something which Goodharts the benchmarks.
Just a note, I googled a bit and couldn’t find anything regarding poor OPT-175B performance.
To back up plex a bit:
It is indeed prevailing wisdom that OPT isn’t very good, despite being decent on benchmarks, though generally the baseline comparison is to code-davinci-002-derived models (which do way better on benchmarks) or to smaller models like UL2 that were trained with comparable compute and significantly more data.
OpenAI noted in the original InstructGPT paper that performance on benchmarks can be uncorrelated with human rater preference during finetuning.
But yeah, I do think Eliezer is at most directionally correct; I suspect that LLaMA will see significant use among researchers and within Meta AI, at the very least.