gwern comments on sarahconstantin’s Shortform

gwern 30 Jan 2025 20:28 UTC
21 points

then I think it is also very questionable whether the AI that wins wars is the most “advanced” AI. / People like Dario whose bread-and-butter is model performance invariably over-index on model performance, especially on benchmarks. But practical value comes from things besides the model; what tasks you use it for and how effective you are at deploying it.

Dario is about the last AI CEO you should be making this criticism of. Claude has been notable for a while as the model which somehow winds up being the most useful and having the best ‘vibes’, even when the benchmarks indicate it’s #2 or #3; meanwhile, it is the Chinese models which historically regress the most from their benchmarks when applied (and DeepSeek models, while not as bad as the rest, still do this, and r1 is already looking shakier as people try it out on held-out problems or benchmarks).