Benchmarks sound like a way to see how well LLMs hold up against reality, but I don’t think they really do. Solving SAT problems or Math Olympiad problems only requires deep thinking if you haven’t already seen millions of broadly similar math problems.
Benchmarks are a best-case analysis of model capabilities. A lot of companies optimize for benchmark scores, but is this inherently bad? If the process is economically valuable and repetitive, I don’t care how the LLM gets it done, even if it is memorizing the steps.
I think benchmarks give a misleading impression of AI capabilities. They make it seem like models are on the verge of being as smart as humans, and ready to take on a bunch of economically valuable activity that they’re not, leading to problems we’re already seeing, like bosses pushing their employees to use LLMs.