Broken Benchmark: MMLU

Phillip, who runs the AI Explained channel, has been testing his SmartGPT framework against the MMLU benchmark and discovered a significant number of issues with the problem set.

Among them:

  • Crucial context missing from questions (apparently due to copy-paste errors)

  • Ambiguous answer sets, where more than one option is defensible

  • Outright incorrect answer keys

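Some of these failure modes are mechanically detectable. As a rough illustration (not anything from the video), here's a minimal Python sketch of heuristics that could flag suspect items in an MMLU-style dataset. The item schema (question / choices / answer index) mirrors the common MMLU format, but the specific checks and thresholds are my own assumptions:

```python
# Hedged sketch: heuristics for flagging suspect MMLU-style items.
# The schema mirrors the common MMLU format; the checks and
# thresholds below are illustrative assumptions, not the analysis
# performed in the video.

def flag_suspect_items(items):
    """Yield (index, reason) pairs for items that look broken."""
    for i, item in enumerate(items):
        question = item["question"].strip()
        choices = item["choices"]

        # Possible copy-paste truncation: stem ends abruptly or
        # references context that isn't present.
        if question.endswith((":", "...")) or len(question) < 15:
            yield i, "question may be missing context"

        # Ambiguous answer set: two options are effectively identical.
        if len({c.strip().lower() for c in choices}) < len(choices):
            yield i, "duplicate answer choices"

        # Malformed key: answer index falls outside the choice list.
        if not 0 <= item["answer"] < len(choices):
            yield i, "answer key out of range"

# Toy usage on a deliberately broken item:
sample = [{"question": "Refer to the passage above:",
           "choices": ["A", "A", "B", "C"], "answer": 5}]
for idx, reason in flag_suspect_items(sample):
    print(idx, reason)
```

Heuristics like these would only catch the mechanical errors; the ambiguous and outright wrong answer keys still need human review, which is part of why a dedicated benchmarking effort matters.
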
He highlights a growing need for a dedicated benchmarking organization that can research and build accurate, robust, sensible evaluation suites for SOTA models.

I found this video super interesting and the findings very important, so I wanted to share it here.