Phillip over at the AI Explained channel has been running experiments with his SmartGPT framework on the MMLU benchmark and discovered a significant number of issues with the problem set:
- Crucial context missing from questions (apparently due to copy-paste errors; see the sketch after this list)
- Ambiguous answer sets
- Outright wrong answer sets
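To make the "missing context" failure mode concrete, here is a minimal sketch of how one might surface suspect MMLU items. This is my own rough heuristic, not Phillip's method: it assumes the Hugging Face `datasets` package and the `cais/mmlu` dataset, and simply flags questions that mention external material (a passage, table, figure) that the question text itself cannot contain.

```python
from datasets import load_dataset

# Phrases that usually point at external material the question should include.
# This keyword list is an illustrative guess, not an exhaustive check.
CONTEXT_MARKERS = ("passage", "the table", "the figure", "the diagram", "excerpt")

# Load the full MMLU test split from the Hugging Face hub.
ds = load_dataset("cais/mmlu", "all", split="test")

# Collect questions that reference context they may not actually provide.
suspects = [
    row for row in ds
    if any(marker in row["question"].lower() for marker in CONTEXT_MARKERS)
]

print(f"{len(suspects)} of {len(ds)} questions reference external context")
for row in suspects[:5]:
    print("-", row["question"][:120])
```

Anything this flags would still need a human look, since plenty of questions legitimately include their own passage or table. It's just a cheap first pass at the kind of dataset auditing the video argues for.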
He highlights a growing need for a dedicated benchmarking organization that can research and build accurate, robust, sensible benchmark suites for evaluating SOTA models.
I found the video super interesting and the findings important, so I wanted to share it here.