Phillip over at the AI Explained channel has been running experiments with his SmartGPT framework on the MMLU benchmark and discovered a significant number of issues with the problem set:
- Crucial context missing from questions (apparently due to copy-paste errors; see the sketch after this list)
- Ambiguous answer sets
- Outright wrong answer sets
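To make the "missing context" failure mode concrete, here is a minimal sketch of how one might surface suspect MMLU items. This is my own rough heuristic, not Phillip's method: it assumes the Hugging Face `datasets` package and the `cais/mmlu` dataset, and simply flags questions that mention external material (a passage, table, figure) that the question text itself cannot contain.

```python
from datasets import load_dataset

# Phrases that usually point at external material the question should include.
# This keyword list is an illustrative guess, not an exhaustive check.
CONTEXT_MARKERS = ("passage", "the table", "the figure", "the diagram", "excerpt")

# Load the full MMLU test split from the Hugging Face hub.
ds = load_dataset("cais/mmlu", "all", split="test")

# Collect questions that reference context they may not actually provide.
suspects = [
    row for row in ds
    if any(marker in row["question"].lower() for marker in CONTEXT_MARKERS)
]

print(f"{len(suspects)} of {len(ds)} questions reference external context")
for row in suspects[:5]:
    print("-", row["question"][:120])
```

Anything this flags would still need a human look, since plenty of questions legitimately include their own passage or table. It's just a cheap first pass at the kind of dataset auditing the video argues for.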
He highlights a growing need for a dedicated benchmarking organization that can research and build accurate, robust, sensible benchmark suites for evaluating SOTA models.
I found the video super interesting and the findings important, so I wanted to share it here.