Question from a conference discussion: do you have real-world-data examples shaped like this illustration of how I understand Goodhart's law?
For example, cases where new LLM versions improved on a benchmark published before their training but not on a benchmark published later. And if larger models generalize better to future benchmarks, does the typology here offer an insight into why? Does the bitter lesson of scale provide better calibration of structure, function, or randomness? Or, if the causes are so poorly understood that this typology cannot yet offer such an insight, what is missing? (Goodhart's law predicts that benchmark gaming should happen to "some extent", so improvements in our understanding of Goodhart's law should preserve that prediction, right?)
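Not real-world data, but a minimal synthetic sketch of the effect the question gestures at, assuming the "regressional" reading of Goodhart's law: a pre-published benchmark score is a noisy proxy for true (future-benchmark) capability, and selecting or optimizing on the proxy systematically overstates the true quantity. All names and parameters here are hypothetical choices for illustration.

```python
import random

random.seed(0)

# Hypothetical setup: each candidate model has a true capability t
# (think: score on a benchmark published AFTER training), and a
# noisy proxy p = t + noise (score on a benchmark published BEFORE
# training, which development could have adapted to).
N, K = 10_000, 100
candidates = []
for _ in range(N):
    t = random.gauss(0, 1)       # true, future-benchmark capability
    p = t + random.gauss(0, 1)   # pre-published benchmark score (proxy)
    candidates.append((p, t))

# Select the top K candidates by proxy score, as a leaderboard would.
top = sorted(candidates, reverse=True)[:K]
mean_proxy = sum(p for p, _ in top) / K
mean_true = sum(t for _, t in top) / K

# The selected candidates look better on the proxy than they are on
# the true measure: selection on a noisy proxy inflates the proxy,
# which is the "benchmark gaming to some extent" that Goodhart's law
# predicts even without any deliberate gaming.
print(f"mean proxy of selected: {mean_proxy:.2f}")
print(f"mean true  of selected: {mean_true:.2f}")
```

Under this toy model, the gap between the two means is the Goodhart effect; a real-world analogue would compare score gains on benchmarks published before versus after a model's training.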