That’s an interesting distinction.
I do think a pure case of 1 is rare, because what constitutes a specific task or capability is usually not contained or narrowly defined enough for a benchmark to sample from its whole distribution. Usually a benchmark samples only a small, verifiable part of the distribution of tasks of a given type, and one hopes that performance on that part correlates with the more general capability on that type of task.
Math is an interesting case. IMO problems, for example, basically form such a well-defined task/capability, and actual IMO problems are samples from this distribution. So the benchmark measures the capability directly, and the precision of the measurement could be increased indefinitely just by drawing new samples. Other subfields of math could perhaps be somewhat similar, where finding proofs for each of a certain class of theorems forms a comparably well-defined problem class, although I’m not very familiar with academic math.
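To make the "draw more samples" point concrete, here is a rough sketch of how the uncertainty of a measured solve rate shrinks with the number of problems. It assumes each problem is an independent draw from one fixed distribution and is scored pass/fail, which is a simplification; the function name and the 0.5 solve rate are made up for illustration.

```python
import math

def standard_error(pass_rate: float, n_problems: int) -> float:
    """Standard error of an observed pass rate over n independent pass/fail problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)

# Hypothetical solve rate of 0.5 (an assumption, not a measured value).
# Each step uses four times as many problems as the previous one.
for n in (6, 24, 96):
    print(n, standard_error(0.5, n))
```

Under these assumptions the error falls like 1/sqrt(n), so quadrupling the number of problems halves it. That only works when fresh problems really do come from the same distribution, which is exactly what most benchmarks can't guarantee.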