I agree, a lot of my uncertainty is about its external validity, and about the degree to which the models are being bench-maxed for the tasks in the benchmark. But I still think it’s reasonable to expect the statistical confidence intervals for individual models to be narrower than a factor of 10. It’s important to be able to distinguish possible changes in the trend from statistical artifacts. This seems solvable with additional tasks and more human testing.