I agree it’s potentially a significant issue. One reason I’m less concerned about it is that the AAII scores for these models generally seem pretty reasonable. Another is that the results look pretty similar if we only look at more recent models (which by and large have AAII-run benchmarks). For example, starting in July 2024 yields a median of 1.22 OOMs and a weighted figure of 1.85 OOMs.
There are many areas for follow-up work, and this is one of them, but I don’t think it invalidates the overall results.