For some models (especially older ones), the Artificial Analysis Intelligence Index score is labeled “Estimate (independent evaluation forthcoming)”. It is unclear how these scores are determined, and they may not be reliable. The Artificial Analysis API does not clearly label such estimates, and I did not manually remove them for the secondary analysis. Ideally, the capability levels containing these models (probably mostly the lower levels) would be weighted less, but I don’t do this due to uncertainty about which models have Estimates versus independently tested scores.
IMO this is a potentially significant issue that this post should have spent more time addressing, since it means that the earlier sections of the trend lines are coming from a source we know nothing about.
I agree it’s potentially a significant issue. One reason I’m relatively less concerned is that the AAII scores for these models seem generally pretty reasonable. Another is that the results look pretty similar if we only look at more recent models (which by and large have AAII-run benchmark scores). E.g., restricting to models from July 2024 onward yields a median of 1.22 OOMs and a weighted estimate of 1.85 OOMs (a minimal sketch of this check is below).
There are many places for additional and follow-up work, and this is one of them, but I don’t think it invalidates the overall results.
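For concreteness, here is a minimal sketch of that robustness check. The dataframe schema (`release_date`, `ooms_per_year`, `weight`) and the toy values are hypothetical stand-ins for whatever per-model estimates the actual analysis uses:

```python
import pandas as pd

# Hypothetical per-model data: release date, estimated OOMs/year of
# effective-compute gain, and a weight. All values here are made up.
models = pd.DataFrame({
    "model": ["a", "b", "c", "d"],
    "release_date": pd.to_datetime(
        ["2024-02-01", "2024-08-15", "2024-11-30", "2025-03-01"]
    ),
    "ooms_per_year": [0.9, 1.1, 1.4, 2.0],
    "weight": [0.5, 1.0, 1.5, 2.0],
})

# Robustness check: keep only models released from July 2024 onward,
# which by and large have independently run AAII benchmark scores.
recent = models[models["release_date"] >= "2024-07-01"]

median_ooms = recent["ooms_per_year"].median()
weighted_ooms = (
    (recent["ooms_per_year"] * recent["weight"]).sum()
    / recent["weight"].sum()
)

print(f"median: {median_ooms:.2f} OOMs, weighted: {weighted_ooms:.2f} OOMs")
```

If the median and weighted estimates from this restricted sample land close to the full-sample numbers, that suggests the Estimate-labeled scores in the earlier part of the trend are not driving the results.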