I would guess that this is mainly because FLOP data is much more limited for closed models (especially recent ones), and because closed-model developers focus much less on models with small training FLOP (e.g. <1e25 FLOP).
Another consideration is that we tend to prioritize evaluations on frontier models, so coverage for smaller models is spottier.