Some benchmarks became saturated across this range, so conversely we can imagine "anti-saturated" benchmarks whose scores haven't yet noticeably moved from zero, operationalizing the intuition of a lack of progress. Performance on such benchmarks still has room to change significantly even with near-term pretraining scaling, from the 1e26 FLOPs of currently deployed models to 5e28 FLOPs by 2028, a 500x increase.