One of the most robust benchmarks of generalized ability, which is extremely easy to update (unlike benchmarks like Humanity’s Last Exam), would just be to estimate the pretraining loss (ie. the compression ratio).
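To make the metric concrete, here's a minimal sketch (assuming Python with HuggingFace `transformers`; `gpt2` is just a stand-in for whatever base model you're scoring) of the standard way to compute it: total negative log-likelihood on held-out text, converted to bits per byte, which is the compression ratio up to a constant:

```python
# Minimal sketch: score a model's compression of held-out text as
# bits-per-byte, i.e. total negative log-likelihood / byte count.
# Model name and sample text are placeholders; any HF causal LM works.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(model, tokenizer, text: str) -> float:
    """Average bits the model needs per byte of `text` (lower = better)."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=input_ids makes HF return mean cross-entropy (nats/token)
        out = model(input_ids, labels=input_ids)
    n_predicted = input_ids.shape[1] - 1       # first token has no prediction
    total_nats = out.loss.item() * n_predicted
    n_bytes = len(text.encode("utf-8"))
    return total_nats / math.log(2) / n_bytes  # nats -> bits, then per byte

tok = AutoTokenizer.from_pretrained("gpt2")    # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(bits_per_byte(model, tok, "Fresh held-out text goes here..."))
```

The key detail is scoring text published after the model's training cutoff, so the "compression" can't just be memorization.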
there’s this https://github.com/Jellyfish042/uncheatable_eval
It’s good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. So it needs to be started back up, with approximate estimators added for the API/chatbot models too, before it can start providing a good universal capability benchmark in near-realtime.
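As a rough sketch of what an approximate estimator could look like for API-only chat models (a guess at one workable approach, not anything uncheatable_eval does): where the API exposes per-token logprobs, you can force one-token continuations of successive prefixes of held-out text and credit the true next token's logprob when it appears in the top-k, with a penalty floor when it doesn't. The model id, prompt, and floor value below are all assumptions:

```python
# Hypothetical approximate compression estimator for API-only chat models:
# walk through held-out text, request a single-token continuation with
# top-k logprobs, and credit the true next token's logprob when it appears
# in the top-k (penalty floor otherwise). Assumes the OpenAI Python SDK;
# MODEL, the prompt, and FLOOR are guesses, not anyone's published method.
import math
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"     # placeholder model id
FLOOR = math.log(1e-6)    # assumed penalty when the true token isn't in top-k

def approx_bits_per_byte(text: str) -> float:
    # Crude whitespace "tokenization" on our side, since the API's BPE
    # tokenizer isn't visible; another reason the estimate is approximate.
    words = text.split()
    total_nats = 0.0
    for i in range(1, len(words)):
        prefix = " ".join(words[:i])
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user",
                       "content": "Continue this text verbatim:\n" + prefix}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=20,
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        target = " " + words[i]  # BPE tokens are often space-prefixed words
        match = next((t.logprob for t in top if t.token == target), None)
        total_nats += -(match if match is not None else FLOOR)
    n_bytes = len(text.encode("utf-8"))
    return total_nats / math.log(2) / n_bytes
```

It's expensive (one API call per position) and biased by the prompt framing and the top-k cutoff, but it should roughly preserve relative rankings between models, which is all a leaderboard needs.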
This is great! Would like to see a continually updating public leaderboard of this.
Thanks gwern, really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, since test-time scaling is a large factor in the intelligence LLMs exhibit.
I agree we should see a continuously maintained compression benchmark.