Accelerating AI safety research is critical for developing aligned systems, and transformative AI is a powerful way to do so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed for two core reasons:
- Benchmarks may not represent real-world ability
- Benchmark information can be leaked into AI model training
Instead, we can have a moving benchmark: reproducing the most recent technical alignment research using its methods and data.
By reproducing a paper, something many researchers do, we ensure the benchmark measures real-world ability.
By only using papers published after a given model finished training, we ensure the benchmark data has not leaked into its training set.
This comes with the inconvenience that comparing models or approaches across two papers is difficult because the benchmark is always changing, but I argue this is a worthwhile tradeoff:
- We can still compare approaches if the same papers are used for both, as long as every paper was published after all of the models finished training (a rough sketch of this selection rule is below)
- Benchmark data is always recent and relevant
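To make the selection rule concrete, here is a minimal sketch in Python, under the assumption that each model's training cutoff date is known; the Paper and Model classes are hypothetical placeholders for whatever metadata the benchmark actually tracks.

```python
# Minimal sketch: build a leak-free paper set for comparing models.
# Assumes each model's training cutoff date is known; Paper and Model
# are hypothetical stand-ins for the benchmark's real metadata.
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    title: str
    published: date

@dataclass
class Model:
    name: str
    training_cutoff: date  # date the model finished training

def eligible_papers(papers: list[Paper], models: list[Model]) -> list[Paper]:
    """Keep only papers published after every model finished training,
    so none of them could have leaked into any model's training data."""
    latest_cutoff = max(m.training_cutoff for m in models)
    return [p for p in papers if p.published > latest_cutoff]

# Comparing two models on the same leak-free paper set
models = [Model("model-a", date(2024, 6, 1)), Model("model-b", date(2024, 9, 1))]
papers = [
    Paper("New alignment paper", date(2024, 10, 12)),  # kept
    Paper("Older alignment paper", date(2024, 3, 2)),  # dropped: predates a cutoff
]
print([p.title for p in eligible_papers(papers, models)])  # ['New alignment paper']
```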
Any thoughts?
One of the most robust benchmarks of generalized ability, which is extremely easy to update (unlike benchmarks like Humanity’s Last Exam), would just be to estimate the pretraining loss (i.e., the compression ratio).
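As a concrete illustration, here is a minimal sketch of such an estimator for open-weight models, assuming token log-likelihoods are available via Hugging Face transformers; the model name and sample text are placeholders, and closed API models would need an approximate variant.

```python
# Minimal sketch: estimate a model's compression ratio as bits per byte of
# recent text. Lower is better: the model "compresses" the text more tightly.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any open-weight causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def bits_per_byte(text: str) -> float:
    """Total negative log-likelihood of the text, normalized by its byte length."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)  # fit the context window
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is mean cross-entropy in nats per predicted token
    total_nats = out.loss.item() * (enc["input_ids"].shape[1] - 1)
    return total_nats / math.log(2) / len(text.encode("utf-8"))

# Use text published after the model's training cutoff so it cannot have been memorized.
recent_text = "Paste freshly published text here..."
print(f"{bits_per_byte(recent_text):.3f} bits/byte")
```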
there’s this https://github.com/Jellyfish042/uncheatable_eval
It’s good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. So it needs to be restarted, with approximate estimators added for the API/chatbot models too, before it can start providing a good universal capability benchmark in near-realtime.
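One way such an approximate estimator could work, sketched here under the assumption that a provider exposes per-token log-probabilities for supplied text (the get_token_logprobs helper is hypothetical; many chat APIs only return log-probabilities for generated tokens, so a provider-specific workaround would be needed):

```python
import math

def approx_bits_per_byte(text: str, get_token_logprobs) -> float:
    """Approximate compression ratio for an API-served model.

    get_token_logprobs(text) is a hypothetical helper returning one natural-log
    probability per token of text, as scored by the API model.
    """
    logprobs = get_token_logprobs(text)        # nats, one per token
    total_bits = -sum(logprobs) / math.log(2)  # convert nats to bits
    return total_bits / len(text.encode("utf-8"))
```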
This is great! Would like to see a continually updating public leaderboard of this.
Thanks gwern, really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling is a large factor of the intelligence they exhibit.
I agree we should see a continuously updated compression benchmark.