Accelerating AI safety research is critical for developing aligned systems, and transformative AI is a powerful way to do so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed for two core reasons:
- Benchmarks may not represent real-world ability
- Benchmark information can be leaked into AI model training
Instead, we can have a moving benchmark: reproducing the most recent technical alignment research using its methods and data.
By reproducing a paper, something many researchers do, we ensure the benchmark measures real-world ability.
By only using papers published after a given model finished training, we ensure the benchmark data has not leaked into its training set.
This comes with the inconvenience that comparing models or approaches across two papers is difficult because the benchmark is always changing, but I argue this is a worthwhile tradeoff:
- We can still compare approaches if the same papers are used for both, as long as every paper was published after all of the models finished training (a rough sketch of this selection rule is below)
- Benchmark data is always recent and relevant
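To make the selection rule concrete, here is a minimal sketch in Python, under the assumption that each model's training cutoff date is known; the Paper and Model classes are hypothetical placeholders for whatever metadata the benchmark actually tracks.

```python
# Minimal sketch: build a leak-free paper set for comparing models.
# Assumes each model's training cutoff date is known; Paper and Model
# are hypothetical stand-ins for the benchmark's real metadata.
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    title: str
    published: date

@dataclass
class Model:
    name: str
    training_cutoff: date  # date the model finished training

def eligible_papers(papers: list[Paper], models: list[Model]) -> list[Paper]:
    """Keep only papers published after every model finished training,
    so none of them could have leaked into any model's training data."""
    latest_cutoff = max(m.training_cutoff for m in models)
    return [p for p in papers if p.published > latest_cutoff]

# Comparing two models on the same leak-free paper set
models = [Model("model-a", date(2024, 6, 1)), Model("model-b", date(2024, 9, 1))]
papers = [
    Paper("New alignment paper", date(2024, 10, 12)),  # kept
    Paper("Older alignment paper", date(2024, 3, 2)),  # dropped: predates a cutoff
]
print([p.title for p in eligible_papers(papers, models)])  # ['New alignment paper']
```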
Any thoughts?
One of the most robust benchmarks of generalized ability, which is extremely easy to update (unlike benchmarks like Humanity’s Last Exam), would just be to estimate the pretraining loss (i.e., the compression ratio).
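As a concrete illustration, here is a minimal sketch of such an estimator for open-weight models, assuming token log-likelihoods are available via Hugging Face transformers; the model name and sample text are placeholders, and closed API models would need an approximate variant.

```python
# Minimal sketch: estimate a model's compression ratio as bits per byte of
# recent text. Lower is better: the model "compresses" the text more tightly.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any open-weight causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def bits_per_byte(text: str) -> float:
    """Total negative log-likelihood of the text, normalized by its byte length."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)  # fit the context window
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is mean cross-entropy in nats per predicted token
    total_nats = out.loss.item() * (enc["input_ids"].shape[1] - 1)
    return total_nats / math.log(2) / len(text.encode("utf-8"))

# Use text published after the model's training cutoff so it cannot have been memorized.
recent_text = "Paste freshly published text here..."
print(f"{bits_per_byte(recent_text):.3f} bits/byte")
```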
there’s this https://github.com/Jellyfish042/uncheatable_eval
It’s good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. So it needs to be restarted, with approximate estimators added for the API/chatbot models too, before it can start providing a good universal capability benchmark in near-realtime.
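One way such an approximate estimator could work, sketched here under the assumption that a provider exposes per-token log-probabilities for supplied text (the get_token_logprobs helper is hypothetical; many chat APIs only return log-probabilities for generated tokens, so a provider-specific workaround would be needed):

```python
import math

def approx_bits_per_byte(text: str, get_token_logprobs) -> float:
    """Approximate compression ratio for an API-served model.

    get_token_logprobs(text) is a hypothetical helper returning one natural-log
    probability per token of text, as scored by the API model.
    """
    logprobs = get_token_logprobs(text)        # nats, one per token
    total_bits = -sum(logprobs) / math.log(2)  # convert nats to bits
    return total_bits / len(text.encode("utf-8"))
```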
This is great! Would like to see a continually updating public leaderboard of this.
Thanks gwern, really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling is a large factor of the intelligence they exhibit.
I agree we should see a continuously updated compression benchmark.