One of the most robust benchmarks of generalized ability, which is extremely easy to update (unlike benchmarks like Humanity’s Last Exam), would just be to estimate the pretraining loss (ie. the compression ratio).
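To make the metric concrete, here's a minimal sketch (assuming Python with HuggingFace `transformers`; `gpt2` is just a stand-in for whatever base model you're scoring) of the standard way to compute it: total negative log-likelihood on held-out text, converted to bits per byte, which is the compression ratio up to a constant:

```python
# Minimal sketch: score a model's compression of held-out text as
# bits-per-byte, i.e. total negative log-likelihood / byte count.
# Model name and sample text are placeholders; any HF causal LM works.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(model, tokenizer, text: str) -> float:
    """Average bits the model needs per byte of `text` (lower = better)."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=input_ids makes HF return mean cross-entropy (nats/token)
        out = model(input_ids, labels=input_ids)
    n_predicted = input_ids.shape[1] - 1       # first token has no prediction
    total_nats = out.loss.item() * n_predicted
    n_bytes = len(text.encode("utf-8"))
    return total_nats / math.log(2) / n_bytes  # nats -> bits, then per byte

tok = AutoTokenizer.from_pretrained("gpt2")    # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(bits_per_byte(model, tok, "Fresh held-out text goes here..."))
```

The key detail is scoring text published after the model's training cutoff, so the "compression" can't just be memorization.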
there’s this https://github.com/Jellyfish042/uncheatable_eval
It’s good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. So it needs to be started back up, with approximate estimators added for the API/chatbot models too, before it can start providing a good universal capability benchmark in near-realtime.
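As a rough sketch of what an approximate estimator could look like for API-only chat models (a guess at one workable approach, not anything uncheatable_eval does): where the API exposes per-token logprobs, you can force one-token continuations of successive prefixes of held-out text and credit the true next token's logprob when it appears in the top-k, with a penalty floor when it doesn't. The model id, prompt, and floor value below are all assumptions:

```python
# Hypothetical approximate compression estimator for API-only chat models:
# walk through held-out text, request a single-token continuation with
# top-k logprobs, and credit the true next token's logprob when it appears
# in the top-k (penalty floor otherwise). Assumes the OpenAI Python SDK;
# MODEL, the prompt, and FLOOR are guesses, not anyone's published method.
import math
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"     # placeholder model id
FLOOR = math.log(1e-6)    # assumed penalty when the true token isn't in top-k

def approx_bits_per_byte(text: str) -> float:
    # Crude whitespace "tokenization" on our side, since the API's BPE
    # tokenizer isn't visible; another reason the estimate is approximate.
    words = text.split()
    total_nats = 0.0
    for i in range(1, len(words)):
        prefix = " ".join(words[:i])
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user",
                       "content": "Continue this text verbatim:\n" + prefix}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=20,
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        target = " " + words[i]  # BPE tokens are often space-prefixed words
        match = next((t.logprob for t in top if t.token == target), None)
        total_nats += -(match if match is not None else FLOOR)
    n_bytes = len(text.encode("utf-8"))
    return total_nats / math.log(2) / n_bytes
```

It's expensive (one API call per position) and biased by the prompt framing and the top-k cutoff, but it should roughly preserve relative rankings between models, which is all a leaderboard needs.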
This is great! Would like to see a continually updating public leaderboard of this.
Thanks gwern, really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, since test-time scaling is a large factor in the intelligence LLMs exhibit.
I agree we should see a continuously maintained compression benchmark.