there’s this https://github.com/Jellyfish042/uncheatable_eval
It’s good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. It would need to be restarted and extended with approximate estimators for the API/chatbot models before it could serve as a good universal capability benchmark in near-real time.
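For context, uncheatable_eval ranks models by how well they compress freshly crawled text (cross-entropy loss on data newer than any training cutoff). For API models that expose per-token log-probabilities, an approximate estimator could convert those into the same bits-per-byte metric. A minimal sketch, with `stub_score` standing in for a hypothetical API scoring call (real chat APIs vary in what logprobs they return, so this is an assumption, not a working client):

```python
import math

def bits_per_byte(logprobs, text):
    """Convert per-token natural-log probabilities into bits per UTF-8 byte,
    the compression-style metric uncheatable_eval-like benchmarks report."""
    total_nats = -sum(logprobs)
    return total_nats / (math.log(2) * len(text.encode("utf-8")))

def stub_score(text):
    # Hypothetical stand-in for an API logprob endpoint: pretend the model
    # assigns probability 0.5 to each whitespace-split "token".
    return [math.log(0.5)] * len(text.split())

text = "uncheatable eval scores fresh data"
bpb = bits_per_byte(stub_score(text), text)
```

Lower bits-per-byte means the model predicts (compresses) the fresh text better; because the data postdates training, it is hard to game by memorization.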
This is great! Would like to see a continually updating public leaderboard of this.