My colleagues and I were arguing about the nature of LLM intelligence and generalization. (In particular, they were talking about this paper: [2507.06952] What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models, and using Kepler/Newton as an example.) This is the only eval I know of that hits the question directly. If you want funding, make a nice website where people can sign up to play it (depending on your worries about data leakage; maybe you'd create a public-set version?), show a good leaderboard, and solicit donations. This feels like ARC-AGI-3, sorta. (OTOH, although scientifically interesting, this project might be doom-increasing. "Feeding evals to capabilities labs" and all that. If I were aiming for AGI, this is the benchmark I would hill-climb.)
Data contamination and handing the AI labs a good benchmark are very real worries, and they're basically the reason we haven't published Starburst yet (though we have sent it to a few people who wanted to try it). Providing a leaderboard and scores that can be used for projections, fortunately, carries neither of those risks, but still has benefits for forecasting, public awareness, etc. (obviously in proportion to how well-known it is). We may make a short version public, since that poses no data-contamination risk and less capability-boosting risk. Still undecided on that.
Regarding the paper, it's not exactly surprising to see that result from a 109M-parameter model trained on a narrow task.
We do already have a leaderboard. Maybe a dedicated website rather than Substack would be nicer.