Elliot Glazer comments on Don’t over-update on FrontierMath results

Elliot Glazer 13 Mar 2025 18:26 UTC
2 points
1
Yes, the privacy constraints make the implications of these improvements less legible to the public. We have multiple plans for how to disseminate info within this constraint, such as publishing author survey comments on the reasoning traces and holding our competition at the end of the month to establish a sort of human baseline.
Still, I don’t know that the privacy of FrontierMath is worth all the roundabout efforts we must engage in to explain it. For future projects, I would be interested in other approaches that balance keeping models from training on public discussion of the problems against being able to clearly show the world what the models are tackling. Maybe it would be feasible to do IMO-style releases? “Here are 30 new problems we collected this month. We will immediately test all the models and then make the problems public.”
- David Matolcsi 15 Mar 2025 9:53 UTC
  5 points
  0
  I like the idea of IMO-style releases: always collecting new problems, testing the AIs on them, then releasing them to the public. How important do you think it is to only have problems with numerical solutions? If you can test the AIs on problems with proofs, then there are already many competitions that regularly release high-quality problems. (I’m shilling KöMaL again as one that’s especially close to my heart, but there are many good monthly competitions around the world.) I think if we instruct the AI to present its solution in one page at the end, then it’s not that hard to get an experienced competition grader to read the solution and score it according to the competition’s normal scoring scheme, so the result won’t be much less objective than if the problems had only numerical solutions. If you want to stick to problems with numerical solutions, I’m worried that you will have a hard time regularly assembling high-quality numerical problems, and even if the problems are released publicly, people will have a harder time evaluating them than if they actually came from a competition where we can compare to the natural human baseline of the competing students.