This is extremely informative, especially the bit about the holdout set. I think it’d reassure a lot of people about FrontierMath’s validity to know more here. Have you used the holdout set to assess any of OpenAI’s models? If so, how, and what were the results?