Good point. This does update me downward on Deep Think outperforming matharena’s gemini-2.5-pro IMO run, since it is possible that Deep Think was already doing a similar selection process internally. It is difficult to know without randomly sampling gemini-2.5-pro’s answers and seeing how much the best-of-n selection lifted its score.