Epoch AI’s new evaluation for Gemini 2.5 Flash Preview is broken.
On their AI Benchmarking dashboard, the newest Gemini 2.5 Flash model is listed with an accuracy of 4% ± 0.71% on GPQA Diamond, even though Google’s official announcement reports over 80%, and even though GPQA is a multiple-choice test with four options, so random guessing alone would score around 25%.
It’s because of formatting issues. Helpfully, Epoch provides the logs from the evaluation, and the model simply hasn’t been responding in the format the scorer expects.
For example, if you look at the first sample from the logs, the correct answer is listed as “B”, but the model answered in LaTeX, $\boxed{B}$, so it was scored incorrect. There are plenty of other examples like this.
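To make the failure mode concrete, here’s a minimal sketch (in Python, and not Epoch’s actual scoring code) of how a strict exact-match check scores $\boxed{B}$ against the target “B”, compared with a slightly more forgiving extraction step:

```python
import re

# Illustrative sketch only -- this is not Epoch's actual scorer.
# It shows how a strict exact-match check rejects a LaTeX-wrapped answer
# that a slightly more forgiving extraction step would accept.

def strict_score(response: str, target: str) -> bool:
    # The response must be exactly the target letter, e.g. "B".
    return response.strip() == target

def lenient_score(response: str, target: str) -> bool:
    # Also accept answers wrapped in \boxed{...}, or fall back to the
    # last bare option letter that appears in the response.
    boxed = re.search(r"\\boxed\{([A-D])\}", response)
    if boxed:
        return boxed.group(1) == target
    letters = re.findall(r"\b([A-D])\b", response)
    return bool(letters) and letters[-1] == target

print(strict_score(r"$\boxed{B}$", "B"))   # False: marked incorrect
print(lenient_score(r"$\boxed{B}$", "B"))  # True: answer recovered
```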
[I work at Epoch AI]
Thanks for your comment, I’m happy you found the logs helpful! I wouldn’t call the evaluation broken—the prompt clearly states the desired format, which the model fails to follow. We mention this in our Methodology section and FAQ (“Why do some models underperform the random baseline?”), but I think we’re also going to add a clarifying note about it in the tooltip.
While I do think “how well do models respect the formatting instructions in the prompt” is also valuable to know, I agree that I’d want to disentangle that from “how good are models at reasoning about the question”. Adding a second, more flexible scorer (likely based on an LLM judge, like the one we have for OTIS Mock AIME) is in our queue; we’re just pretty strapped on engineering capacity at the moment :)
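For concreteness, the rough shape of such a scorer might look like the sketch below (purely illustrative; the prompt wording, grade format, and interface are placeholders, not our actual implementation):

```python
# Purely illustrative sketch of an LLM-judge scorer -- the prompt wording,
# grade format, and `complete` interface are placeholders, not real code.

JUDGE_PROMPT = """You are grading an answer to a multiple-choice question.
Question: {question}
Correct answer: {target}
Model response: {response}
Does the response indicate the correct answer, ignoring formatting?
Reply with exactly "GRADE: C" (correct) or "GRADE: I" (incorrect)."""

def judge_score(question: str, response: str, target: str, complete) -> bool:
    """`complete` is any callable that sends a prompt to the judge model
    and returns its text reply (e.g. a thin wrapper around a chat API)."""
    reply = complete(JUDGE_PROMPT.format(
        question=question, target=target, response=response))
    return "GRADE: C" in reply.upper()
```

The point is that a judge like this grades whether the answer is substantively correct, rather than whether it matches a surface format, which is what separates “can the model follow formatting instructions” from “can it reason about the question”.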
ETA: since it’s particularly extreme in this case, I plan to hide this evaluation until we have the new scorer added.