Since publishing this, we have gotten two comments:
Someone pointed out that Mock AIME actually has human time annotations. We already had item-level agent success data from Epoch AI, so it joins GPQA Diamond in having high-quality data in both categories.
In May, Palisade found that the frontier time horizon on cyber offense CTFs is around 1 hour.
We’re looking into incorporating both of these if we find the data is compatible with ours!
I’m dropping this because it probably won’t change the conclusions much, and on the margin, new, more realistic evals seem like they would add more realism per unit of effort. The code is open-source, and if anyone writes the PR for me, I’ll probably merge it.
Another suggestion: https://cybench.github.io/
The issue with Cybench is that its difficulty annotations are “first solve time,” which we don’t know how to compare with the median or average solve time among experts. If we get better data, it could become comparable.
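To illustrate why the two numbers aren’t interchangeable, here’s a minimal sketch (not from our analysis; the log-normal assumption and all parameters are made up for illustration): if expert solve times are roughly log-normal, the first-solve time is the minimum over however many teams attempted the task, so it sits well below the median, and the gap grows with the number of attempts, which CTF writeups often don’t report.

```python
# Hypothetical illustration: first-solve time vs. median solve time.
# Assumes log-normal expert solve times; parameters are invented for the example.
import numpy as np

rng = np.random.default_rng(0)

median_minutes = 60.0   # assumed "true" median expert solve time (hypothetical)
sigma = 1.0             # assumed log-space spread of solve times (hypothetical)

for n_teams in (5, 20, 100):
    # Simulate many competitions, each with n_teams independent expert attempts.
    draws = rng.lognormal(mean=np.log(median_minutes), sigma=sigma,
                          size=(10_000, n_teams))
    mean_first_solve = draws.min(axis=1).mean()
    print(f"{n_teams:>3} teams: mean first-solve time ~ {mean_first_solve:5.1f} min "
          f"(median solve time = {median_minutes:.0f} min)")
```

Under these made-up parameters, the same underlying median yields very different first-solve times depending on how many teams attempted the task, which is why the annotation can’t be mapped onto our human-baseline times without more information.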