Since publishing this, we have gotten two comments:
Someone pointed out that Mock AIME actually has human time annotations. We already had item-level agent success data from Epoch AI, so it joins GPQA Diamond in having high-quality data in both categories.
In May, Palisade found that the frontier time horizon on cyber offense CTFs is around 1 hour.
We’re looking into incorporating both of these if we find the data is compatible with ours!
Another suggestion was Cybench: https://cybench.github.io/
The issue with Cybench is that its difficulty annotations are “first solve time”, which we don’t know how to compare with the median or average solve time among experts. If we get better data, it could become comparable.
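To make the comparability issue concrete, here is a minimal simulation sketch (all parameters are illustrative assumptions, not Cybench data). First solve time is the minimum over however many teams attempted a task, so the same underlying median solve time can produce very different first-solve times depending on how many teams competed:

```python
import numpy as np

# Hypothetical parameters, for illustration only: lognormal solve times
# with a true median of 60 minutes and spread sigma.
rng = np.random.default_rng(0)
median_minutes = 60.0
sigma = 1.0

for n_teams in [5, 20, 100, 500]:
    # Simulate 10,000 competitions: each team's solve time is lognormal,
    # and "first solve time" is the minimum across teams.
    times = rng.lognormal(mean=np.log(median_minutes), sigma=sigma,
                          size=(10_000, n_teams))
    mean_first_solve = times.min(axis=1).mean()
    print(f"{n_teams:>4} teams: mean first-solve time ≈ "
          f"{mean_first_solve:5.1f} min (true median = {median_minutes:.0f})")
```

Inverting a first-solve statistic back to a median would require knowing (or assuming) the number of active teams and the shape of the solve-time distribution, which is why we’d want better data before treating the two as comparable.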