Since publishing this, we have gotten two comments:
Someone pointed out that Mock AIME actually has human time annotations. We already had item-level agent success data from Epoch AI, so it joins GPQA Diamond in having high-quality data in both categories.
In May, Palisade found that the frontier time horizon on cyber offense CTFs is around 1 hour.
We’re looking into incorporating both of these if we find the data is compatible with ours!
I’m dropping this because it probably won’t change the conclusions much, and on the margin, new, more realistic evals seem like they would add more realism per unit of effort. The code is open-source, and if anyone writes the PR for me, I’ll probably merge it.
Another suggestion: https://cybench.github.io/
The issue with Cybench is that its difficulty annotations are “first solve time,” which we don’t know how to compare with the median or average solve time among experts. If we get better data, it could become comparable.
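To illustrate why the two numbers aren’t interchangeable, here’s a minimal sketch (not from our analysis; the log-normal assumption and all parameters are made up for illustration): if expert solve times are roughly log-normal, the first-solve time is the minimum over however many teams attempted the task, so it sits well below the median, and the gap grows with the number of attempts, which CTF writeups often don’t report.

```python
# Hypothetical illustration: first-solve time vs. median solve time.
# Assumes log-normal expert solve times; parameters are invented for the example.
import numpy as np

rng = np.random.default_rng(0)

median_minutes = 60.0   # assumed "true" median expert solve time (hypothetical)
sigma = 1.0             # assumed log-space spread of solve times (hypothetical)

for n_teams in (5, 20, 100):
    # Simulate many competitions, each with n_teams independent expert attempts.
    draws = rng.lognormal(mean=np.log(median_minutes), sigma=sigma,
                          size=(10_000, n_teams))
    mean_first_solve = draws.min(axis=1).mean()
    print(f"{n_teams:>3} teams: mean first-solve time ~ {mean_first_solve:5.1f} min "
          f"(median solve time = {median_minutes:.0f} min)")
```

Under these made-up parameters, the same underlying median yields very different first-solve times depending on how many teams attempted the task, which is why the annotation can’t be mapped onto our human-baseline times without more information.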