Yes, sorry for just dropping in with “I have a model that gives different results” without actually giving the details. I’m trying to get a minimal version of it written up (I had designed it to integrate into METR’s codebase, so I need to extract it into something that can exist standalone).
Within the runs.json there is a (not especially clearly named) ‘human_source’ field for each row. If this is set to “baseline” then the task length is based on (one or more) human baseliners; if it is “estimate” then it was just estimated without any human actually finishing the task. These estimates are generally quite noisy: I believe somebody told me that, for the tasks where they had both estimates and baseliner times, only around 60% of the estimates were within a factor of 3 of the (average) baseliner times.
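For concreteness, selecting only the baselined tasks is essentially a one-liner. Here is a minimal sketch assuming the runs.json structure described above; the exact file layout and anything beyond the ‘human_source’ field are my assumptions, not METR’s schema:

```python
import pandas as pd

# Load the runs file (the real file may need lines=True or a different orientation).
runs = pd.read_json("runs.json")

# Task lengths backed by actual human baseliners vs. lengths that were only estimated.
baselined = runs[runs["human_source"] == "baseline"]
estimated = runs[runs["human_source"] == "estimate"]

print(f"{len(baselined)} baselined runs, {len(estimated)} estimated runs")
```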
Because you have a unified sigma parameter for how difficulty-for-LLM differs from log(task_length), this ends up incorporating the estimate noise as an additional source of uncertainty. But if you define the p-time-horizon, as I did in my first comment, on baselined tasks only, then the two approaches lead to different results for the 80% time horizons.
I chatted with Thomas a bit about this, and I too agree that the default METR model should also output things that are close to the ‘marginal’ definition of time horizon (or at least as close as it can be approximated with the inverse-logit sigmoid).
I think the important thing to realise is that while one needs to take additional steps to get the ‘marginal’ quantity when fitting a model that explicitly accounts for the deviation of task-length-for-humans from task-difficulty-for-LLMs, models that don’t explicitly account for this (such as the original METR model) should have it naturally learned into the shape of their logistic curve.
(A similar thing is also true when the discrimination parameter varies by task instead of by model: if it varies by task, that uncertainty needs to be accounted for in the time horizon calculations, but since this is not the case in the original METR model the issue doesn’t arise there.)
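To make the ‘marginal’ calculation concrete, here is a minimal sketch of what integrating over a task-level deviation looks like. The parameterisation, the numbers (a_m, b, sigma), and the function names are purely illustrative, not the fitted values or code from either model:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit, logit

# Illustrative conditional success curve:
#   P(success | log2 task length = x, task deviation = eps) = sigmoid(a_m - b * (x + eps)),
# with eps ~ N(0, sigma) capturing how difficulty-for-LLMs deviates from log(task length).
a_m, b, sigma = 6.0, 1.0, 1.5  # made-up values for one model; x is log2(minutes)

def marginal_success(x, n_grid=2001):
    """Average the conditional curve over the task-level deviation eps."""
    eps = np.linspace(-6 * sigma, 6 * sigma, n_grid)
    weights = np.exp(-0.5 * (eps / sigma) ** 2)
    weights /= weights.sum()
    return float(np.sum(expit(a_m - b * (x + eps)) * weights))

def marginal_horizon(p):
    """Task length (minutes) at which the *marginal* success curve crosses p."""
    x_star = brentq(lambda x: marginal_success(x) - p, -20.0, 40.0)
    return 2.0 ** x_star

conditional_horizon = 2.0 ** ((a_m - logit(0.8)) / b)  # ignores sigma entirely
print(conditional_horizon, marginal_horizon(0.8))
```

The `conditional_horizon` line is what you get if you read the 80% point straight off the per-task curve and ignore sigma; with symmetric noise the marginal curve is flatter, so its 80% crossing sits at a shorter task length.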
Good catch! Edited my comment. It had been a while since I had looked at the results and I must have also lost the ability to read in the meantime.
Thanks for the great writeup!
I’m a statistician who does some work with METR, and I recently worked on a very similar project to create a Bayesian version of the Time Horizon model. Mine ended up being somewhat different to yours (it deviates a bit more from the current structure of the METR model), but it’s great to see other people stress-testing the modelling.
On the 80% Time Horizon results, I agree that your ‘marginal’ approach is correct, and it is the one I also took in my model. However, my 80% results ended up being a factor of 2 higher than the results of METR’s current model for recent SOTA LLMs. Here is a quick plot I made just after the Opus 4.5 results came out, using the TH1.0 data:

I think there is some natural increase due to how my model’s data is selected, however, as my 50% time horizons are also often somewhat higher, and they are mostly within the uncertainty bounds anyway:
I’ve taken a very quick look through your code to try to work out where the difference comes from, and my guess would be that you find LLM-difficulty diverges more from log(baseliner_time) than I do, because you include the tasks with estimated baseliner times when calculating the amount of noise, whereas I handled tasks with and without baseliner times separately, and only used the former when doing the time horizon calculations.
My definition was: “For an LLM m, I define its ‘p time horizon’ as the delta such that LLM m has (expected) probability p of success on a single attempt at a task with baseliner time delta.” We might expect different results for tasks with estimated rather than baselined task lengths because the estimates effectively add another layer of noise.
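Written out, and assuming for illustration that the success curve is a plain inverse-logit in the log2 baseliner time (not necessarily the exact parameterisation in either of our models), this is

$$\text{horizon}_p(m) = \delta \quad\text{such that}\quad \Pr_m\big(\text{success}\mid \text{baseliner time}=\delta\big) = p,$$

which, for a fitted curve $\Pr_m(\text{success}\mid t)=\sigma(\alpha_m-\beta_m\log_2 t)$, solves to $\text{horizon}_p(m)=2^{(\alpha_m-\operatorname{logit}(p))/\beta_m}$.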
(I’ll note that out of all the critiques of the Time Horizon work, I’m surprised I don’t see more discussion of the tasks which only have estimates, as this seems like one of the most straightforward limitations, and one which will only get more relevant as tasks get longer and harder to baseline. Something like only 5/30 of the longest tasks currently have baseliner times!)
I’d love to chat more about Bayesian modelling and general thinking about these kinds of models sometime, and thanks again for the interesting analysis.
While I think it is plausible the results would have been different if the devs had had e.g. 100 hours more experience with Cursor, it is also worth noting that:
- 14/16 of the devs rated themselves as ‘average’ or better Cursor users at the end of the study
- The METR staff working on the project thought the devs were qualitatively reasonable Cursor users (based on screen recordings etc.)
So I think it is unlikely the devs were using Cursor in an unusually unskilled way.
The forecasters were told that only 25% of the devs had prior Cursor experience (the actual number ended up being 44%), and they still predicted a substantial speedup, so if there is a steep Cursor learning curve here, that seems like a fact people didn’t expect.
With all that being said, the skill ceiling for using AI tools is clearly at least *not being slowed down* (since the devs could simply not use the AI tools), so it would be reasonable to expect that some level of experience would eventually lead to that result.
(I consulted with METR on the stats in the paper, so am quite familiar with it).
I think we agree and I just stated this badly. I was just trying to say that METR’s original approach is closer to marginal despite them not explicitly integrating over random effects (although I agree you do need to integrate over the random effects in models that include them to get the marginal time horizon).