Hi, thanks for engaging with our work (and for contributing a long task!).
One thing to bear in mind with the long tasks paper is that we have different degrees of confidence in different claims. We are more confident in there being (1) some kind of exponential trend on our tasks, than we are in (2) the precise doubling time for model time horizons, than we are in (3) the exact time horizons on these tasks, than we are in (4) the degree to which any of the above generalizes to ‘real’ software tasks.
When I tried redoing it on fully_private tasks only, the Singularity was rescheduled for mid-2039 (or mid-2032 if you drop everything pre-GPT-4o)
I’m not sure exactly what quantity you are calculating when you refer to the singularity date. Is this the extrapolated date for 50% success at 1 month (167hr) tasks? (If so, I think using the term singularity for this isn’t accurate.)
More importantly though, if your mid-2039 result is referring to 50% @ 167 hours, that would be surprising! If our results were this sensitive to task privacy then I think we would need to rethink our task suite and analysis. For context, in appendix D of the paper we do a few different methodological ablations, and I don’t think any of them produced results as far from our mainline estimate as 2039.
My current guess is that your 2039 result is mostly driven by your task filtering selectively removing the easiest tasks, which has downstream effects that significantly increase the noise in your time horizon vs release date fits. If you bootstrap, what sort of 95% CIs do you get?
The paper uses SWAA + a subset of HCAST + RE-Bench as the task suite. Filtering out non-HCAST tasks removes all the <1 minute tasks (all <1 min tasks are SWAA tasks). SWAA fits most naturally into the ‘fully_private’ category in the HCAST parlance.[1] Removing SWAA means pre-2023 models fail all tasks (as you observed), which means a logistic can’t be fit to the pre-2023 models,[2] and so the pre-2023 models end up with undefined time horizons. If we instead remove these pre-2023 models, the log(50% time horizon) vs release date fit becomes much more sensitive to noise in the remaining points, in large part because the date range of the data goes from spanning ~6 years (2019-2025) to spanning ~2 years (2023-2025).
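To make the undefined-horizon point concrete, here is a minimal sketch (in Python, with made-up task lengths and outcomes, and function names of my own invention, not our actual analysis code) of the per-model fit the 50% time horizon comes from: a logistic of success against log task length, whose 50% crossing simply does not exist for a model that fails everything it is given.

```python
# Minimal sketch (toy numbers, not METR's actual code): fit a per-model
# logistic of success against log2(task length) and read off the length
# at which predicted success drops to 50%.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log2_len, horizon_log2, slope):
    # P(success) for a task of given log2 length; equals 0.5 at horizon_log2.
    return 1.0 / (1.0 + np.exp(slope * (log2_len - horizon_log2)))

def fifty_percent_horizon(task_minutes, successes):
    successes = np.asarray(successes, dtype=float)
    # Degenerate case: a model that fails (or passes) everything gives the
    # fit no 50% crossing to locate, so its horizon is undefined.
    if successes.sum() in (0, len(successes)):
        return np.nan
    log2_len = np.log2(task_minutes)
    popt, _ = curve_fit(logistic, log2_len, successes,
                        p0=[np.median(log2_len), 1.0])
    return 2.0 ** popt[0]  # horizon in minutes

# Toy data for one model: task lengths in minutes, 0/1 outcomes.
lengths = np.array([0.5, 1, 2, 4, 8, 15, 30, 60, 120, 240])
outcomes = np.array([1, 1, 1, 1, 0, 1, 0, 1, 0, 0])
print(fifty_percent_horizon(lengths, outcomes))  # roughly half an hour here

# With the shortest tasks filtered out, an old model fails everything that
# remains and its horizon comes out undefined:
print(fifty_percent_horizon(lengths[5:], np.zeros(5)))  # nan
```

Our real fitting code differs in the details, but the degenerate all-failures case bites in exactly the same way, and dropping those models is what shrinks the date range of the remaining fit.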
In fact, below is an extremely early version of the time horizon vs release date graph that we made using just HCAST tasks, fitted on some models released within 6 months of each other. Extrapolating this data out just 12 months results in 95% CIs of 50% time horizons being somewhere between 30 minutes and 2 years! Getting data for pre-2023 models via SWAA helped a lot with this problem, and this is in fact why we constructed the SWAA dataset and needed to test pre-GPT-4 models.
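On the bootstrap question: a crude version, which resamples the fitted (release date, 50% horizon) points directly (with invented numbers rather than our estimates), looks something like the following and reports a 95% CI on the date at which the fitted trend crosses a target horizon.

```python
# Crude bootstrap sketch (invented numbers, not METR's estimates): resample
# the (release date, 50% horizon) points, refit log2(horizon) vs date each
# time, and collect the date at which each refit crosses a target horizon.
import numpy as np

rng = np.random.default_rng(0)

dates = np.array([2019.2, 2020.5, 2022.2, 2023.2, 2023.9, 2024.4, 2025.1])
horizons_min = np.array([0.1, 0.6, 2.0, 8.0, 25.0, 35.0, 90.0])  # minutes

target_min = 167 * 60  # 50% success at one work-month (167 hr) tasks

def crossing_date(d, h):
    slope, intercept = np.polyfit(d, np.log2(h), 1)  # log-linear trend
    return (np.log2(target_min) - intercept) / slope

point = crossing_date(dates, horizons_min)

boot = []
for _ in range(10_000):
    idx = rng.integers(0, len(dates), len(dates))  # resample with replacement
    if np.unique(dates[idx]).size < 2:
        continue  # a single-date resample can't pin down a slope
    boot.append(crossing_date(dates[idx], horizons_min[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"point estimate {point:.1f}, 95% CI ({lo:.1f}, {hi:.1f})")
```

Resampling a handful of summary points is a blunt instrument compared to resampling the underlying runs, so treat this as showing the shape of the calculation rather than as the paper's procedure.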
SWAA was made by METR researchers specifically for the long tasks paper, and we haven’t published the problems.
For RE-Bench, we are about as confident as we can be that the RE-Bench tasks are public_problem-level tasks, since we also created them in-house. We zip the solutions in the RE-Bench repo to prevent scraping, ask people not to train on RE-Bench (except in the context of development and implementation of dangerous capability evaluations—e.g. elicitation), and ask people to report if they see copies of unzipped solutions online.

(I’d guess the terminal warnings you were seeing were about failures to converge or zero-variance data.)
I’m not sure exactly what quantity you are calculating when you refer to the singularity date. Is this the extrapolated date for 50% success at 1 month (167hr) tasks?
…not quite: I’d forgotten that your threshold was a man-month, instead of a month of clock time. I’ll redo things with the task length being a month of work for people who do need to eat/sleep/etc.; luckily this doesn’t change the results much, since 730 hours and 167 hours are right next door on a log(t) scale.
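Back-of-the-envelope for the “right next door” claim (plugging in a roughly 7-month doubling time, of the kind the paper reports, purely for scale):

```python
# Moving the threshold from 167 hours (a work-month) to 730 hours (a
# calendar month) only costs a couple of doublings of the time horizon.
import math

extra_doublings = math.log2(730 / 167)
print(f"extra doublings: {extra_doublings:.2f}")  # ~2.13

# If horizons double every d months, the crossing date shifts by roughly
# extra_doublings * d months; e.g. at ~7 months per doubling that's a
# shift on the order of 15 months.
for d_months in (4, 7, 12):
    print(f"doubling every {d_months} months -> "
          f"threshold change costs {extra_doublings * d_months:.1f} months")
```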
SWAA fits most naturally into the ‘fully_private’ category in the HCAST parlance.
Your diagnosis was on the money. Filtering for the union of fully_private HCAST tasks and SWAA tasks (while keeping the three models which caused crashes without SWAAs) does still make the forecasts more optimistic, but only nets half an extra year for the fit on every model, and two extra years for the fit on models since GPT-4o.
I’ll edit the OP appropriately; thank you for your help. (In retrospect, I probably should have run the numbery stuff past METR before posting, instead of just my qualitative concerns; I figured that if I was successfully reproducing the headline results I would be getting everything else right, but it would still have made sense to get a second opinion.)