I do agree that METR’s horizon work is over-relied on: there are only a few data points, and there are reasons to believe the benchmark is biased towards tasks that require little context or memory, among other issues. That said, I think exponential growth in AI capabilities is very plausible a priori, and I wrote up a post on why this should generally be expected. One caveat is that doubling times can differ dramatically, so we need to make sure we aren’t over-extrapolating from a narrow selection of tasks. So I think METR’s observation of exponential growth is likely to generalize to messy tasks; it’s just that the time horizons and doubling times will be different.