I think once you assume a logistic function, it's almost guaranteed that if a new model solves one additional task, it’s going to continue the log-linear trend.
No, whether it continues the log-linear trend depends on WHEN a new model solves one additional task.
Using the logistic function to fit task success probability vs task length is not load-bearing for the time horizon computation.
If the task lengths in the dataset were uniformly distributed (in log-space), you could just take the overall accuracy and look up the task length at that percentile in the data, and that would be nearly identical to the 50% horizon. This would replace the logistic assumption with a step-function assumption, but because the logistic is point-symmetric about its 50% point, you get roughly the same value.
Put differently: there is an interval over which the model goes from basically 100% to basically 0%, and the logistic just takes the point in the middle as the 50% horizon. Many other methods would also take the point in the middle (possibly less robustly, but that would just make the trend a little noisier).
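Here is a minimal sketch of that point (Python; not the paper's code, and the array names and helpers are mine): given per-task human completion times and a binary success flag, the 50% horizon from a logistic fit in log task length comes out close to a plain percentile lookup, provided task lengths are spread reasonably evenly in log-space.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_len, a, b):
    # Success probability as a function of log task length;
    # p = 0.5 exactly at log_len = b (for a > 0, longer tasks -> lower p).
    return 1.0 / (1.0 + np.exp(a * (log_len - b)))

def horizon_logistic(task_minutes, successes):
    """50% horizon (minutes) from a logistic fit of success vs log task length."""
    x = np.log(np.asarray(task_minutes, float))
    y = np.asarray(successes, float)
    (a, b), _ = curve_fit(logistic, x, y, p0=[1.0, np.median(x)])
    return np.exp(b)  # b is the log-length at which fitted p = 0.5

def horizon_percentile(task_minutes, successes):
    """Percentile lookup: overall accuracy -> quantile of task length."""
    acc = np.mean(successes)
    # Step-function assumption: the model succeeds on tasks shorter than its
    # horizon, so the acc-quantile of (log) task length approximates that horizon.
    return np.exp(np.quantile(np.log(np.asarray(task_minutes, float)), acc))
```

On any dataset where task lengths cover the transition region, the two estimates should land in roughly the same place; the logistic version is just the smoother, more robust way of picking the midpoint.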
I think your feeling that this is suspect comes more from the choice of log-space than from the choice of fitting function. It feels a bit circular to say “we’re going to assume that log task length is the natural way to look at model competence” and then get the result that, over calendar time, model competence improves linearly in that log-space. But I think it is this choice of log-space, not the logistic, that is motivated by the log-linear plot of model success rate vs. human time to complete.
Also, I would point out that the validity of the time horizons computed for current models doesn’t rest on just these 16 tasks, but on the preceding six-year trend plus replications of exponential trends in other datasets. It’s great to point out that current measurements have a ton of noise and are very gameable, but it’s hard to attack the conclusion of exponential progress in time horizons.
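To be concrete about what that conclusion means operationally, here is a small sketch (again mine, not the paper's code) of how one would read a doubling time off such a trend: regress log2 of the horizon on release date, with per-model (release year, horizon) pairs coming from whatever horizon estimator you prefer.

```python
import numpy as np

def doubling_time_years(release_years, horizon_minutes):
    """Fit log2(horizon) ~ slope * year + intercept; return years per doubling."""
    slope, _intercept = np.polyfit(np.asarray(release_years, float),
                                   np.log2(np.asarray(horizon_minutes, float)), 1)
    return 1.0 / slope  # years for the 50% horizon to double under the fit
```

The exponential-progress claim is exactly the claim that this fit is a good description of the data, and that the slope has stayed roughly stable across years and datasets.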
I skimmed the paper last week but I lost interest when I couldn’t find out which models were used.