Your link to ChatGPT’s analysis is not loading for me, but I don’t really trust ChatGPT to do this without mistakes anyway. When I run this regression in Python statsmodels with METR’s official values and release dates from their GitHub (plus the recent GPT-5 models), it is statistically significant for this OpenAI instruct-model trend (p ≈ .04), and it gets more significant once we include Claude 4.5. I don’t see which dates you are using, but your beta values have a fair amount of (rounding?) error compared to METR’s data, and the model lists aren’t the same, e.g. they didn’t include non-frontier mini models. (Btw, note that one-sided p-values are appropriate in this case, since only a positive slope would be treated as evidence of catching up.)
But again, the main question from that particular trend/argument was whether LLMs were catching up with the human-baseline slope, which Claude has now basically done for the existing baselines, so that seems pretty strongly confirmatory. It’s true that Claude has a somewhat worse logistic intercept than the GPT-5 models (associated with somewhat worse short-task reliability), but it is still better than the human-baseline intercept, and the net effect is that Claude beats the human baselines over a much longer range of horizons than other models do (per METR’s logistic fits).
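For intuition on the intercept-vs-slope tradeoff: in a METR-style per-model logistic fit, p(success) = sigmoid(α + β·log2(t)) with β < 0, and the 50% horizon falls where the curve crosses 0.5, i.e. log2(t50) = −α/β. A model with a worse intercept but shallower slope can still have a longer horizon. A minimal sketch with made-up parameters (not METR’s actual fits):

```python
# Illustrative METR-style logistic curves: p(success) = sigmoid(alpha + beta * log2(t)),
# beta < 0 so success falls off with task length. Parameters are hypothetical.
import math

def p_success(alpha: float, beta: float, minutes: float) -> float:
    """Fitted success probability at a given task length."""
    return 1 / (1 + math.exp(-(alpha + beta * math.log2(minutes))))

def horizon_50(alpha: float, beta: float) -> float:
    """Task length (minutes) where p(success) crosses 0.5: 2 ** (-alpha / beta)."""
    return 2 ** (-alpha / beta)

# Model A: better intercept (more reliable on short tasks), steeper falloff.
# Model B: worse intercept, but a shallower slope.
a_alpha, a_beta = 6.0, -1.2
b_alpha, b_beta = 5.0, -0.8

print(horizon_50(a_alpha, a_beta))  # 32.0 minutes (2 ** 5)
print(horizon_50(b_alpha, b_beta))  # ~76 minutes (2 ** 6.25)
```

Here B is worse than A on very short tasks but overtakes it well before either hits 50%, which is the shape of the Claude-vs-GPT-5 comparison being described.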
As for what it would mean for humans to be unable to complete really long/difficult tasks at high reliability, I wonder if that comes back to this argument from the post:
“Many people have the intuition that humans can handle tasks of arbitrary length at high reliability, but of course that depends on task difficulty, and while we can extrapolate the METR curve to weeks or months, the existing/actual tasks are short, so it’s not clear how to estimate the difficulty of hypothetical long METR tasks. There is a tendency to assume these would just be typical long software engineering tasks (e.g. merely time-consuming due to many fairly straightforward subtasks), but there is not much basis for that assumption, as opposed to longer tasks on this length/difficulty trend being more like ‘prove this tricky mathematical theorem’, etc”
@Michaël Trazzi Actually, it’s the opposite: the Claude progress was dominated by slope (β) improvement, and the intercept actually got a bit worse; see “Is METR Underestimating LLM Time Horizons?”