StanislavKrym comments on Is METR Underestimating LLM Time Horizons?

StanislavKrym 18 Jan 2026 4:02 UTC
−1 points
−1
Claude now has a horizon of 444 billion minutes(!)
Could you actually provide a citation for Claude already being a supercoder? If you can’t, then either the model has wrong parameters or is wrong wholesale. What I expect is that the time horizon is exponential until the last few doublings, not hyperbolic. Additionally, I suspect that Claude’s horizon is more due to surprisingly low performance on some tasks well below the alleged horizon.
As for models being woefully outmatched by humans before a specific period when your “benchmark” skyrocketed, it means a different thing. Recall that the METR graph had the models’ performance rise quickly, then slowly until the spike at the very end, that the AI-2027 forecast estimated the human speed of thought as 10-20 tokens/sec and that the models, unlike the humans, can only have the CoT and stuff tokens into the mechanism which ejects the next one without learning anything from experience.
Were METR’s baselining process simulated and placed into the METR graph with 20 tokens/simulated second of a human doing the task, the performance on ranges higher than the time necessary to introduce the baseliners to the tasks would likely resemble a straight line where 1K tokens is a bit lower than a minute and 100K tokens are a between 1 hr and 2 hrs. I manually edited the line into the graph and added another line where the hypothetical model requires 100 times more tokens. The models first don’t display progress at all, then proceed faster (OpenAI’s models) or about as fast as the humans (GPT-4o, Claude Sonnet 4.5, Grok 4), then ALL models begin to proceed far slower, almost as if their competence is exhausted by harder tasks, and finally models since o3 display a jump, as if they did something, weren’t confident, but decided to submit anyway.
- andreasrobinson 18 Jan 2026 5:26 UTC
  5 points
  0
  Parent
  Could you actually provide a citation for Claude already being a supercoder?
  I did not claim that Claude is a “supercoder” or even human-level at coding; rather, the Claude addendum continued with: “to be clear, we shouldn’t over-interpret this specific 444 billion figure” and “Realistically, this highlights that to really make accurate projections of the time to catch up with human horizons based on METR data, we need better human baselines.” In my view, the natural takeaway is that that Claude has now basically caught up with METR’s existing human baselines, which they have acknowledged were not that well incentivized, which does not mean that it is better than properly incentived software engineers.
  However, per the sensitivity analysis, if we assume well incentivized humans could do ~2x better than METR’s baselines on METR’s longest benchmark tasks, “then Claude 4.5 Opus has an intersection-based time horizon of only 35.9 minutes”, ie far from human-level. So as I said in the post, I do think this highlights the need for better human baselines for METR, but while the current horizon estimates are quite sensitive to the baselines, the estimated time to human-level doesn’t actually shift that much with this stronger baseline, i.e. from early 2026 to late 2026.
  In general, the primary point of the post wasn’t that the current baselines are good enough to make an accurate prediction of human-level horizons using METR data, but rather “my main takeaway from this analysis is probably that we shouldn’t over-interpret the METR trends at fixed reliability as a direct marker of progress towards human-level software horizons” (because the METR metrics are likely underestimating the progress rate, due to using fixed reliabilities at all horizons)
  “What I expect is that the time horizon is exponential until the last few doublings, not hyperbolic.”
  I provided both theoretical and statistical arguments (e.g. AIC) in the post for why the human-relative time horizon trend is likely hyperbolic rather than exponential, and your comment does not address or acknowledge either of those arguments. Note the post does argue that METR’s metrics likely are exponential, so the hyperbolic claim is specifically about human-relative time horizon metrics (per the proposal in the post).
  - StanislavKrym 18 Jan 2026 21:26 UTC
    1 point
    0
    Parent
    I notice that I am confused. You imply that the human-equivalent horizon of a model is $H_{h} = 2^{\frac{α_{h} - α}{β - β_{h}}}$ Then the LOGARITHM of $H_{h}$ is $\frac{α_{h} - α}{β - β_{h}}$ and it is the LOGARITHM which likely behaves linearly if $β$ is constant and $α$ changes linearly or hyperbolically if $β$ changes linearly. Alas, $β$ doesn’t change linearly across models. Instead, as far as I understand, $β$ is calculated as ${(log \frac{T H_{50}}{T H_{80}})}^{- 1}$ Were $β$ monotoneous, we would also expect monotoneous changes in the ratio of time horizons $\frac{T H_{50}}{T H_{80}}$ Instead, the ratios are this. Setting aside Claude Opus 4.5 with the ratio equal to 10.64, the next two biggest ratios are displayed by DeepSeek R1-0528 (8.53) and Grok 4 (7.14). Therefore, the ratio of the time horizons did NOT display a consistent trend at least before Claude Opus 4.5.
    - andreasrobinson 19 Jan 2026 6:29 UTC
      1 point
      0
      Parent
      If you run a linear regression on $β$ versus time, then the regression line does have a positive slope estimate (even pre-claude-4.5), and I used that regression line in section 2.2.1.1 to provide an alternative estimate for when LLMs will catch up with human $β$ (at which point the denominator goes to zero, and the LLM’s horizon blows up). That said, the $β$ trend is quite noisy; and while the OpenAI trend from that estimate was already stat sig, the overall $β$ trend across companies did not hit stat sig until the Claude Opus 4.5 datapoint. Also, as I argued in the post (partly due to the noisy $β$ trend): “I suspect it makes more sense to directly extrapolate the overall time-horizon estimate rather than linearly extrapolating the noisy logistic coefficient in isolation, even if the slope trend is a useful intuition-pump for seeing why a finite-time blowup is plausible” (btw, the specific projection in that section was based on just OpenAI instruct-tuned models, i.e. post gpt-3 models)
      Also, the underlying question in this case was whether the LLM slope would catch up to human baseline slope, but it’s somewhat moot since that has basically happened with Claude 4.5; and if METR were to collect better incentivized human baselines (with better slope $β$ ), it seems quite likely that there would be a similar catch up to match this improved $β$ , leading to another blowup in the updated (human-relative) LLM time horizons.
      (edited)
      - StanislavKrym 19 Jan 2026 15:43 UTC
        1 point
        0
        Parent
        The trend itself was this:
        gpt_3_5_turbo_instruct 3.49
        gpt_4 5.54
        gpt_4_0125 4.47
        gpt_4_1106 5.87
        gpt_4_turbo 4.30
        gpt_4o 5.48
        o1_preview 4.76
        o1_elicited 6.76
        o3 4.39
        o4-mini 4.93
        gpt_5 5.18
        gpt_5_1_codex_max 5.36
        Neither I nor GPT-5.2 believe that THIS trend is consistent enough. Additionally, Claude Opus 4.5 had its share of doubts cast upon the abnormally high 50% time horizon. Finally, what would it mean for a hired human to have a 50% or 80% chance of succeeding at year-long tasks? That the human cannot do the task ~at all, even given 10 years? But even this example is not that an example...
        andreasrobinson 19 Jan 2026 19:21 UTC
        1 point
        −1
        Parent
        Your link to chatgpt’s analysis is not loading for me, but I don’t really trust chatgpt to do this without mistakes anyway; when I run this regression in python statsmodels with METR’s official $β$ values and release dates from their github (plus the recent gpt-5 models), it is stat sig for this OpenAI instruct-model trend (p ~ .04) and it gets more stat sig once we include Claude 4.5. I don’t see what dates you are using, but your beta values have a fair amount of (rounding?) error compared to METR’s data, and the model list isn’t the same, e.g. they didn’t include non-frontier mini models, etc. (btw, note that one-sided p-values are appropriate in this case since only positive slope would be treated as evidence of catching up).
        But again, the main question from that particular trend/argument was whether LLMs were catching up with the human baseline slope, which Claude now has basically done for the existing baselines, so that seems pretty strongly confirmatory. It’s true that Claude has a somewhat worse logistic intercept than gpt-5 models (associated w/ a bit worse short-task reliability), but it is still better than the human baseline intercept, and the net effect is that Claude beats the human baselines over a much longer range of horizons than other models (per METR’s logistic fits).
        As far as what it would mean for humans to not be able to complete really long/difficult tasks at high reliability, I wonder if that comes back to this argument from the post:
        “Many people have the intuition that humans can handle tasks of arbitrary length at high reliability, but of course that depends on task difficulty, and while we can extrapolate the METR curve to weeks or months, the existing/actual tasks are short, so it’s not clear how to estimate the difficulty of hypothetical long METR tasks. There is a tendency to assume these would just be typical long software engineering tasks (e.g. merely time-consuming due to many fairly straightforward subtasks), but there is not much basis for that assumption, as opposed to longer tasks on this length/difficulty trend being more like ‘prove this tricky mathematical theorem’, etc”
        (edited)