This has been one of the most important results for my personal timelines to date. It was a big part of the reason why I recently updated from ~3 year median to ~4 year median to AI that can automate >95% of remote jobs from 2022, and why my distribution overall has become more narrow (less probability on really long timelines).
Naively extrapolating this trend gets you to 50% reliability on 256-hour tasks in 4 years, which is a lot but still far short of the years-long reliability humans have. So I must be missing something. Is it that you expect most remote jobs not to require more autonomy than that?
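For concreteness, the naive extrapolation can be sketched as simple doubling arithmetic. The ~1-hour starting horizon and ~6-month doubling time below are illustrative assumptions chosen to reproduce the 256-hours-in-4-years figure, not values from the paper:

```python
# Naive exponential extrapolation of the 50%-reliability time horizon.
# ASSUMPTIONS (illustrative, not from the paper): the current horizon is
# about 1 hour and it doubles roughly every 6 months.

def horizon_hours(months_from_now, start_hours=1.0, doubling_months=6.0):
    """Task length (hours) completed with 50% reliability after a delay."""
    return start_hours * 2 ** (months_from_now / doubling_months)

# Four years = 48 months = 8 doublings: 1 h -> 256 h.
print(horizon_hours(48))  # 256.0
```

Under these assumptions the extrapolation is exactly 8 doublings, which is where the 256-hour figure comes from.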
I expect the trend to speed up before 2029 for a few reasons:
AI accelerating AI progress once we reach time horizons of tens of hours.
The trend might be “inherently” superexponential. It might be that unlocking some planning capability generalizes very well from 1-week to 1-year tasks and we just go through those doublings very quickly.
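One way to make the "inherently superexponential" possibility concrete is a trend whose doubling time itself shrinks with each successive doubling. This is a hypothetical toy model; the parameters are made up, not estimated from the data:

```python
# Toy comparison: plain exponential growth vs. a superexponential trend
# whose doubling time shrinks by a fixed factor at each doubling.
# All parameters are illustrative, not fitted to anything.

def months_to_reach(target_hours, start_hours=1.0,
                    first_doubling_months=6.0, shrink=1.0):
    """Months until the horizon reaches target_hours, when each successive
    doubling takes `shrink` times as long as the previous one.
    shrink=1.0 recovers ordinary exponential growth."""
    months, horizon, step = 0.0, start_hours, first_doubling_months
    while horizon < target_hours:
        months += step
        horizon *= 2
        step *= shrink
    return months

print(months_to_reach(256))              # exponential: 48.0 months
print(months_to_reach(256, shrink=0.8))  # superexponential: ~25 months
```

With shrink < 1 the later doublings go by very quickly, which is the "we just go through those doublings very quickly" picture.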
I can see an argument for why; tell me if this is what you're thinking:
The biggest reason why the LLM paradigm might never reach AI takeoff is that LLMs can only complete short tasks and can't maintain coherence over longer time scales (e.g., if an LLM writes something long, it will often start contradicting itself). Intuitively, it seems that scaling up LLMs hasn't fixed this problem. However, this paper shows that LLMs have in fact been steadily getting better at longer tasks, so LLMs probably will scale to AGI.
I think doing 1-week or 1-month tasks reliably would suffice to mostly automate lots of work.
Indeed, I would argue that the trend pretty much has to be inherently superexponential. My argument is still kinda fuzzy; I'd appreciate help making it clearer. At some point I'll find time to try to improve it.
The trend probably sped up in 2024. If the future trend follows the 2024–2025 trend, we get 50% reliability at 167 hours in 2027.
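The same doubling arithmetic works in reverse: given a doubling time, you can back out roughly when a target horizon is reached. The ~1-hour early-2025 starting point and ~4-month doubling time below are illustrative stand-ins for the faster 2024–2025 trend, not the paper's fitted values:

```python
import math

# Back out the approximate year a target 50%-reliability horizon is
# reached under a faster post-2024 doubling time. Starting horizon and
# doubling time are illustrative assumptions, not fitted values.

def year_reached(target_hours, start_hours=1.0, start_year=2025.0,
                 doubling_months=4.0):
    """Year when the horizon first reaches target_hours."""
    doublings = math.log2(target_hours / start_hours)
    return start_year + doublings * doubling_months / 12

print(round(year_reached(167)))  # 2027 under these assumptions
```

Reaching 167 hours from 1 hour takes about 7.4 doublings; at roughly 4 months per doubling that lands in 2027, matching the figure above.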
Why do you think this narrows the distribution?