ryan_greenblatt comments on AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines

ryan_greenblatt 6 Apr 2026 23:51 UTC
LW: 9 AF: 5
This is a good point and I broadly agree with what you’re saying. Some possible disagreements:

Once you are talking about scaling to arbitrary inference costs, I think the relevant notion of time horizon is closer to: “For what T could you solve this task using an arbitrary supply of humans each of whom only works on the project for T hours?”

Maybe I should have made this clearer, but I’m not talking about scaling to arbitrary inference costs. Instead, I’m talking about scaling to inference costs that are a moderate fraction of the human task completion cost. (E.g., 1%-100% depending on the task.) I think you’d want to compare the AI performance at some inference budget to human labor with some limit in supply.

So I’m not sure how much of the phenomenon you’re observing is “there is a way longer horizon for these tasks” rather than “a more careful definition of horizon is more important for these tasks.” Probably some of both, but it seems quite possible that for the tasks you are describing the horizon length is really more like a few days or a week than months or a year.

Yep, it seems reasonably likely that on this alternative notion of time horizon, the horizon length is more like a few days. However, under this alternative definition, relatively small time-horizon values may correspond to much larger (real-world) impact. For example, an AI with a decomposed-time-horizon of ~1 week could potentially speed up AI R&D a lot, while under the original time-horizon notion, a 1-week 50%-reliability time horizon is much less of a big deal. (To be more precise about the decomposed-time-horizon metric: this would correspond to AI having a 50% chance of matching a team of humans where each human gets 1 week before being replaced, with total labor-hours capped at ~100x what the task would normally require to avoid degeneracies from truly arbitrary inference spend.) So the key question becomes: how much can you accomplish if the task must be heavily decomposed but you can apply much more labor? I’m pretty uncertain about this for the tasks we care about.
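
To make the labor cap in the parenthetical concrete, here is a minimal sketch of the arithmetic it implies for the human comparison team. The function name, the reading of "1 week" as 40 working hours, and the 2,000-hour example task are all illustrative assumptions, not part of the definition above; only the ~100x cap comes from the text.

```python
import math


def decomposed_team_budget(task_hours, tenure_hours, labor_cap_multiple=100.0):
    """Return (max_total_hours, max_stints) for the human comparison team.

    task_hours: hours the task would normally require (hypothetical input).
    tenure_hours: hours each human works before being replaced.
    labor_cap_multiple: cap on total labor as a multiple of task_hours
        (the ~100x figure from the text; the exact value is a rough guess).
    """
    if task_hours <= 0 or tenure_hours <= 0 or labor_cap_multiple <= 0:
        raise ValueError("all arguments must be positive")
    max_total_hours = labor_cap_multiple * task_hours
    # Number of full stints (one human, one tenure) that fit under the cap.
    max_stints = math.floor(max_total_hours / tenure_hours)
    return max_total_hours, max_stints


# Assumed example: a task that normally takes ~2,000 hours,
# with each human limited to one 40-hour working week.
total_hours, stints = decomposed_team_budget(task_hours=2000, tenure_hours=40)
print(total_hours, stints)
```

Under these assumptions the comparison team could use up to 5,000 one-week stints, which shows how much slack the cap leaves for decomposition overhead before it binds.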

Further, AIs may soon end up being especially good (superhuman?) at working in heavily decomposed contexts—e.g., good at leaving notes to other instances, good at quickly picking up project state from limited context. This makes the correspondence to doing task decomposition with humans more fraught, even though AIs will still do relatively better at tasks that are easier to do incrementally or otherwise decompose. I think it’s already the case that AIs are extremely good relative to humans at loading up complex state so long as that state is written down (reasonably clearly).

I was already pretty uncertain about how to translate time horizon into downstream impacts, and while this alternative metric might be more consistent across different groups of tasks, it makes translating to downstream impact harder. An alternative is just explicitly thinking about the time horizon for different buckets of tasks using the original notion.

Regardless, I agree that AI capabilities may be better understood using this alternative time-horizon notion, and this does closely correspond to how AIs are actually being used. (My scaffold is mostly just doing task decomposition, and something like this is effectively required for good performance given current AI properties.)
- paulfchristiano 7 Apr 2026 4:18 UTC
  LW: 10 AF: 8
  Maybe I should have made this clearer, but I’m not talking about scaling to arbitrary inference costs. Instead, I’m talking about scaling to inference costs that are a moderate fraction of the human task completion cost. (E.g., 1%-100% depending on the task.) I think you’d want to compare the AI performance at some inference budget to human labor with some limit in supply.
  I agree. In the more realistic regime you are talking about, there is a more complicated quantitative question: how large are the slowdowns from task decomposition, and at what scale of decomposition?
  My main point was that for the tasks we are talking about here, the slowdowns seem like they might not be that large even for modest human time horizons. (In contrast with some of the crazy factored cognition stuff we have sometimes talked about, which involves much shorter horizons, much harder-to-decompose tasks, and much larger slowdowns.)
  However, under this alternative definition, relatively small time-horizon values may correspond to much larger (real-world) impact.
  I agree that this could lead to large impacts with relatively short horizons (perhaps even today’s horizons, with an appropriately broadened training distribution and a bunch of schlep). That does imply a different picture of AI strengths and weaknesses (e.g. weaker on-the-job learning with performance mostly limited to domains near the training distribution; differential speedup for tasks that are easily decomposed), with a more schlep-heavy singularity, a greater role for tight human involvement later in the process, and probably less alignment concern earlier in the trajectory.