An apple picking model for AI R&D

Link post

As we move into the era of Claude Opus 4.5 and Mythos, an important question is how these models will affect AI R&D, and Tom Cunningham makes a very underrated point:

It is possible for AI research to be autonomous and for the AI researchers to still face diminishing returns. In that regime you want to spend on agents first, then on humans, unless AI completely closes the loop on AI R&D so that humans no longer add value in AI R&D. Notably, many of the models that predict an AI explosion rely on AI R&D being completely automated by AI, or on something else being fully automated by AI R&D, such as Davidson and Eth (2025) or Davidson et al. (2026).

Here are some informative sections of the post:

Time horizons as measured by METR are meaningful metrics of progress

Because of the tortoise-hare behavior of agents we can calibrate an agent’s ability by the point at which a human and an agent, given equal expenditure, will make equal progress. This is, very loosely, the way agent time horizon is identified in Wijk et al. (2025) and Kwa et al. (2025).
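Loosely, the identification idea can be sketched in a few lines. The progress curves below are invented for illustration (the post does not specify functional forms); the "horizon" is simply the equal-expenditure crossover point where the tortoise catches the hare:

```python
import numpy as np

# Hypothetical progress curves (my assumption, not from the post): the
# "hare" agent makes fast early progress that saturates; the "tortoise"
# human makes slow, roughly linear progress. All constants are illustrative.
def agent_progress(t, cap=30.0, rate=1.0):
    """Saturating progress: fast at first, flattens toward `cap`."""
    return cap * (1 - np.exp(-rate * t))

def human_progress(t, rate=1.0):
    """Roughly linear progress in units of expert expenditure t."""
    return rate * t

# Calibrate the agent's "time horizon" as the expenditure t* at which
# equal spend yields equal progress (first point where the human pulls ahead).
ts = np.linspace(0.01, 200, 20000)
gap = agent_progress(ts) - human_progress(ts)
t_star = ts[np.argmax(gap < 0)]  # first index where the human is ahead
print(f"calibrated horizon ~ {t_star:.1f} units of expenditure")
```

With these toy curves the crossover lands near the agent's saturation cap, which matches the intuition that the measured horizon mostly reflects how much low-hanging progress the agent can reach before flattening out.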

An important implication is that an agent’s time horizon is sensitive to the starting point, in a way that differs from human effort. If we have a starting-point that has only been optimized by humans we expect agents can push it forward a lot. But if we have already applied some agent labor to the algorithm then further agentic labor will have much lower returns, i.e. time horizons will be much shorter. Concretely: one agent can be as good as one human (or human-week), but two agents are not as good as two humans.

Based on a very loose reading of the evidence we could say that agents (as of March 2026) are able to push forward the frontier on optimization problems by the equivalent of around a month of professional effort. However they then hit a wall and need either stronger models or better harnesses.

Apple picking implies a one-time jump in progress

The apple-picking model implies a different pattern: for each model generation, agents will autonomously advance the frontier, but they will then quickly hit diminishing returns. When the marginal returns to human and agent expenditure are equalized, we will return to investment in human optimization.

We give quotes below from a variety of domains with claims that (1) agents are autonomously advancing the frontier; (2) those advances have hit diminishing returns. We show below in the theory section that the optimal allocation of expenditure will be to first invest in agentic optimization, then switch back to humans.
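The optimal-allocation claim can be illustrated with a toy greedy model. Everything here (the functional forms and constants) is my invention for illustration, not the post's actual model:

```python
import numpy as np

# Toy allocation sketch (assumed forms): on a fixed model generation,
# agent labor has steeply diminishing marginal returns (the branch of
# apples runs out), while human labor is taken to have roughly constant
# marginal productivity. Allocate each unit of spend greedily to
# whichever channel currently has the higher marginal return.
def marginal_agent(spend):
    return 10.0 * np.exp(-spend)   # steep decay: apples get picked fast

def marginal_human(spend):
    return 2.0                     # roughly flat human productivity

spend_agent = spend_human = 0.0
schedule = []
for _ in range(50):                # allocate 50 unit "dollars"
    if marginal_agent(spend_agent) >= marginal_human(spend_human):
        spend_agent += 1.0
        schedule.append("agent")
    else:
        spend_human += 1.0
        schedule.append("human")

# The greedy schedule funds agents first, then switches permanently to
# humans once marginal returns have equalized.
switch = schedule.index("human")
print(f"dollars to agents before the switch: {switch}")
```

If a new model generation resets the agent's marginal-return curve, the schedule would briefly flip back to agents before switching again, which is the punctuated-equilibrium pattern described below.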

As new more powerful agents are released, we should expect a sort-of punctuated equilibrium, as each successive branch of apples is picked. Terence Tao says, on Erdos problems: “Maybe the next time there’s a big advance in the models, they will try it again, and a few more will be breached.”

In reality aggregate progress is likely to appear smooth for a few reasons: (1) models are released at a quick cycle, and harnesses are constantly being updated; (2) each human discovery opens up room for agent discoveries (assumed away in the model); (3) LLMs are used to augment human activity, as well as autonomously do R&D.
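Reason (1) can be sketched numerically: when releases are frequent relative to how fast each branch of apples is picked, a sum of step-like saturating jumps looks close to linear. All parameters here are invented for illustration:

```python
import numpy as np

# Each model generation contributes a one-time saturating jump to
# cumulative progress; generations are released frequently, so the
# jumps overlap and the aggregate looks smooth. (Toy parameters.)
def jump(t, release, size=1.0, rate=1.0):
    """One generation's contribution: zero before release, saturating after."""
    return np.where(t < release, 0.0,
                    size * (1 - np.exp(-rate * (t - release))))

t = np.linspace(0, 10, 1001)
releases = np.arange(0, 10, 0.25)        # a new generation every quarter-unit
total = sum(jump(t, r) for r in releases)

# After the initial ramp-up, the aggregate growth rate is nearly constant
# even though every individual term is a punctuated jump.
growth = np.diff(total) / np.diff(t)
late = growth[500:]                       # back half of the time axis
print(f"late growth rate stays within [{late.min():.2f}, {late.max():.2f}]")
```

With releases this frequent the growth rate varies by only a few tens of percent over a cycle; stretch the release interval well past the saturation timescale and the same code produces visible stair-steps.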

Agents are improving on the state-of-the-art in well-studied optimization problems.

  • Andrej Karpathy’s autoresearch (March 2026) tasks an agent with reducing the validation loss of a GPT-2-small pretraining run under a fixed compute budget (one H100, ~5 minutes per training loop). Over ~2 days the agent tried ~700 changes and found ~20 additive edits, yielding an ~11% improvement in “Time-to-GPT-2”. Andrej Karpathy says “all the adjustments are ‘real’, I didn’t find them manually previously, and they stack up and actually improved nanochat.”

  • nanoGPT speedrun (Jordan and contributors (2026)) is a public competition to minimize training time for GPT-2 given a fixed target loss, which has brought training time from 45 min down to 1.4 min over 77 records since May 2024. Four recent improvements are tagged as contributed with the help of “AI systems”.

  • TTT-Discover (Yuksekgonul et al. (2026), January 2026), a test-time training method, optimized the TriMul GPU kernel used in AlphaFold, achieving >15% improvement over the best human implementations. The authors of the TriMul task, expert kernel engineers, called it “legit” and noted the strategy was “similar to the current best humans, but executed better,” with most human solutions falling behind on fusing more complex operators together.

Some optimizations are deeper than others

The nanoGPT speedrun provides a useful case study as a public ledger of cumulative human effort on an AI R&D problem. Some deep contributions, from humans, are:

  • Muon (October 2024) came from original research in the nanogpt codebase on Newton-Schulz orthogonalization, cutting training time by 21% (31.4 → 24.9 min). It has since been adopted widely, including by Kimi K2 (1T MoE), GLM-4.5 (355B MoE), and Arcee Trinity (400B MoE), and is now part of PyTorch’s standard optimizer suite.

  • U-Net skip connections (November 2024) applied an encoder-decoder pattern from 2015 computer vision to transformer layers, yielding an 8% speedup (7.8 → 7.2 min). This became foundational and later records kept building on it.

  • Paired Head Attention (January 2026) is a novel attention mechanism that interleaves K/Q/V across head pairs to double the effective sequence length in attention.

These required theoretical insight, cross-domain transfer, or novel architectural ideas.

Agent optimizations are often described as shallow

  • Several AI R&D and optimization benchmarks, such as MLGymBench (Nathani et al. (2025)), GSO (Shetty et al. (2025)), and SWE-fficiency (Ma et al. (2025)), report that agents achieve “surface-level speedups” but “fail to discover algorithmic innovations.” For instance, one of the largest speedups in AlgoTune (Press et al. (2025)) is 142× on a graph communicability task, achieved by replacing pure Python with BLAS calls.

  • In nanoGPT speedrun the AI-contributed patches appear to be shallow relative to the optimizations above (e.g., Muon’s 21% speedup): replacing Python loops with faster library calls (hiverge.ai, ~1.2%) and combining two GPU operations to avoid writing intermediate results to memory (Locus, ~0.9%). These can be classified as typical optimization techniques that apply to many problems.

  • In autoresearch Karpathy says “It’s not novel, ground-breaking ‘research’ (yet), but all the adjustments are ‘real’”. The improvements that worked were things like adjusting AdamW constants, adding a scalar multiplier for QKnorm, and even making random seed changes. Overall he notes that the agents feel “cagey” on open-ended ideas.

  • Terence Tao has described the contribution of AI to mathematical discovery:

    “What AI has been very good at is systematically exploring this long tail and knocking off the easiest of the problems.” (ref)

    “Fifty-odd problems have been solved with AI assistance, which is great, but there’s like six hundred to go. People are still chipping away at one or two of these right now.” (ref)
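The "replace Python loops with BLAS calls" pattern is easy to make concrete. The toy benchmark below is mine, not AlgoTune's actual task or any agent's patch; the sizes are arbitrary:

```python
import time
import numpy as np

# Illustrative "shallow" speedup: the same O(n^3) matrix-multiply
# algorithm, first as an interpreted pure-Python triple loop, then
# dispatched to a compiled BLAS routine via NumPy. No algorithmic idea
# changes; only the constant factor does.
def matmul_python(a, b):
    """Naive triple loop over Python lists, one scalar op at a time."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                out[i][j] += aik * b[k][j]
    return out

n = 100
rng = np.random.default_rng(0)
A = rng.random((n, n))

t0 = time.perf_counter()
matmul_python(A.tolist(), A.tolist())
t_py = time.perf_counter() - t0

t0 = time.perf_counter()
A @ A                      # dispatches to a BLAS gemm kernel
t_np = time.perf_counter() - t0

print(f"BLAS speedup over pure Python: ~{t_py / t_np:.0f}x")
```

The win is a large constant factor with zero algorithmic novelty, which is exactly why such patches get classified as shallow: the technique transfers to many problems but does not deepen the frontier.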

Broader implications for the software intelligence explosion debate (not in the link)

There is an Epoch post, “The software intelligence explosion debate needs experiments”, about the flawed data and models behind the debate over software intelligence explosions. I agree that current models and data are quite flawed. Unfortunately, I am much more pessimistic than Epoch about resolving this uncertainty via experimentation in the near term, because experiments run today or in the next 1-2 years would be in the regime where AI models likely cannot fully close the loop on AI R&D automation. If so, agent returns to research will diminish and not produce a software intelligence explosion, even if a model that could fully automate AI R&D would cause one: the models that predict software intelligence explosions with probabilities high enough to worry about are precisely the models that assume full R&D automation has been achieved.

I’d have a similar worry about other milestones (full automation of, say, industry is very different from partial automation of industry), but I restricted the point to software for convenience.

However, I still agree with the article. Even though most of the proposed experiments would probably not give us much information right now, and we cannot reduce phase-change risk to low enough levels, it matters a lot to set up the infrastructure so that AGI labs can run the proposed experiments cheaply. Iterating to find good experimental models will take time, and during the early TAI/crunch period we will be bottlenecked on researcher hours (human, and maybe AI) relative to compute availability (whereas today we are relatively rich in researcher hours and poor in compute), for reasons that Cleo Nardo explains.

The experiments proposed in the Epoch article are thus simultaneously less useful today than Epoch suggests and still very valuable to set up, so that we can run them cheaply when we actually need them on hand to predict takeoff.
