10x more training compute = 5x greater task length (kind of)

I assume you are familiar with the METR paper: https://arxiv.org/abs/2503.14499

In case you aren’t: the authors measured how long humans take to complete various tasks, had LLMs attempt the same tasks, and then calculated the task length (in human time) at which LLMs succeed 50%/80% of the time. Basically, “Model X can do task Y, which takes humans W amount of time, with Z% reliability.”
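If you want the gist of how that 50% number falls out, here is a minimal sketch with completely made-up per-task results (the paper fits something like a logistic curve in log task length; the details below are mine, not theirs):

```python
# Minimal sketch of the METR-style "time horizon" idea (not their actual code).
# Given hypothetical per-task results for one model, fit success probability
# as a logistic function of log(human task length), then solve for the
# length at which predicted success crosses 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: human completion time in minutes, and whether the model succeeded.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded     = np.array([1, 1, 1, 1, 1,  0,  1,   0,   0,   0])

X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# P(success) = 0.5 where the logit is zero: coef * log(t) + intercept = 0.
t50 = np.exp(-clf.intercept_[0] / clf.coef_[0][0])
print(f"50%-reliability horizon: ~{t50:.0f} human-minutes")
```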

Interactive graph: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

In the paper, the authors plotted task length as a function of release date.

Note that for 80% reliability the slope is the same.

IMO this is by far the most useful paper for predicting AI timelines. However, I was upset that their analysis did not include compute. So I took task lengths from the interactive graph (50% reliability) and training-compute estimates from EpochAI: https://epoch.ai/data/notable-ai-models

Here is the result:

(I didn’t add labels for every model because the graph would be too cluttered)

Increasing training compute by 10 times increases task length by 10^0.694≈5 times.

...well, kind of. This graph does not take into account improvements in algorithmic efficiency. For example, Qwen2 72B was trained with roughly the same amount of compute as GPT-3.5, yet has a much longer task length. Based on this data, I cannot disentangle gains from more physical compute from gains from tweaks to the Transformer architecture/training procedure. So it’s more like “increasing training compute by a factor of 10, while simultaneously increasing algorithmic efficiency by some unknown amount, increases task length by a factor of 5.”
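For concreteness, the headline slope is just an ordinary least-squares fit in log-log space. Here is a sketch with made-up numbers standing in for the real (compute, task length) pairs:

```python
# Sketch of the fit behind the headline number (illustrative values, not the real data).
# Regress log10(task length) on log10(training compute); the slope says how many
# orders of magnitude of task length you gain per order of magnitude of compute.
import numpy as np

log10_flops = np.array([23.0, 24.0, 25.0, 25.5, 26.0])   # hypothetical training compute
log10_hours = np.array([-2.0, -1.2, -0.6, -0.4,  0.1])   # hypothetical 50% task length

slope, intercept = np.polyfit(log10_flops, log10_hours, 1)
print(f"slope = {slope:.3f}  ->  10x compute gives {10**slope:.1f}x task length")
```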

Speculation time: at what task length would an AI be considered ASI? Well, let’s say a “weak ASI” is one that can complete a task that takes a human 100 years (again, 50% reliability). And let’s say a “strong ASI” is one that can complete a task that takes a human 5000 years; granted, at that point this way of measuring capabilities breaks down, but whatever.

To train the “weak ASI” you would need around 10^35 FLOPs. For the “strong ASI” you would need around 10^37 FLOPs.
Optimistically, 10^31 FLOPs is about as much as you can get if you buy 500,000 units of GB300 and train an LLM in FP8 at 90% utilization for 365 days nonstop. That would correspond to a task length of around 2.5 months. “Optimistically” because right now you can’t buy 500 thousand of those things even with infinite money, and even if you could, you would probably want to use them for other purposes as well (such as running older models or experiments with novel architectures), not just to train one giant model for a year. So this is more of an upper bound.
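As a sanity check, here is the arithmetic in code. The per-unit throughput is my assumption (I’m treating one “unit” as a GB300 NVL72 system at roughly 720 PFLOP/s in FP8, which reproduces the 10^31 figure); the extrapolation just reuses the 0.694 slope and the 10^31 FLOP ↔ 2.5 months anchor from above:

```python
# Back-of-the-envelope check on the paragraph above. The per-unit throughput is an
# ASSUMPTION: one "unit" is taken to be a GB300 NVL72 system at ~720 PFLOP/s FP8.
# Treat this as a sanity check, not a derivation.
import math

SECONDS_PER_YEAR = 365 * 24 * 3600
units, flops_per_sec, utilization = 500_000, 720e15, 0.90

budget = units * flops_per_sec * utilization * SECONDS_PER_YEAR
print(f"one-year budget: ~{budget:.1e} FLOP")          # ~1.0e31

# Extrapolate along the fitted line, anchored at 1e31 FLOP <-> ~2.5 months.
slope, anchor_flops, anchor_years = 0.694, 1e31, 2.5 / 12

def flops_needed(task_years: float) -> float:
    extra_decades = math.log10(task_years / anchor_years) / slope
    return anchor_flops * 10 ** extra_decades

print(f"weak ASI (100-year task):    ~{flops_needed(100):.0e} FLOP")   # ~1e35
print(f"strong ASI (5000-year task): ~{flops_needed(5000):.0e} FLOP")  # ~1e37
```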

If these numbers are even remotely within the right ballpark, training ASI will be impossible without either software or hardware breakthroughs (such as photonic chips or analog devices for matrix multiplication) that break the current trend. If the current trend continues, then we won’t see ASI for a looooooooong time.

Caveats:

  1. The fit is not very good, R^2=0.73 (though bear in mind that the models used here aren’t quite the same as in the METR paper). It’s interesting that task length is better correlated with release date than with training compute; I was not expecting that.

  2. EpochAI lacks a lot of data on training compute, in particular data on o1, o3, Gemini 2.5 Pro and Flash, Claude 4 Opus and Sonnet. Conversely, METR lacks data on models such as Grok and Llama. More data points would be nice.

I would love it if the authors of the METR paper did a similar analysis, but with more models. Maybe they could disentangle the benefits of increasing training compute from the benefits of improvements in algorithmic efficiency, to get two independent numbers.

EDIT: a little too late, but here:

Using only GPT models gives a much clearer picture with a much better fit, though it also means that I only have 4 data points to work with.

According to this graph, increasing training compute by 10 times results in a 10^0.509≈3.2 times greater task length.
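To put the two slopes side by side, here are the same numbers rearranged: inverting each slope gives the compute multiplier needed for a 10x longer task horizon.

```python
# Invert each fitted slope to get the compute multiplier required
# for a 10x longer task horizon (same numbers as above, rearranged).
for slope in (0.694, 0.509):
    print(f"slope {slope}: 10x task length needs ~{10 ** (1 / slope):.0f}x more compute")
# -> slope 0.694: ~28x; slope 0.509: ~92x
```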