Head of linear regression at METR.[1]
Previously: MIRI → interp with Adrià and Jason → METR.
I have signed no contracts or agreements whose existence I cannot mention.
[1] this is a joke, I am a member of technical staff
In the 9 months since the METR time horizon paper (during which AI time horizons have increased by ~6x), it’s generated lots of attention as well as various criticism on LW and elsewhere. As one of the main authors, I think much of the criticism is a valid response to misinterpretations, and want to list my beliefs about limitations of our methodology and time horizon more broadly. This is not a complete list, but rather whatever I thought of in a few hours.
Time horizon is not the length of time AIs can work independently
Rather, it’s the amount of serial human labor they can replace with a 50% success rate. When AIs solve tasks they’re usually much faster than humans.
Time horizon is not precise
When METR says “Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins)”, we mean those error bars. They were generated via bootstrapping, so if we randomly subsample harder tasks our code would spit out <1h49m 2.5% of the time. I really have no idea whether Claude’s “true” time horizon is 3.5h or 6.5h.
Error bars have historically been a factor of ~2 in each direction, worse with current models like Opus 4.5 as our benchmark begins to saturate.
Because model performance is correlated, error bars for relative comparisons between models are a bit smaller. But it still makes little sense to care about whether a model is just below frontier, 10% above the previous best model, or 20% above.
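A minimal sketch of how such a bootstrap CI can be produced (synthetic data and a hand-rolled logistic fit; none of the numbers here are METR's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: 170 tasks with log-uniform human baseline
# lengths (minutes), attempted by a model whose true 50% horizon is
# 120 minutes. All values are made up, not METR's data.
n = 170
lengths = np.exp(rng.uniform(np.log(1), np.log(600), n))
x = np.log(lengths)
p_true = 1 / (1 + np.exp(x - np.log(120)))   # logistic in log-length
success = (rng.random(n) < p_true).astype(float)

def h50(x, y, iters=25):
    """Two-parameter logistic fit of success on log(length) via
    Newton/IRLS; returns the estimated 50% time horizon."""
    X = np.column_stack([x, np.ones_like(x)])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p) + 1e-9
        H = X.T @ (W[:, None] * X) + 1e-6 * np.eye(2)
        beta += np.linalg.solve(H, X.T @ (y - p))
    a, b = beta
    return np.exp(-b / a)   # success = 50% where a*log(L) + b = 0

point = h50(x, success)

# Bootstrap over tasks: resamples that happen to draw harder tasks give
# lower horizons, and the 2.5th/97.5th percentiles are the reported CI.
boots = []
for _ in range(500):
    idx = rng.integers(0, n, n)
    boots.append(h50(x[idx], success[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])
```

Even with the logistic model exactly correct by construction, the interval is wide, which is the "factor of ~2 in each direction" point.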
Time horizon differs between domains by orders of magnitude
The original paper measured it on mostly software and research tasks. Applying the same methodology in a follow-up found that time horizons are fairly similar for math, but 40-100x lower for visual computer use tasks, due to eg poor perception.
Claude 4.5 Sonnet’s real-world coffee-making time horizon is only ~2 minutes
Time horizon does not apply to every task distribution
On SWE-Lancer OpenAI observed that a task’s monetary value (which should be a decent proxy for engineer-hours) doesn’t correlate with a model’s success rate. I still don’t know why this is.
Benchmark vs real-world task distribution
We’re making tasks just ahead of what we expect future models to be able to do, and benchmark construction has many design choices.
We try to make tasks representative of the real world, but as in any benchmark, there are inherent tradeoffs between realism, diversity, fixed costs (implementation), and variable costs (ease of running the benchmark). Inspect has made this easier but there will obviously be factors that cause our benchmarks to favor or disfavor models.
Because anything automatically gradable can be an RL environment, and models are extensively trained using RLVR [1], making gradable tasks that don’t overestimate real-world performance at all essentially means making more realistic RLVR settings than labs, which is hard.
Figure 1: What it feels like making benchmarks before frontier models saturate them
Our benchmarks differ from the real world in many ways, some of which are discussed in the original paper.
Low vs high context (low-context tasks are isolated and don’t require prior knowledge about a codebase)
Well-defined vs poorly defined
“Messy” vs non-messy tasks (see section 6.2 of original paper)
Different conventions around human baseline times could affect time horizon by >1.25x.
I think we made reasonable choices, but there were certainly judgement calls here; the most important thing was to be consistent.
Baseliner skill level: Our baseliner pool was “skilled professionals in software engineering, machine learning, and cybersecurity”, but top engineers, e.g. lab employees, would be faster.
We didn’t incorporate failed baselines into time estimates because baseliners often failed for non-relevant reasons. If we used survival analysis to interpret an X-hour failed baseline as information that the task takes >X hours, we would increase measured task lengths.
When a task had multiple successful baselines we aggregated these using the geometric mean. Baseline times have high variance, so using the arithmetic mean would increase averages by ~25%.
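To see the size of this aggregation choice, here's a quick numerical sketch (all times are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical baseline times (minutes) for one task
times = np.array([30.0, 45.0, 120.0])
arith = times.mean()                      # 65.0
geo = np.exp(np.log(times).mean())        # ~54.5; AM >= GM always

# With lognormally distributed times (a common model for task
# durations), the AM/GM ratio is exp(sigma^2 / 2); sigma ~ 0.7
# gives roughly the ~25% gap mentioned above.
samples = np.exp(rng.normal(np.log(60), 0.7, size=1000))
ratio = samples.mean() / np.exp(np.log(samples).mean())
```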
A 50% time horizon of X hours does not mean we can delegate tasks under X hours to AIs.
Some (reliability-critical and poorly verifiable) tasks require 98%+ success probabilities to be worth automating
Doubling the time horizon does not double the degree of automation. Even if the AI requires half as many human interventions, it will probably fail in more complex ways requiring more human labor per intervention.
To convert time horizons to research speedup, we need to measure how much time a human spends prompting AIs, waiting for generations, checking AI output, writing code manually, etc. when doing an X hour task assisted by an AI with time horizon Y hours. Then we plug this into the uplift equation. This process is nontrivial and requires a much richer data source like Cursor logs or screen recordings.
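One simple Amdahl-style version of such a calculation (my assumption about the functional form; the post doesn't spell out the uplift equation):

```python
# A human doing an X-hour task delegates a fraction f of the work to an
# AI that completes that portion s times faster, and spends the rest of
# the time prompting, checking output, and coding manually.
def uplift(f: float, s: float) -> float:
    return 1 / ((1 - f) + f / s)

half_at_10x = uplift(0.5, 10)   # delegating half the work at 10x speed
```

Note the bottleneck structure: even s → ∞ on half the work caps the speedup at 2x, which is why measuring the human-side time fractions matters so much.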
20% and 80% time horizons are kind of fake because there aren’t enough parameters to fit them separately.
We fit a two-parameter logistic model which doesn’t fit the top and bottom of the success curve separately, so improving performance on 20% horizon tasks can lower 80% horizon.
It would be better to use some kind of spline with logit link and monotonicity constraint. The reasons we haven’t done this yet: (a) 80% time horizon was kind of an afterthought/robustness check, (b) we wanted our methods to be easily understandable, (c) there aren’t enough tasks to fit more than a couple more parameters, and (d) anything more sophisticated than logistic regression would take longer to run, and we do something like 300,000 logistic fits (mostly for bootstrapped confidence intervals) to reproduce the pipeline. I do recommend doing this for anyone who wants to measure higher quantiles and has a large enough benchmark to do so meaningfully.
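The simplest monotone alternative is isotonic regression via pool-adjacent-violators, which enforces the monotonicity a logistic gets for free without its rigid two-parameter shape. A sketch (my own implementation, not METR's pipeline):

```python
import numpy as np

def pava_nonincreasing(y):
    """Least-squares non-increasing fit via pool-adjacent-violators.
    Applied to success indicators on tasks sorted by length, any
    quantile (20%, 50%, 80%) can be read off the fitted curve."""
    vals, wts, cnts = [], [], []
    for v in y:
        vals.append(float(v)); wts.append(1.0); cnts.append(1)
        # merge adjacent blocks while the non-increasing constraint
        # is violated, replacing them with their weighted mean
        while len(vals) > 1 and vals[-2] < vals[-1]:
            w = wts[-2] + wts[-1]
            merged = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / w
            c = cnts[-2] + cnts[-1]
            vals[-2:], wts[-2:], cnts[-2:] = [merged], [w], [c]
    return np.repeat(vals, cnts)

# toy success indicators for tasks sorted from shortest to longest
y = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
fit = pava_nonincreasing(y)
```

A spline with logit link would smooth this further, but PAVA already shows the idea: the fit can only go down as tasks get longer, so improving on easy tasks can't lower the estimated 80% horizon.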
Time horizons at 99%+ reliability levels cannot be fit at all without much larger and higher-quality benchmarks.
Measuring 99% time horizons would require ~300 highly diverse tasks in each time bucket. If the tasks are not highly diverse and realistic, we could fail to sample the type of task that would trip up the AI in actual use.
The tasks also need <<1% label noise. If they’re broken/unfair/have label noise, the benchmark could saturate at 98% and we would estimate the 99% time horizon of every model to be zero.
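Two quick checks on the "~300 tasks per bucket" intuition (my own arithmetic, not METR's exact power calculation):

```python
import math

# (1) Rule of three: if a model succeeds on all n tasks in a bucket,
# the 95% upper bound on its failure rate is roughly 3/n, so n ~ 300
# is about what you need to certify >=99% reliability.
n = 300
upper_failure = 3 / n                    # = 0.01

# (2) Binomial standard error at p = 0.99 with n = 300: barely enough
# to distinguish 99% from, say, 97% reliability.
se = math.sqrt(0.99 * 0.01 / n)          # ~ 0.0057
```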
Speculating about the effects of a months- or years-long time horizon is fraught.
The distribution of tasks from which the suite is drawn is not super well-defined, so different reasonable extrapolations could give quite different time-horizon trends.
One example: all of the tasks in METR-HRS are self-contained, whereas most months-long tasks humans do require collaboration.
If an AI has a 3-year time horizon, does this mean an AI can competently substitute for a human for a 3-year long project with the same level of feedback from a manager, or be able to do the human’s job completely independently? We have no tasks involving multi-turn interaction with a human so there is no right answer.
There is a good argument that AGI would have an infinite time horizon and so time horizon will eventually start growing superexponentially. However, the AI Futures timelines model is highly sensitive to exactly how superexponential future time horizon growth will be, which we have little data on. This parameter, “Doubling Difficulty Growth Factor”, can change the date of the first Automated Coder AI between 2028 and 2050.
Despite these limitations, what conclusions do I still stand by?
The most important numbers to estimate were the slope of the long-run trend (one doubling every 6-7 months) and a linear extrapolation of this trend predicting when AIs would reach 1 month / 167 working hours time horizon (2030), not the exact time horizon of any particular model. I think the paper did well here.
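The extrapolation itself is one line of arithmetic. Below is a sketch with illustrative anchor values (not METR's fit, which used early-2025 anchors and a 6-7 month doubling time to land around 2030); the point is how sensitive the answer is to the doubling time:

```python
import math
from datetime import date, timedelta

# Illustrative anchors (assumptions, not published values)
anchor_date = date(2025, 12, 1)
anchor_horizon_h = 5.0      # rough 50% horizon of a frontier model
doubling_months = 7
target_h = 167.0            # one working month

doublings = math.log2(target_h / anchor_horizon_h)
eta = anchor_date + timedelta(days=doublings * doubling_months * 30.44)
# ~5 doublings; shifting the doubling time by a month or two moves
# the arrival date by the better part of a year
```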
Throughout the project we did the least work we could to establish a sufficiently robust result, because task construction and baselining were both super expensive. As a result, the data are insufficient to do some secondary and subset analyses. I still think it’s fine but have increasing worries as the benchmark nears saturation.
Without SWAA the error bars are super wide, and SWAA is lower quality than some easy (non-software) benchmarks like GSM8k. This might seem worrying, but it’s fine because it doesn’t actually matter for the result whether GPT-2’s time horizon is 0.5 seconds or 3 seconds; the slope of the trend is pretty similar. All that matters is that we can estimate it at all with a benchmark that isn’t super biased.
Some tasks have time estimates rather than actual human baselines, and the tasks that do have baselines have few of them. This is statistically ok because in our sensitivity analysis, adding IID baseline noise had minimal impact on the results, and the range of task lengths (spanning roughly a factor of 10,000) means that even baselining error correlated with task length wouldn’t affect the doubling time much.
However, Tom Cunningham points out that most of the longer tasks don’t have baselines, so if we’re systematically over/under-estimating the length of long tasks we could be misjudging the degree of acceleration in 2025.
The paper had a small number of tasks (only ~170) because we prioritized quality over quantity. The dataset size was originally fine but is now becoming a problem as we lack longer, 2h+ tasks to evaluate future models.
I think we’re planning to update the task suite soon to include most of the HCAST tasks (the original paper had only a subset) plus some new tasks. Beyond this, we have various plans to continue measuring AI capabilities, both through benchmarks and other means like RCTs.
[1] see eg DeepSeek R1 paper: https://arxiv.org/abs/2501.12948
No background, but it’s plausible to me that they actively prefer imperfect alignment because companies that care about alignment will tend to be woke, moralizing, or opposed to authoritarianism.
In the specific hedonist vs Christian case, aren’t there two obvious compromises?
“One trillion year reign of the CEV of Jesus of Nazareth over the multiverse”
“Entire mass of universe converted to nervous tissue experiencing euphoric union with God in His loving grace”
There are three cost sources: materials, design, and manufacturing. I claim that because phones are small and expensive to design it’s just not worth it.
Materials: Phones have very high $/kg already. At $5,000/kg, you could launch them to space and only double the cost. So unless there’s a better battery design made of pure gold, phones can already afford basically whatever materials are optimal. Even a $4 million hypercar costs less per kg than an iPhone Pro.
IP/design/software is zero marginal cost. It doesn’t make sense for companies to spend $10 billion designing an OS for its premium users only. Better to segment on hardware.
Manufacturing: Assembly just isn’t that expensive. As for components, it’s impossible to make a phone with chips more than one process node ahead of average. So you get $500 phones that are one node behind, and $1500 phones that are at the frontier and 20% faster, and that’s the limit.
I don’t really think progress being fast is quite sufficient to explain things. If every tripling in price could get you a 20% better phone, many people would pay for an 81x more expensive, 107% better phone just like they do for private jets. If it became obsolete in a year, the billionaire would have their phone guy replace it every six months. I think it’s because a phone, or gmail, starlink, etc. has super high fixed costs, and the smaller market for luxury goods generally means the design will be worse. For food or furniture better quality ingredients/materials can more than make up for it, but consumer electronics can only get slightly better components because most already designed things are within the cost/kg budget of a $1k phone.
Everything more expensive per kg than phones has an interesting way of funding their design costs without scale:
Military tech like fighter jets, radar systems: Cost mainly from R&D. The military demands the highest performance and is a large market, spending 2-5% of GDP on fairly small volume
Pharmaceuticals: Cost mainly from R&D. Either large scale or literally lifesaving for rich people
Luxury watches: Practical benefits of a luxury watch are nil but people buy them anyway for status and wonder, essentially paying watchmakers to arrange titanium and rubies into ever more elaborate shapes that happen to tell time.
So my prediction would be that the luxury smartphone business only starts up when lots of rich people have different problems from the average consumer that need a custom device (security maybe?) or subscribe to a status game that results in the phone equivalent of luxury watches.
Money can be exchanged for goods and services in under a 10-year timeframe, including AI hardware and talent. To come out ahead from trade they just need to invest more of their surplus in AI than the US does.
The #1 issue for the Chinese public is the economy, just like in the US. This includes people with influence on CCP decision-making.
Seems unlikely to me this would be in their interest.
Regarding chips, the US just started selling them H200s, and if they invade Taiwan they probably destroy TSMC and immediately lose their main source of compute. The majority of the semiconductor industry outside Taiwan would still not be friendly to China.
They also lose 50% of their trade even if the US continues trading with them, which tanks their GDP by something like 10% [1]. If the US sanctions them too, it would be 60-70% trade reduction.
Militarily, they are also building up faster than the US, and if they invade, Japan/SK will militarize and might get nuclear weapons.
If they really believed in RSI they would do diplomacy to get more compute and just invest in their AI industry.
[1] Claude thinks the likely outcome is this:
Lost trade: $2.4-2.8 trillion (US, EU, Japan, UK, Canada, Australia, possibly South Korea/Taiwan)
Continuing trade: $2.0-2.4 trillion (Russia, most ASEAN, Brazil, India, Middle East, Africa)
Net trade: $2.0-2.4T vs previous $6.2T = 60-70% reduction
Sure, then just add that to the disclaimer. “I may omit claims that are risky, unpopular, may be easily misinterpreted, require lots of words to justify, etc, but aim to not outright lie for such reasons”
It seems super costly for many public intellectuals to say all of their beliefs, and for reasons that other commenters have pointed out, giving an epistemic status might not help. What’s wrong with a blanket disclaimer like “Assume that all of my claims without an epistemic status are optimized to improve discourse on the margin, rather than to convey a complete picture of my all-things-considered beliefs”?
I think IBKR might let Americans trade HKEX index options out to 2030 but it would probably be a hassle. Otherwise there are options on FXI, a Chinese large-cap ETF, which look less liquid than SPX options and only go out 2 years with an IV of 32% or so. I don’t think FXI is worth it because China isn’t in a position to get AGI in the next 2 years if the US doesn’t.
(in the usual units, this means that the plot of log(2025-FLOP per FLOP) vs log(researcher-hours) is a straight line with slope .) A plot that curves downward or “hits a wall” seems like evidence against this model’s applicability to the data.
Note there are no log-log plots in the data. They’re performance vs LoC and log(performance) vs LoC, and same for stars. I don’t think we’re at an absolute ceiling since two more improvements came out in the past week, they’ve just gotten smaller and taken more code to implement.
I need to think about this algorithmic progress being 10x/year thing. It feels like some assumptions are violated with how much the data seem to give inconsistent answers, maybe there’s a prospective vs retrospective difference. Or do you think progress has just sped up in the past couple of years?
After @Daniel Kokotajlo invited me to the AI Futures office I ended up talking to Eli and Alex for about an hour, and feel like I have a decent understanding of the model:
Compute and effective compute
Actual compute is stock of compute at time t
Effective compute is used as the main measure of AI capabilities. It is defined as the “amount of training compute we’d need to train models as performant as the frontier models at time t using the training process of the present-day”.
Compute is allocated as fixed percentages between training, experiments, and automated coders
Effective labor
The % of tasks automatable is a logistic function of log effective compute
Once a task can be automated, it will still get more efficient over time by a multiplier
The efficiency multiplier is zero for non-automated tasks. When effective compute reaches the level required to automate a task, the multiplier increases as a power law.
Human coding labor and automation compute are optimally allocated between tasks
Overall coding labor for task i is the sum of human and AI labor
Aggregate coding labor is CES between the labor applied to all different tasks, with low substitutability by default, meaning tasks only substitute slightly for each other
Finally, serial coding labor is a sublinear power of parallel coding labor, indicating diminishing returns to adding more labor in parallel
“Experiment throughput” is CES between serial coding labor and experiment compute
Labor and compute are slight complements
There are also diminishing returns to compute
Research taste
Human research taste is lognormally distributed with median researchers defined as 1x taste and 99.9th percentile (+3.1SD) researchers assumed to have 3.70x research taste
An Automated Coder–level AI has research taste
AI research taste increases as a power law in effective compute (AI “research taste IQ” is standard deviations above the human median, which is then passed through an exponential to get research taste)
AIs replace whatever humans they’re better than. The aggregate research taste of the company is the mean of all remaining researchers. This means it initially increases slowly as AIs replace the worst researchers, then speeds up as everyone starts using the AIs’ research taste which keeps improving.
Research effort RE(t) = research taste * experiment throughput
Then software efficiency follows the Jones model
A diminishing-returns parameter determines how much harder AI R&D gets as software efficiency advances
Finally this feeds back into effective compute
A taste-only singularity happens when research taste grows by more than one doubling per doubling in effective compute. This would cause improvements to go faster and faster until approaching physical limits. Eli’s parameter choices give 38% chance of a taste-only singularity, but many of the non-singularity samples still get to ASI quickly, with the 50th percentile sample getting from AC to ASI in 5 years.
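The replacement dynamic in the research-taste section above can be sketched numerically (my reconstruction from the description, not the model's actual code; all parameter values are taken from the bullets above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Lognormal human taste: median 1x, and the 99.9th percentile
# (+3.1 SD) at 3.70x pins down the per-SD multiplier.
sigma = np.log(3.70) / 3.1
humans = np.exp(sigma * rng.standard_normal(10_000))

def aggregate_taste(ai_taste: float) -> float:
    # Every human below the AI's taste level is replaced by the AI;
    # aggregate taste is the mean over all remaining researchers
    # (human or AI).
    return float(np.maximum(humans, ai_taste).mean())
```

Evaluating `aggregate_taste` at increasing AI taste levels shows the described shape: replacing the worst researchers barely moves the mean at first, then aggregate taste tracks the AI directly once it passes the best humans.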
For various reasons Eli and Daniel’s all-things-considered views have harder takeoff than the model predicts, with Eli’s median for AC → ASI 2 years, and Daniel’s median 1.5 years.
Time to AC is very sensitive to how superexponential time horizon growth is, and also to
The present doubling time
Time horizon for automated coder
Time from AC to ASI is very sensitive to the “automated research taste slope”: how much “research IQ” AIs gain per doubling of effective training compute. But many other factors could stretch the AC-to-ASI duration to >6.5 years:
Median-to-top-human jumps above SAR needed to reach TED-AI
The software efficiency growth rate in 2024
Median to 99.9th% human research taste multiplier
Slowdown from 10x less experiment compute
Research progress rate in the limit of infinite coding labor: mostly because it’s highly uncertain (their 90% CI is 2.0-201)
Automated research taste of an AC
(not necessarily that I disagree, just need to think about it more)
Effective compute vs time horizon: how do all the assumptions look when we eliminate time horizon from the model and use other methods to model effective compute growth? I’m sketched out by the huge error bars on time horizon superexponentiality → time to AC
Ryan thinks >70% of code at Anthropic was written by AIs already in October 2025 but it’s mostly low-value code. Code varies dramatically in value, and AIs can expand the number and type of low-value tasks done rather than just substituting for humans. This may be a separate effect from AIs doing extra work on tasks that can be automated, which is not tracked by the model.
It might be that coding ability and research taste are two ends of a continuous spectrum from small-scale to large-scale tasks.
Research taste:
Someone really needs to do experiments on this, it’s possible now. David Rein and I are actively thinking about it
Is human research taste modeled correctly? Eg it seems likely to me that the 0.3% of top humans add more than 0.3%*3.7x to the “aggregate research taste” of a lab because they can set research directions. There are maybe more faithful ways to model it; all the ones Eli mentioned seemed far more complicated.
Is modeling AI research taste as exponential in human standard deviations valid? I have no idea whether someone 9 standard deviations above the human median would be able to find 3.7^(9/3) = 50x better research ideas or not
Is CES valid for experiment throughput at these extreme values of labor and compute? It seems like a superhuman AI researcher might learn to run experiments more efficiently, decreasing the compute required for each experiment. The estimates for experiment throughput parameters were all about humans getting 10x compute, infinite labor, etc. Or, they could coordinate better (especially with all the human ex-coders to help them), and decrease the parallelization penalties for labor and/or compute. I’m not sure if this would be different from adjusting research taste.
Yes, buying volatility is intentional. If I thought more I would fine-tune things, but it’s not so important to gain 20% when SPY goes up 10%, because that probably doesn’t mean loss of your future salary.
I should clarify that I mean closer to 0.2 years of salary than 10% of whatever your net worth is, if you just want to hedge your automation risk, given the potential loss is a fixed ~10 years of salary. On second thought it should maybe be less than this due to various factors. To give a proper recommendation I would have to do some math, which I might do if this becomes a longform.
Thoughts in no particular order:
Kudos for what seems to be lots of thoughtful work incorporated into this model.
There are a lot of parameters. Maybe this is necessary but it’s a bit overwhelming and requires me to trust whoever estimated the parameters, as well as the modeling choices.
I couldn’t find a block of equations that represents the whole model, or an index of variables in one place, and it’s difficult to go between math and exposition especially when the equations are hidden in dropdowns, so I still feel like I don’t have a good picture. I had Claude do this and read through it, and it looks reasonable but some parts are still not defined in Claude’s summary, I think because the whole page is rendered in javascript and it couldn’t access it. I would love to visit the AI Futures office again to understand the model better.
I find the use of time horizon as such a crucial intermediate variable confusing and am scared of potential assumptions around it.
Time horizon is underdefined on years long tasks. I know I talked to the AI Futures team about what you now wrote up as METR-HRS-Extended to operationalize it, but it’s unclear what a 3-year time horizon really means (when extrapolating the trend with superexponential adjustment) given factors like the increased number of details and interaction with longer tasks. Does the trend mean that in X years, an AI will competently substitute for a human for a 3-year long project with the same level of feedback from a manager, or be able to do the human’s job with less feedback?
The function that relates time horizon and research speedup in the real world is very unclear. I’m trying to collect data on this and it’s highly nontrivial to model and interpret[1] so I’m skeptical of any use of the time horizon trend to predict uplift that doesn’t either have a simple, robust model or something validated by experiment.
The description implies that time horizon is only used in the first phase (pre-AC), but I don’t understand how it’s used. Humans and AIs will probably go from their current state to perfectly substitutable in some kind of continuous progression and I couldn’t find the formula for this. Also when I changed the “How much easier/harder each coding time horizon doubling gets” parameter by small amounts, the forecasted time from AC to ASI changes significantly (2.7 years at 0.90, over 4 years for 1.00), so it looks like stages 2 and 3 are affected as well.
It seems to me like a time horizon of 3 years or 125 years is complete overkill for automation of enough coding for the bottleneck to shift to experiments and research taste.
Why not assume that compute is allocated optimally between experiment, inference, etc. rather than assuming things about the behavior of AI companies?
I wish the interface updated faster, closer to 100ms than 1s, but this isn’t a big deal. I can believe it’s hard to speed up code that integrates these differential equations many times per user interaction.
Eg looking at transcripts to determine where humans are spending their time when they give Cursor tasks of a certain length
Maybe it could raise interest rates, but I also have TLT (long dated treasury bonds) put options for this possibility. TLT has a duration of ~16 years, so if the interest rate goes from 4.9% to 15%, TLT will crash by ~65%. Also, when full automation actually happens, stocks will go up even if they went down slightly due to expectations of automation.
Most people should buy long-dated call options:
If you’re early career, have a stable job, and have more than ~3 months of savings but not enough to retire, then lifecycle investing already recommends investing very aggressively with leverage (e.g. 2x the S&P 500). This is not speculation; it decreases risk by diversifying over time. The idea is that as a 30-year-old, most of your wealth is still in your future wages, which are only weakly correlated with the stock market, so 2x leverage on your relatively small savings now might still mean under 1x leverage on your effective lifetime portfolio.
In 2026, most of your long-term financial risk comes from your job being automated, which will plausibly happen in the next 5 years. If this happens, your salary will go to zero while the S&P 500 will probably at least double (assuming no AI takeover) [1]. If automation takes 20 years, the present value of your future income is ~10 years of salary. This makes exposure to the market (beta) extremely important. If you have 2 years of salary saved, the required leverage just to break even whether automation takes 5 years or 20 is something like 4x.
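A back-of-envelope version of that break-even claim (illustrative numbers from the paragraph above, my arithmetic):

```python
# Scenario A: automation in ~5 years; salary goes to zero and the
# S&P 500 at least doubles. Scenario B: automation in ~20 years;
# present value of future income is ~10 years of salary.
savings = 2.0         # years of salary currently saved
income_pv = 10.0      # PV of future income if automation is slow
market_return = 1.0   # market doubles under fast automation

# Leverage L such that savings * (1 + L * market_return) replaces the
# lost income PV, i.e. you end up equally well off in both scenarios.
L = (income_pv / savings - 1) / market_return   # 4.0
```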
However, we can do better; betting that a price movement will happen in a defined time frame is exactly what options are for. We want them to be as long-dated as possible because the market basically expects the economy to be normal forever. So what to actually buy?
This contract will profit if SPY (which tracks the S&P 500) goes up by more than ~50% in the next 2.5 years. The implied volatility is only 15.4%, which means you don’t lose much in expectation if historical trends hold, and the spreads are tight enough that it’s easy to buy. Most likely it will expire worthless, but if SPY somehow doubles it will return 80x [2]. [edit: Dec 2028 options are now out, which seem better]
To buy them, you need to get options approval from your brokerage (usually level 1 or 2 for long call options), then search for SPY options, select a call option with a strike price of 30-50% above the current price and an expiration date as far in the future as possible, check that the bid-ask spread is acceptably narrow (<10% or so), and submit a market order while the market is open (9:30am-4pm ET weekdays).
What to do with them afterwards is out of scope of this post, but there’s no need to even look at them more than 1-2 times per year.
These are SPX index options, which are somewhat tricky to buy [3] but are available 5-6 years out and are therefore far better if you put significant probability on automation between 2.5 and 6 years from now.
Currently I have 30% of my net worth in 2-year SPY options, 30% in semiconductors and the rest in Wealthfront for tax loss harvesting. This is partly for speculation, but it seems reasonable for most people with 2 years of savings to have 10% of their net worth in SPY options or 20% in SPX options [4] for hedging purposes alone.
[1] The S&P 500 is 80% of the US stock market, so probably captures most of the gains of automation. This would fail if most of the gains go to private companies, or if the economy is automated but still only grows at like 10%/year
[2] at current prices, SPY is 683, so if it doubles to 1366 the option will be worth (1366-1000)/4.47 = 81.8x
[3] Spreads are super wide so to get better prices you would want to make a limit order at midpoint and increase the price over a few days. Also not every broker will let you buy them (Fidelity and IBKR work but not Schwab) and 40% of all gains are taxed as short term, unless you buy them in an IRA account.
[4] Higher for SPX options because they’re longer dated, so you don’t need to roll them forward as frequently.
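Footnote [2]'s payoff arithmetic, spelled out (all numbers from the post):

```python
premium = 4.47     # price paid per unit of the call option
spy = 683.0        # SPY at time of writing
strike = 1000.0    # ~46% above spot

# If SPY doubles before expiry, the option pays spot minus strike
payoff_if_doubled = (2 * spy - strike) / premium   # ~81.9x
```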
They do have a GPT-2 medium track, which has improved by 20.0x from 5.8 hours to 17.35 minutes. My guess is the speedup isn’t greater because the scale is barely larger (350M, which is only a 2.8x increase vs the ~1000x to current frontier models) and less effort has been applied. Nevertheless someone should try applying improvements from other open-source models to this track and see if they can get the ratio to >23x.
I didn’t really define software intelligence explosion, but had something in mind like “self-reinforcing gains from automated research causing capabilities gains in 6 months to be faster than the labor/compute scaleup-driven gains in the 3 years from 2023-2025”, and the question I was targeting with the second part was “After the initial speed-up from ASARA, does the pace of progress accelerate or decelerate as AI progress feeds back on itself?”
A 23.5x improvement alone seems like it would qualify as a major explosion if it happened in a short enough period of time
Seems about true. I claim that the nanogpt speedrun suggests this is only likely if future AI labor is exponentially faster at doing research than current humans, with many caveats of course, and I don’t really have an opinion on that.
We already know that there is of course a fundamental limit to how fast you can make an algorithm, so the question is always “how close to optimal are current algorithms”. It should be our very strong prior that any small subset of frontier model training will hit diminishing returns much quicker than the complete whole.
This is not as small a subset of training as you might think. The 53 optimizations in the nanogpt speedrun touched basically every part of the model, including the optimizer, embeddings, attention, other architectural details, quantization, hyperparameters, code optimizations, and Pytorch version. The main two things that limit a comparison to frontier AI are scale and data improvement. It’s known there are many tricks that work at large scale but not at small scale. If you believe the initial 15x speedup is analogous and that the larger scale gives you a faster, then maybe we get something like a 100x speedup atop our current algorithms? But I don’t really believe that the original nanoGPT, which was a 300-line repo written to be readable rather than efficient [1], is analogous to our current state. If there were a bunch of low-hanging fruit that could give strongly superlinear returns, we would see 3x/year efficiency gains with small increases in labor or compute over time, but we actually require 5x/year compute increase and ~3x per year labor increase.
A software intelligence explosion is completely possible with linear speedups in cumulative effort. Indeed, it is possible with sublinear increases in cumulative effort.
Agree I was being a bit sloppy here. The derivative being infinite is not what matters in Davidson’s model or in my mind; what matters is whether the pace of progress accelerates or decelerates. Progress could still be very fast while decelerating, but I’m not really thinking in enough detail to model these borderline cases, so maybe we should think of the threshold for very fast software-driven progress as r > 0.75 or something rather than r > 1.
Diminishing returns in the NanoGPT speedrun:
To determine whether we’re heading for a software intelligence explosion, one key variable is how much harder algorithmic improvement gets over time. Luckily someone made the NanoGPT speedrun, a repo where people try to minimize the amount of time on 8x H100s required to train GPT-2 124M down to 3.28 loss. The record has improved from 45 minutes in mid-2024 down to 1.92 minutes today, a 23.5x speedup. This does not give the whole picture—the bulk of my uncertainty is in other variables—but given this is existing data it’s worth looking at.
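As a quick sanity check on these numbers, the record improvement implies a per-year rate (the ~1.5-year window is my assumption from the mid-2024 start; exact dates would shift the rate somewhat):

```python
# Implied rate of the NanoGPT speedrun record, using the figures from the text.
start_minutes = 45.0    # record in mid-2024
current_minutes = 1.92  # record today
years = 1.5             # assumed elapsed time

speedup = start_minutes / current_minutes
annual = speedup ** (1 / years)
print(f"total speedup: {speedup:.1f}x")      # ~23.4x
print(f"implied rate: ~{annual:.1f}x per year")
```

This ~8x/year is faster than Epoch’s ~3x/year frontier estimate, which is what you’d expect for a small, initially unoptimized repo.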
I only spent a couple of hours looking at the data [3], but there seem to be sharply diminishing marginal returns, which is some evidence against a software-only singularity.
At first, improvements were easy to make without increasing lines of code much, but later improvements became smaller while the LoC required grew larger and larger, which means very strong diminishing returns: speedup is actually sublinear in lines of code. This could be an artifact of the very large elbow early on, but I mostly believe it.
If we instead look at number of stars as a proxy for amount of attention on the project [4], there are no diminishing returns. The data basically suggest speedup is linear in effort [1], which is consistent with a world where 3x/year increases in labor and compute are required to sustain the historical trend of ~3x/year algorithmic speedups observed by Epoch. However, this still points against a software intelligence explosion, which would require superlinear speedups for linear increases in cumulative effort.
Given that the speedup-vs-stars and speedup-vs-improvement-# graphs are linear but speedup-vs-LoC is sublinear, our guess should be that returns to research output are somewhat sublinear. In the language of Davidson’s semi-endogenous growth model, this means r < 1 [2]. Of course there are massive caveats about extrapolation to future models.
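The kind of estimate described here amounts to fitting a power law speedup ≈ effort^r on a log-log scale. A minimal sketch, where the data points are hypothetical placeholders rather than the actual speedrun measurements:

```python
# Sketch: estimating the returns exponent r from a log-log fit of
# speedup vs cumulative effort (speedup ≈ effort^r).
# The data below is hypothetical, chosen only to illustrate r < 1.
import math

effort = [1, 2, 4, 8, 16, 32]          # cumulative effort (arbitrary units)
speedup = [1, 1.8, 3.1, 5.2, 8.6, 14]  # hypothetical record speedups

xs = [math.log(e) for e in effort]
ys = [math.log(s) for s in speedup]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# Ordinary least-squares slope in log-log space = fitted exponent r.
r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
print(f"fitted r ≈ {r:.2f}")  # below 1 for this data, i.e. sublinear returns
```

With real data the same fit applied to LoC vs stars vs improvement-# is what distinguishes the sublinear and linear cases above.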
In Davidson’s model, the requirement for a software intelligence explosion after research is automated is rλp > 1, where λ ≤ 1 represents the inefficiency of parallel work and p ≤ 1 is the elasticity of research output to cognitive labor at a fixed compute budget. If r < 1, this mathematically means rλp < 1 and we don’t get an SIE.
So I think an SIE will only happen if one or more of the below is true:
The cognitive quality of future AIs being so high that one hour of AI research is equivalent to exponentially large quantities of human researcher-hours, even if the AIs don’t train much more efficiently to the same capability level. This is the most important question to answer and something I hope METR does experiments on in Q1.
Some other difference between the NanoGPT speedrun and frontier AI research, e.g. maybe research ability is easier to improve than base model loss
Once we get AGI, something other than human labor, AI labor, or compute scales exponentially on a much faster timescale, e.g. training data
A paradigm shift to a totally different architecture that wouldn’t be captured in a dataset only 1.5 years long
AIs with 2x the efficiency causing research to happen more than 2x faster, outweighing both the diminishing returns from parallel work AND the compute bottleneck, making λp > 1. I can’t think of a way this could happen, and since r < 1, to get an SIE the product would have to be even larger than 1 (λp > 1/r).
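The role of the rλp > 1 threshold can be seen in a toy simulation. With software level S producing research labor ∝ S, a parallelization penalty λ, labor elasticity p, and S = (cumulative effort)^r, the feedback loop gives dS/dt ∝ S^(λp + 1 − 1/r), which blows up in finite time iff rλp > 1. The functional form and parameter values here are illustrative, not estimates:

```python
# Toy Davidson-style feedback loop: explosion iff r*λ*p > 1.
def simulate(r, lam, p, steps=2000, dt=0.01):
    """Euler-integrate dS/dt = S^(λp + 1 - 1/r) from S = 1."""
    S = 1.0
    exponent = lam * p + 1 - 1 / r
    for _ in range(steps):
        S += dt * S ** exponent
        if S > 1e12:              # treat as "exploded"
            return float("inf")
    return S

sub = simulate(r=0.8, lam=0.7, p=0.9)  # rλp ≈ 0.50: grows but decelerates
sup = simulate(r=1.5, lam=0.9, p=0.9)  # rλp ≈ 1.22: finite-time blowup
print("rλp < 1:", sub)
print("rλp > 1:", sup)
```

The qualitative behavior (bounded polynomial growth vs blowup) depends only on which side of 1 the product rλp falls.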
[1]: This was previously observed in a tweet from Epoch in February but now we have about twice the data.
[2]: r = ∞ would mean exponential improvements, while r = 1 implies linear improvement over time at constant labor/compute. So r < 1 means improvements are actually slower than linear.
[3]: A few minutes ideating, almost an hour writing a prompt for Claude 4.5 Opus, then 30 minutes making graphs and such.
[4]: It’s unclear whether to say that stars represent instantaneous effort or total cumulative effort on the project. If we interpret them as instantaneous effort, then we would see diminishing returns. Also it’s unclear whether stars are measuring L or L^λ; if L, it might imply slightly increasing returns.
Inducing sexual arousal seems like a better equilibrium, as long as everyone consents. It has positive valence roughly proportional to ΔHR, solves gender ratio problems and incentivizes people to learn effective flirting.
Disagree; LW is not an academic journal for rationality either. The best content should go in the top 50 whether it’s satire or not.