After @Daniel Kokotajlo invited me to the AI Futures office, I ended up talking to Eli and Alex for about an hour, and I feel like I have a decent understanding of the model:
Summary of the AI Futures Model
Compute and effective compute
Actual compute C(t) is the stock of compute at time t
Effective compute E(t) := C(t) · software efficiency is used as the main measure of AI capabilities. It is defined as the "amount of training compute we'd need to train models as performant as the frontier models at time t using the training process of the present day".
E(t) is estimated by relating it to time horizon.
Compute is allocated as fixed percentages between training, experiments, and automated coders
Effective labor
The % of tasks automatable is a logistic function of log effective compute E(t)
Once a task can be automated, it still gets more efficient over time by a multiplier η_i(t)
η_i(t) is zero for non-automated tasks; once effective compute reaches the level E_i required to automate task i, it grows as a power law η_i(t) = η_init · (E(t)/E_i)^η_slope
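To make these mechanics concrete, here is a minimal Python sketch of the two pieces above (my own illustration, not the model's code); E_mid, steepness, eta_init, and eta_slope are placeholder values, not the model's fitted medians:

```python
import numpy as np

def automatable_fraction(E, E_mid=1e28, steepness=1.0):
    """Share of coding tasks automatable: logistic in log effective compute.
    E_mid (the 50%-automation level) and steepness are illustrative only."""
    return 1.0 / (1.0 + np.exp(-steepness * (np.log10(E) - np.log10(E_mid))))

def automation_efficiency(E, E_i, eta_init=0.1, eta_slope=0.5):
    """eta_i(t): compute-to-labor efficiency for task i. Zero until effective
    compute reaches the task's automation threshold E_i, then a power law in
    how far past the threshold we are."""
    return np.where(E < E_i, 0.0, eta_init * (E / E_i) ** eta_slope)
```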
Human coding labor L_C,H(t) and automation compute C_aut,i are optimally allocated between tasks
Overall coding labor for task i is the sum of human and AI labor: G_i = L_C,H,i + η_i · C_aut,i
Aggregate coding labor L_C(t) is a CES aggregate of the labor applied to all the different tasks, with low substitutability ρ_c = −2 by default, meaning tasks only substitute slightly for each other:
L_C(t) = (∫_0^1 G_i(t)^ρ_c di)^(1/ρ_c) = (∫_0^1 (L_C,H,i(t) + η_i(t) · C_aut,i(t))^ρ_c di)^(1/ρ_c)
Finally, serial coding labor L̃_C(t) = L_C(t)^λ, indicating diminishing returns of about λ = 0.5 to adding more labor in parallel
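A sketch of the aggregation, discretizing the task continuum so the integral over i ∈ [0,1] becomes a mean; the allocation across tasks is taken as given here rather than optimized as in the model:

```python
import numpy as np

def serial_coding_labor(L_human, C_aut, eta, rho_c=-2.0, lam=0.5):
    """CES aggregate of per-task labor G_i = L_human_i + eta_i * C_aut_i,
    followed by the parallelization penalty L_C**lam. Inputs are arrays over
    tasks and must be strictly positive: with rho_c < 0, any task receiving
    ~zero labor drags the whole aggregate toward zero (weak substitutability)."""
    G = L_human + eta * C_aut
    L_C = np.mean(G ** rho_c) ** (1.0 / rho_c)
    return L_C ** lam

# Example: 100 tasks, humans spread evenly, AI compute useful only on the
# first 30 (automated) tasks
n = 100
eta = np.where(np.arange(n) < 30, 0.5, 0.0)
print(serial_coding_labor(np.full(n, 1.0), np.full(n, 2.0), eta))
```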
“Experiment throughput” X(t) is CES between serial coding labor and experiment compute
X(t) = (α · C̃_xpm(t)^ρ_x + (1 − α) · L̃_C(t)^ρ_x)^(1/ρ_x), with 0 < α < 1 and ρ_x < 0
Labor and compute are slight complements (median estimate ρ_x = −0.155)
There are also diminishing returns to compute, with C̃_xpm = C_xpm^ζ, where ζ = 0.65 by default
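And the throughput step as a sketch; ρ_x and ζ use the defaults quoted above, while α = 0.5 is a placeholder:

```python
def experiment_throughput(C_xpm, L_serial, alpha=0.5, rho_x=-0.155, zeta=0.65):
    """CES combination of experiment compute (after diminishing returns) and
    serial coding labor. rho_x < 0 makes them mild complements: starving
    either input caps throughput regardless of how much you have of the other."""
    C_tilde = C_xpm ** zeta
    return (alpha * C_tilde ** rho_x
            + (1.0 - alpha) * L_serial ** rho_x) ** (1.0 / rho_x)
```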
Research taste T(t)
Human research taste is lognormally distributed, with median researchers defined as 1x taste and 99.9th-percentile (+3.1 SD) researchers assumed to have 3.7x research taste
An Automated Coder–level AI has research taste T_AC
AI research taste increases as a power law in effective compute (the AI's "research taste IQ" is T_rate · log E(t) + const standard deviations above the human median, which is then passed through an exponential to get research taste)
AIs replace whichever humans they outperform. The aggregate research taste of the company is the mean taste of all remaining researchers. This means aggregate taste initially increases slowly, as AIs replace only the worst researchers, then speeds up once everyone is using the AIs' ever-improving research taste.
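A sketch of the taste machinery under the stated assumptions (lognormal human taste anchored at 3.7x for +3.1 SD); the AI's SD level is treated as an input here rather than derived from T_rate · log E(t) + const:

```python
import numpy as np

SD_AT_TOP = 3.1         # 99.9th percentile sits +3.1 SD above the median
TASTE_AT_TOP = 3.7      # assumed taste multiple of a +3.1 SD researcher
LOG_TASTE_PER_SD = np.log(TASTE_AT_TOP) / SD_AT_TOP

def taste_from_sd(sd):
    """Exponential map from SDs above the human median to a taste multiplier."""
    return np.exp(LOG_TASTE_PER_SD * sd)

def company_taste(ai_sd, n_humans=1000, seed=0):
    """Aggregate (mean) research taste after AIs replace every human they beat."""
    rng = np.random.default_rng(seed)
    human_taste = taste_from_sd(rng.standard_normal(n_humans))
    ai_taste = taste_from_sd(ai_sd)
    # Replaced humans' seats get the AI's taste; better humans keep their own.
    return np.mean(np.maximum(human_taste, ai_taste))
```

As ai_sd rises, the mean at first barely moves (only below-median researchers get replaced); once the AI clears the best human, aggregate taste is just the AI's own, growing exponentially in ai_sd, which is the slow-then-fast dynamic described above.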
Research effort RE(t) = research taste T(t) × experiment throughput X(t)
Then software efficiency S(t) follows the Jones model: Ṡ = S^(1−β) · RE(t)
β is how much harder AI R&D gets as software efficiency advances
Finally, this feeds back into effective compute: E(t) = C_train(t) · S(t)
A taste-only singularity happens when m>β, where m = doublings of research taste per doubling in effective compute. This would cause improvements to go faster and faster until approaching physical limits. Eli’s parameter choices give 38% chance of taste-only singularity, but many of the non-singularity samples still get to ASI quickly, with the 50th percentile sample getting from AC to ASI in 5 years.
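The m > β condition can be sanity-checked numerically. A minimal sketch, collapsing the taste feedback into RE ∝ S^m (training compute held fixed, so doublings of S are doublings of E); k and all parameter values are illustrative:

```python
def simulate_S(beta, m, S0=1.0, k=1.0, dt=1e-3, t_max=10.0, cap=1e30):
    """Euler-integrate the Jones law dS/dt = S**(1 - beta) * RE with the
    taste-only feedback RE = k * S**m, i.e. dS/dt = k * S**(1 - beta + m).
    Exponent > 1 (m > beta) gives finite-time blowup; < 1 gives polynomial
    growth; exactly 1 gives plain exponential growth."""
    S, t = S0, 0.0
    while t < t_max and S < cap:
        S += k * S ** (1.0 - beta + m) * dt
        t += dt
    return t, S

print(simulate_S(beta=0.5, m=0.7))  # singularity: S hits the cap around t ~ 5
print(simulate_S(beta=0.5, m=0.3))  # no singularity: S(10) ~ (1 + 0.2*10)**5
```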
For various reasons, Eli's and Daniel's all-things-considered views involve harder takeoff than the model predicts: Eli's median for AC → ASI is 2 years, and Daniel's is 1.5 years.
Notes on Sensitivity analysis
Time to AC is very sensitive to how superexponential time horizon growth is, and also to:
The present doubling time
Time horizon for automated coder
Time from AC to ASI is very sensitive to the "automated research taste slope" T_rate: how much "research IQ" AIs gain per doubling of effective training compute. But many other factors could stretch the AC-to-ASI duration to >6.5 years:
Median-to-top-human jumps above SAR needed to reach TED-AI
The software efficiency growth rate in 2024
Median-to-99.9th-percentile human research taste multiplier
Slowdown from 10x less experiment compute
Research progress rate in the limit of infinite coding labor: mostly because it’s highly uncertain (their 90% CI is 2.0-201)
Automated research taste of an AC
Biggest uncertainties to track
(not necessarily that I disagree, just need to think about it more)
Effective compute vs time horizon: how do all the assumptions look when we eliminate time horizon from the model and use other methods to model effective compute growth? I'm sketched out by the huge error bars on time horizon superexponentiality → time to AC
Ryan thinks >70% of code at Anthropic was already written by AIs in October 2025, but it's mostly low-value code. Code varies dramatically in value, and AIs can expand the number and type of low-value tasks done rather than just substituting for humans. This may be a separate effect from AIs doing extra work on automatable tasks, and it isn't tracked by the model.
It might be that coding ability and research taste are two ends of a continuous spectrum from small-scale to large-scale tasks.
Research taste:
Someone really needs to do experiments on this; it's possible now. David Rein and I are actively thinking about it.
Is human research taste modeled correctly? E.g. it seems likely to me that the top 0.3% of humans add more than 0.3% × 3.7x to the "aggregate research taste" of a lab, because they can set research directions. There are maybe more faithful ways to model it; all the ones Eli mentioned seemed far more complicated.
Is modeling AI research taste as exponential in human standard deviations valid? I have no idea whether someone 9 standard deviations above the human median would be able to find 3.7^(9/3) ≈ 50x better research ideas or not.
Is CES valid for experiment throughput at these extreme values of labor and compute? It seems like a superhuman AI researcher might learn to run experiments more efficiently, decreasing the compute required for each experiment. The estimates for experiment throughput parameters were all about humans getting 10x compute, infinite labor, etc. Or, they could coordinate better (especially with all the human ex-coders to help them), and decrease the parallelization penalties for labor and/or compute. I’m not sure if this would be different from adjusting research taste.
Thanks for writing this up! Excited about research taste experiments.
Is human research taste modeled correctly? E.g. it seems likely to me that the top 0.3% of humans add more than 0.3% × 3.7x to the "aggregate research taste" of a lab, because they can set research directions. There are maybe more faithful ways to model it; all the ones Eli mentioned seemed far more complicated.

A minimal change would be to change the aggregation from mean to something else; we were going to do this but didn't get to it in time. But yeah, to do it more faithfully would I think be pretty complicated, because you have to model experiment compute budgets for each human/AI. Note also that we aren't really modeling human/AI taste complementarity.
Or, they could coordinate better (especially with all the human ex-coders to help them), and decrease the parallelization penalties for labor and/or compute.

Agree that ideally there would at least be different penalties for AIs vs. humans doing the labor.
Is modeling AI research taste as exponential in human standard deviations valid? I have no idea whether someone 9 standard deviations above the human median would be able to find 3.7^(9/3) ≈ 50x better research ideas or not.

Note that because of limits (which weren't in your summary) the model is in practice subexponential, but exponential is generally a good approximation for the model around the human range. See here (4.2.2) for an explanation of taste limits.
Regarding whether it's a good approximation in the human range, we have some n=12 survey results on this here; obviously take them with a huge grain of salt. But extracted from these results, the ratio of (taste per SD between the 90th-percentile and top researchers) to (taste per SD between the 50th-percentile and top researchers) appears to be fairly close to 1: a median of 1.01 if assuming a population of 1000 researchers, and 0.95 if assuming a population of 100.
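(For reference, here is roughly how such a ratio can be computed; placing the top researcher of n at the (1 − 0.5/n) quantile is an illustrative convention on my part, and the actual survey extraction may differ. With exactly lognormal taste the ratio is 1 by construction.)

```python
import numpy as np
from scipy.stats import norm

def taste_per_sd_ratio(taste_90, taste_top, n_researchers=1000):
    """Ratio of log-taste-per-SD over (90th pct -> top) vs (median -> top),
    with taste expressed as a multiple of the median researcher's taste."""
    sd_90 = norm.ppf(0.90)
    sd_top = norm.ppf(1.0 - 0.5 / n_researchers)  # top researcher's quantile
    per_sd_90_top = (np.log(taste_top) - np.log(taste_90)) / (sd_top - sd_90)
    per_sd_50_top = np.log(taste_top) / sd_top    # median taste = 1 by definition
    return per_sd_90_top / per_sd_50_top

# Sanity check: exactly lognormal taste gives a ratio of exactly 1.0
lognormal_top = 1.5 ** (norm.ppf(0.9995) / norm.ppf(0.90))
print(taste_per_sd_ratio(taste_90=1.5, taste_top=lognormal_top))
```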
Arguably, there have already been some, see e.g. Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers, The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas and Predicting Empirical AI Research Outcomes with Language Models. I’d interpret the results as: even models from about one year ago, with reasonable scaffolding/fine-tuning, seem already roughly in the range of a PhD student from a top institution on research taste, if not higher, in the ML research domain.
I’ve a simple model of research taste:
Research is exploration: trying stuff to gain information about what happens and what works
You’re planning experiments, the unit of that exploration
This planning benefits from heuristics that generate, refine, and select better experiment plans: that’s taste
(As well as these heuristics, you can just plan for (effectively) longer if you have more thinkspeed, but I tentatively believe that falls off sharply per unit, until you get more feedback from reality, even when it’s serial thinkspeed)
How do you get these heuristics? By necessity, they’re partially-generalising models based on experience of experiments
(That experience can be indirect, in the form of textbooks or expert interviews etc.)
(But the key point is that taste isn’t just a generic capacity or quantity you have; it comes from looking at the world, specifically getting a feel for high value-of-information interactions)
So experimental throughput is crucial, as is sample efficiency (at improving your taste models)
Taste is a stock; it depreciates due to movement of the frontier of the known
You learn stuff from your experiments, you enter (more or less) different regimes, your heuristics are that bit further from their solid base of generalisation
How fast this depreciation happens is therefore of great interest, i.e. how well does research taste generalise in a given domain?
(This depreciation also means that the one-time boost to the taste stock from slurping up all the textbooks and expert interviews etc. is limited, but it's not clear how limited)
There are a bunch of parameters that look important on this view:
how ‘far’ does taste generalise (in the given domain)
or equivalently (and perhaps easier to instrumentalise and estimate?) how fast does it depreciate as the frontier moves?
how fast does the return to extra reasoning for experiment design diminish?
what are sample efficiency scaling laws like? (Does this change for finetuning and in-context sample efficiency and the like?)
do returns to research on effective compute look different from returns to research on sample efficiency?
I expect yes, in part because effective compute improvements are a bit more straightforward to verify