There are a lot of parameters. Maybe this is necessary but it’s a bit overwhelming and requires me to trust whoever estimated the parameters, as well as the modeling choices.
Yep. If you haven’t seen it already, we have a basic sensitivity analysis here. Some of the parameters matter much more than others. The five main sliders on the webpage correspond to the most impactful ones (for timelines as well as takeoff). There are also some checkboxes in the Advanced Parameters section that let you disable certain features to simplify the model.
Regarding form factor / conciseness: thanks for the feedback! Seems like people have widely varying opinions here. Would you prefer a form factor like this to what we currently have? Would you prefer a Big Table of every equation plus a Big Table with every symbol to what we currently have? (by the way, I can’t actually see the file Claude made—maybe it would work if you shared the artifact rather than the conversation?)
Relating time horizon to coding automation speedup: The only purpose of the time horizon in the model is to forecast the effective compute (which we would like to interpret as abstract “capability” or ECI or something) required for the Automated Coder milestone. You could do this in other ways, e.g. using Bio Anchors. (In fact, I would like to do it in other ways for more robustness, but sadly we didn’t have time before the launch.)
We roll out the “human-only” trajectory to translate the current (release date, time horizon) trend into an (effective compute, time horizon) trend, then extrapolate that until the horizon reaches the AC requirement. This tells you the effective compute required for Automated Coder. That value is then used to fit a separate automation schedule (which tells you the “fraction of coding tasks automated” at each effective compute level), anchored to 100% at the AC effective compute value. (Another degree of freedom is pinned down to match our estimate of today’s coding uplift.) The automation fraction feeds into a task-based CES model that computes the aggregate coding labor at each time. This is a pretty standard technique in economics for modeling automation, but not necessarily a good one. We think a more gears-level model of the delegation / reviewing / etc. process of agentic coding would be more accurate, but again we didn’t come up with a fully-formed one in time. I’m curious to hear more about the data you’re collecting on this!
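To make the pipeline above concrete, here is a minimal Python sketch. It is not the model's actual code, and several pieces are simplifying assumptions of mine: the (effective compute, time horizon) points are taken as given rather than produced by the human-only rollout, the horizon trend is assumed log-linear, the automation schedule is assumed linear in log effective compute and is pinned to a made-up "fraction automated today" (the real model pins today's coding uplift instead), and the CES aggregator uses one particular normalization over a unit continuum of tasks. All function names and numbers are hypothetical.

```python
import math

def fit_log_linear(points):
    """Least-squares fit of log(horizon) = a + b * log(effective compute)."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(h) for _, h in points]
    n = len(points)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

def compute_required(a, b, horizon_required):
    """Extrapolate the fitted trend until the horizon hits the requirement."""
    return math.exp((math.log(horizon_required) - a) / b)

def automation_fraction(compute, compute_today, frac_today, compute_ac):
    """ASSUMED schedule: linear in log compute, equal to frac_today today
    and to 1.0 at the AC effective compute level, clamped to [0, 1]."""
    t = ((math.log(compute) - math.log(compute_today))
         / (math.log(compute_ac) - math.log(compute_today)))
    return min(1.0, max(0.0, frac_today + (1.0 - frac_today) * t))

def ces_coding_labor(frac, human_labor, ai_labor, rho):
    """Task-based CES aggregate (rho != 0, labor inputs > 0).

    AI labor is spread evenly over the automated share `frac` of tasks,
    human labor over the remaining share (ASSUMED normalization)."""
    if frac <= 0.0:
        return human_labor
    if frac >= 1.0:
        return ai_labor
    ai_term = frac * (ai_labor / frac) ** rho
    human_term = (1.0 - frac) * (human_labor / (1.0 - frac)) ** rho
    return (ai_term + human_term) ** (1.0 / rho)

# Hypothetical (effective compute, horizon in hours) points.
points = [(1e24, 0.1), (1e25, 1.0), (1e26, 10.0)]
a, b = fit_log_linear(points)
compute_ac = compute_required(a, b, horizon_required=1e4)
frac = automation_fraction(1e27, compute_today=1e26, frac_today=0.3,
                           compute_ac=compute_ac)
labor = ces_coding_labor(frac, human_labor=1.0, ai_labor=5.0, rho=-1.0)
```

With rho < 0 the tasks are complements, so the non-automated tasks bottleneck the aggregate until the fraction gets close to 1; that is the qualitative behavior the task-based setup is meant to capture.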
Seems useful to talk to us in person about interpretations of time horizon / why 130 years is maybe reasonable. Eli’s rationale for that estimate is written up here.
Why not assume that compute is allocated optimally between experiment, inference, etc. rather than assuming things about the behavior of AI companies?
As with many things in this project, we wish we had more time to look into it, but didn’t prioritize it because we thought it would affect the results less than other things. When I briefly thought about this in the middle of the project, I remember getting confused about what “optimal” should mean. It also seems like it might increase the complexity of solving the model (it could increase the dimension of the system of differential equations, depending on how it’s done).
Messy digression on ways one might do this
For example, how do you decide how to allocate compute between experiments and training? If your goal is “maximize the effective compute of your frontier model at the end of the year”, the optimal policy is to spend all of your compute on experiments now, then at some specific time switch to all-training. But setting a schedule of “deadlines” like this seems unnatural.
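As a toy illustration of that switch-time policy (not something from the model itself), suppose software efficiency grows exponentially at a constant rate while all compute goes to experiments, and that the training run uses the software efficiency available at the moment of the switch. Both are assumptions of mine, as are the function names and the growth rate. The sketch only searches over policies of the "experiments, then training" form; it does not show that this form is optimal.

```python
import math

def end_of_year_effective_compute(switch_time, growth_rate,
                                  compute=1.0, year=1.0):
    """Effective compute of the model finished at the end of the year.

    ASSUMPTIONS: software efficiency is exp(growth_rate * t) while all
    compute runs experiments; training from switch_time to the end of the
    year uses the efficiency reached at switch_time."""
    software_efficiency = math.exp(growth_rate * switch_time)
    training_compute = compute * (year - switch_time)
    return software_efficiency * training_compute

def best_switch_time(growth_rate, steps=10_000, year=1.0):
    """Grid search over the time at which to switch to all-training."""
    candidates = [year * i / steps for i in range(steps + 1)]
    return max(candidates,
               key=lambda s: end_of_year_effective_compute(
                   s, growth_rate, year=year))

# Hypothetical growth rate of 4 per year.
switch = best_switch_time(growth_rate=4.0)
```

In this toy the objective is exp(g·s)·(1 − s), which is maximized at s = 1 − 1/g whenever g > 1, so the "deadline" moves later as software progress gets faster.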
You could also imagine that at each point in time, the lab is spending an amortized “training budget” in H100e equaling
(size of actual frontier training system) × (fraction of each year during which it’s utilized for training production models)
or
(company H100e) × (fraction of H100e in frontier training system) × (fraction of each year when it’s utilized)
which is, assuming one frontier-scale production training run per year (which is maybe reasonable??):
(company H100e) × (fraction of H100e in frontier training system) × min{1, training run length / 1 year}.
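The last expression is simple enough to write down directly. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
def amortized_training_budget(company_h100e, fts_fraction, run_length_years):
    """Amortized training budget in H100e, assuming one frontier-scale
    production training run per year (the assumption made in the text)."""
    utilization = min(1.0, run_length_years / 1.0)  # share of the year in use
    return company_h100e * fts_fraction * utilization

# Hypothetical numbers: 1M H100e, half of it in the frontier training
# system, and a three-month production run.
budget = amortized_training_budget(1_000_000, 0.5, 0.25)
```

The min{1, ·} cap just says that a run longer than a year cannot use the system for more than 100% of the year.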
Jointly optimizing the FTS fraction, the training run length, and experiment compute seems like a bit of a mess, since the software efficiency that matters is probably the software efficiency at the beginning of the run, which you already decided at a previous timestep… possibly there’s a nice solution to this. Might think about it more later today.
One thing I agree would be easy, and that I should probably do, is plot the implied marginal rate of technical substitution (MRTS) between experiment compute and automation compute over time, i.e. the number of experiment H100e you’d need to gain such that simultaneously losing a single automation H100e doesn’t affect the software efficiency growth rate (or equivalently research effort). If compute were allocated optimally between the two, this should always be 1. I bet it isn’t though.
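For what it's worth, the MRTS can be read off numerically from any function that maps (experiment compute, automation compute) to research effort, as the ratio of the two marginal products. Here is a minimal finite-difference sketch. The Cobb-Douglas research effort function is only a stand-in I made up for testing and is not the model's production function; the name `mrts` and all numbers are hypothetical.

```python
def mrts(research_effort, experiment_h100e, automation_h100e, rel_step=1e-6):
    """Experiment H100e needed to offset losing one automation H100e,
    estimated as (dR/d automation) / (dR/d experiment) via central
    differences."""
    dx = experiment_h100e * rel_step
    da = automation_h100e * rel_step
    d_experiment = (research_effort(experiment_h100e + dx, automation_h100e)
                    - research_effort(experiment_h100e - dx, automation_h100e)
                    ) / (2 * dx)
    d_automation = (research_effort(experiment_h100e, automation_h100e + da)
                    - research_effort(experiment_h100e, automation_h100e - da)
                    ) / (2 * da)
    return d_automation / d_experiment

def toy_research_effort(experiment_h100e, automation_h100e):
    """STAND-IN production function (Cobb-Douglas), for illustration only."""
    return experiment_h100e ** 0.6 * automation_h100e ** 0.4

# A 60/40 split is the optimum for this toy function, so MRTS is about 1.
at_optimum = mrts(toy_research_effort, 60_000, 40_000)
# A 50/50 split over-invests in automation compute, so MRTS falls below 1.
off_optimum = mrts(toy_research_effort, 50_000, 50_000)
```

Plotting this quantity along a simulated trajectory would show how far the assumed allocation is from the one that equalizes marginal returns per H100e.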
I wish the interface were faster to update, closer to 100ms than 1s to update, but this isn’t a big deal. I can believe it’s hard to speed up code that integrates these differential equations many times per user interaction.
Very interesting to hear! One main obstacle is that the model is being solved on the server rather than the client, so getting to 100ms is hard. There’s also a tradeoff with time resolution (with very fast takeoffs, the graphs already look a bit piecewise linear). But I think there is definitely room for optimization in the solver.
Thanks!