Appendix: Estimating the relationship between algorithmic improvement and labor production
In particular, if we fix the architecture to use a token abstraction and consider training a new improved model, we care about: how much cheaper you make generating tokens at a given level of performance (inference tok/FLOP), how much serially faster you make generating tokens at a given level of performance (serial speed: tok/s at a fixed tok/FLOP), and how much more performance you can get out of tokens (labor/tok, really per serial token). Then, for a given new model with reduced cost, increased speed, and increased production per token, and assuming a parallelism penalty of 0.7, we can compute the increase in production as roughly: cost_reduction^0.7 ⋅ speed_increase^(1−0.7) ⋅ productivity_multiplier[1] (I can show the math for this if there is interest).
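For concreteness, here is a minimal sketch of that formula in Python. The function and variable names are my own, and the parallelism penalty of 0.7 is just the assumption stated above.

```python
# Minimal sketch of the production-increase formula above, assuming a
# parallelism penalty of 0.7. Names are illustrative.

PARALLELISM_PENALTY = 0.7

def labor_production_increase(cost_reduction: float,
                              speed_increase: float,
                              productivity_multiplier: float = 1.0,
                              p: float = PARALLELISM_PENALTY) -> float:
    """Multiplier on effective labor production from a new model.

    cost_reduction: factor by which tokens get cheaper (tok/FLOP) at fixed capability
    speed_increase: factor by which serial token generation gets faster (tok/s)
    productivity_multiplier: extra labor per (serial) token
    """
    return cost_reduction ** p * speed_increase ** (1 - p) * productivity_multiplier

# Example: 2x cheaper, 1.3x faster, no productivity gain
print(labor_production_increase(2.0, 1.3))  # 2^0.7 * 1.3^0.3 ≈ 1.76
```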
My sense is that reducing the inference compute needed for a fixed level of capability you already have (holding the training run fixed) is usually somewhat easier than making frontier compute go further by some factor, though I don’t think it is easy to straightforwardly determine how much easier this is[2]. Let’s say there is a 1.25 exponent on reducing cost (as in, a 2x algorithmic efficiency improvement is as hard as a 2^1.25 ≈ 2.38x reduction in cost). (I’m generally pretty confused about what the exponent should be; exponents from 0.5 to 2 seem plausible. 0.5 would correspond to the square root you’d get from just scaling data in scaling laws.) It seems substantially harder to increase speed than to reduce cost, as speed is substantially constrained by serial depth, at least when naively applying transformers. Naively, reducing cost by β (which implies reducing parameters by β) will increase speed by somewhat more than β^(1/3), since parameter count scales roughly cubically with depth (so depth shrinks only with the cube root of parameters). I expect you can do somewhat better than this because reduced matrix sizes also increase speed (it isn’t just depth) and because you can introduce speed-specific improvements (that improve speed but not cost). But this factor might be pretty small, so let’s stick with 1/3 for now and ignore speed-specific improvements. Now, let’s consider the case where we don’t have productivity multipliers (which is strictly more conservative). Then the increase in labor production is:
cost_reduction^0.7 ⋅ (cost_reduction^(1/3))^(1−0.7) = cost_reduction^0.8 = (algo_improvement^1.25)^0.8 = algo_improvement^1
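As a quick numeric check of this derivation (under the assumed 1.25 cost exponent, the β^(1/3) speed gain, and the 0.7 parallelism penalty; the variable names are mine):

```python
# Numeric check of the derivation above, under the stated assumptions:
# cost_reduction = algo_improvement^1.25, speed_increase = cost_reduction^(1/3),
# parallelism penalty 0.7, and no productivity multiplier.

algo_improvement = 2.0                      # e.g., a 2x frontier algorithmic improvement
cost_reduction = algo_improvement ** 1.25   # ≈ 2.38x cheaper inference
speed_increase = cost_reduction ** (1 / 3)  # ≈ 1.33x faster serial generation

labor_increase = cost_reduction ** 0.7 * speed_increase ** (1 - 0.7)
print(labor_increase)                       # ≈ 2.0, i.e. algo_improvement^1.0
```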
So, these numbers ended up yielding an exact equivalence between frontier algorithmic improvement and effective labor production increases. (This is a coincidence, though I do think the exponent is close to 1.)
In practice, we’ll be able to get slightly better returns by spending some of our resources on speed-specific improvements and on improving productivity rather than on reducing cost. I don’t currently have a principled way to estimate this (though I expect something roughly principled could be found by looking at the tradeoff between inference compute and training compute), but my guess is this improves the returns to around algo_improvement^1.1. If the exponent on reducing cost were much worse, we would invest more in improving productivity per token, which bounds the returns somewhat.
Appendix: Isn’t compute tiny and decreasing per researcher?
One relevant objection is: Ok, but is this really feasible? Wouldn’t this imply that each AI researcher has only a tiny amount of compute? After all, if you use 20% of compute for inference of AI research labor, then each AI only gets 4x as much compute to run experiments as it uses for its own inference. And as you make algorithmic improvements to reduce AI cost and run more AIs, you also reduce the compute available per AI!

First, it is worth noting that as we make algorithmic progress, both the cost of AI researcher inference and the cost of experiments on models of a given level of capability go down. Precisely, for any experiment that involves a fixed number of inference or gradient steps on a model which is some fixed effective-compute multiplier below/above the performance of our AI laborers, cost is proportional to inference cost (so, as we improve our AI workforce, experiment cost drops proportionally). However, for experiments that involve training a model from scratch, I expect the reduction in experiment cost to be relatively smaller, such that such experiments must become increasingly small relative to frontier scale. Overall, it might be important to mostly depend on approaches which allow for experiments that don’t require training runs from scratch, or to adapt to increasingly smaller full experiment training runs. To the extent AIs are made smarter rather than more numerous, this isn’t a concern. Additionally, we only need so many orders of magnitude of growth. In principle, this consideration should be captured by the exponents in the compute vs. labor production function, but it is possible this production function has very different characteristics in the extremes. Overall, I do think this concern is somewhat important, but I don’t think it is a dealbreaker for a substantial number of OOMs of growth.
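Here is a toy calculation illustrating this point. The 20% inference share comes from the objection above; the units and the 10x efficiency step are arbitrary illustrative assumptions of mine.

```python
# Toy model of per-researcher compute as inference gets cheaper. The 20% inference
# share comes from the objection above; the units and the 10x step are arbitrary.

TOTAL_COMPUTE = 100.0        # total compute budget (arbitrary units)
INFERENCE_FRACTION = 0.2     # fraction spent running AI researchers

def per_researcher_budgets(inference_cost, experiment_unit_cost):
    """inference_cost: compute to run one AI researcher.
    experiment_unit_cost: compute for one 'unit' of experiment, e.g. a fixed number
    of inference or gradient steps on a model of comparable (relative) capability."""
    n_researchers = INFERENCE_FRACTION * TOTAL_COMPUTE / inference_cost
    experiment_compute_each = (1 - INFERENCE_FRACTION) * TOTAL_COMPUTE / n_researchers
    experiments_each = experiment_compute_each / experiment_unit_cost
    return n_researchers, experiment_compute_each, experiments_each

# A 10x inference-efficiency improvement: raw compute per researcher falls 10x,
# but if experiment cost scales with inference cost (as for experiments that don't
# train from scratch), the number of experiments per researcher stays constant.
print(per_researcher_budgets(1.0, 1.0))  # (20 researchers, 4.0 compute each, 4.0 experiments)
print(per_researcher_budgets(0.1, 0.1))  # (200 researchers, 0.4 compute each, 4.0 experiments)
```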
Appendix: Can’t algorithmic efficiency only get so high?
My sense is that this isn’t very close to being a blocker. Here is a quick bullet-point argument (from some slides I made) that takeover-capable AI is possible on current hardware; a rough arithmetic sketch of the headline numbers follows the list.
- Human brain is perhaps ~1e14 FLOP/s
- With that efficiency, each H100 can run ~10 humans (current cost ~$2/hour)
- 10s of millions of human-level AIs with just current hardware production
- Human brain is probably very suboptimal:
  - AIs are already much better at many subtasks
  - Possible to do much more training than within-lifetime human training, via parallelism
  - Biological issues: locality, noise, focus on sensory processing, memory limits
  - Smarter AI could be more efficient (smarter humans use less FLOP per task)
- AI could be 1e2–1e7x more efficient on tasks like coding and engineering
  - Probably a smaller improvement on video processing
  - Say 1e4x, so ~100,000 human-equivalents per H100
- Qualitative intelligence could be a big deal
- Seems like peak efficiency isn’t a blocker.
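To make the headline numbers concrete, here is a rough arithmetic sketch. All inputs are the order-of-magnitude guesses from the slides above, except the H100 throughput and the stock of H100-equivalents, which are my own rough assumptions chosen to be consistent with the bullets ("10 humans per H100", "10s of millions").

```python
# Rough arithmetic behind the bullets above. All numbers are order-of-magnitude
# guesses; H100_FLOPS and H100_EQUIVALENTS are assumptions, not measurements.

BRAIN_FLOPS = 1e14            # ~1e14 FLOP/s per human brain (slide's guess)
H100_FLOPS = 1e15             # ~1e15 FLOP/s usable per H100 (assumption)
EFFICIENCY_GAIN = 1e4         # middle-of-the-road pick from the 1e2-1e7x range
H100_EQUIVALENTS = 3e6        # assumed current stock of H100-equivalents (order of magnitude)

humans_per_h100 = H100_FLOPS / BRAIN_FLOPS
print(humans_per_h100)                      # 10 human-equivalents per H100 at brain efficiency
print(H100_EQUIVALENTS * humans_per_h100)   # ~3e7: 10s of millions of human-level AIs
print(humans_per_h100 * EFFICIENCY_GAIN)    # ~1e5: ~100,000 per H100 if 1e4x more efficient
```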
[1] This is just approximate because you can also trade off speed with cost in complicated ways and research new ways to more efficiently trade off speed and cost. I’ll be ignoring this for now.
[2] It’s hard to determine because inference cost reductions have been driven by spending more compute on making smaller models (e.g., training a smaller model for longer) rather than just by algorithmic improvement, and I don’t have great numbers on the difference off the top of my head.
Interesting comparison point: Tom thought this would give a way larger boost in his old software-only singularity appendix.
When considering an “efficiency only singularity”, some different estimates get him r ≈ 1, r ≈ 1.5, and r ≈ 1.6. (Where r is defined so that “for each x% increase in cumulative R&D inputs, the output metric will increase by r*x”. The condition for increasing returns is r > 1.)
Whereas when including capability improvements:

“I said I was 50-50 on an efficiency only singularity happening, at least temporarily. Based on these additional considerations I’m now at more like ~85% on a software only singularity. And I’d guess that initially r = ~3 (though I still think values as low as 0.5 or as high as 6 as plausible). There seem to be many strong ~independent reasons to think capability improvements would be a really huge deal compared to pure efficiency problems, and this is borne out by toy models of the dynamic.”
Though note that later in the appendix he adjusts down from 85% to 65% due to some further considerations. Also, last I heard, Tom was more like 25% on software singularity. (ETA: Or maybe not? See other comments in this thread.)
Interesting. My numbers aren’t very principled and I could imagine thinking capability improvements are a big deal for the bottom line.