I’m so excited to see this go live! I’ve learned a lot from it & consider it to do for takeoff speeds what Ajeya’s report did for timelines, i.e. it’s an actual fucking serious-ass gears-level model, the best that exists in the world for now. Future work will critique it and build off it rather than start from scratch, I say. Thanks Tom and Epoch and everyone else who contributed!
I strongly encourage everyone reading this to spend 10 minutes playing around with the model, trying out different settings, etc. For example: try to get it to match what you intuitively expect timelines and takeoff to look like, and see how hard it is to do so. Or: go through the top 5-10 variables one by one and change them to what you think they should be (leaving unchanged the ones about which you have no opinion), and then see what effect each change has.
Almost two years ago I wrote this story of what the next five years would look like on my median timeline. At the time I had the bio anchors framework in mind, with median training requirements of 3e29 FLOP. So, you can use this takeoff model as a nice complement to that story:
Go to takeoffspeeds.com and load the preset: best guess scenario.
Set AGI training requirements to 3e29 instead of 1e36
(Optional) Set software returns to 2.5 instead of 1.25 (I endorse this change in general, because it’s more consistent with the empirical evidence. See Tom’s report for details & decide whether his justification for cutting it in half, to 1.25, is convincing.)
(Optional) Set FLOP gap to 1e2 instead of 1e4 (In general, as Tom discusses in the report, if training requirements are smaller then probably the FLOP gap is smaller too. So if we are starting with Tom’s best guess scenario and lowering the training requirements we should also lower the FLOP gap.)
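The steps above can be summarized as a settings diff against the best guess preset. This is a minimal sketch: the key names below are illustrative, not the model's actual variable identifiers.

```python
# Hypothetical encoding of the parameter changes described above,
# relative to the takeoffspeeds.com "best guess" preset.
# Key names are placeholders, not the model's real identifiers.
best_guess = {
    "agi_training_requirements_flop": 1e36,
    "software_returns": 1.25,
    "flop_gap": 1e4,
}
overrides = {
    "agi_training_requirements_flop": 3e29,  # bio-anchors-style median
    "software_returns": 2.5,                 # optional
    "flop_gap": 1e2,                         # optional
}
scenario = {**best_guess, **overrides}
print(scenario)
```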
The result:
In 2024, 4% of AI R&D tasks are automated; then 32% in 2026, and then singularity happens around when I expected, in mid 2028. This is close enough to what I had expected when I wrote the story that I’m tentatively making it canon.
Oh, also, a citation about my contribution to this post (Tom was going to make this a footnote but ran into technical difficulties): The extremely janky graph/diagram was made by me in May 2021, to help explain Ajeya’s Bio Anchors model. The graph that forms the bottom left corner came from some ARK Invest webpage which I can’t find now.
(ARK Invest source is here, and they basically got it from the addendum to AI and Compute.)
Have you considered the quantity of inference silicon required?
You buy out a large fraction of TSMC’s annual production and get the training compute. You now have one AGI. Replacing the cognitive work of one human requires a certain amount of silicon per human replaced, and the AGI model itself is large: it likely requires multiple nodes, possibly hundreds, each fully populated with 16 of <Nvidia’s latest accelerator> at $25,000 apiece.
So if, say, we need 250 nodes: 250 times 16 times $25,000 is $100 million. Say it is cognitively as effective as 10 humans in the top 0.1% of intelligence. That’s still not really changing anything unless it’s malicious. At this cost ($10 million per equivalent person) it’s just slightly more cost-effective than training humans.
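The back-of-the-envelope arithmetic above, spelled out (the node count, accelerator price, and effectiveness figure are all the comment's assumptions, not established numbers):

```python
# Back-of-the-envelope inference-hardware cost for one AGI instance,
# using the assumed figures from the comment above.
nodes = 250
accelerators_per_node = 16
cost_per_accelerator_usd = 25_000
total_cost_usd = nodes * accelerators_per_node * cost_per_accelerator_usd
print(total_cost_usd)  # 100000000, i.e. $100 million

human_equivalents = 10  # assumed: as effective as 10 top-0.1% humans
cost_per_equivalent_usd = total_cost_usd // human_equivalents
print(cost_per_equivalent_usd)  # 10000000, i.e. $10 million per person-equivalent
```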
At this point there would be a ‘warning’ period: very large sums of money would have to be spent to scale silicon production, both to bring these costs down and to supply enough copies of the hardware to be cognitively equivalent to enough people to make a difference.
It is a positive feedback loop: each set of inference silicon pays for itself in 10-20 years. Moreover, once you have enough sets you could make a knockoff design for the chip and stop paying Nvidia, paying only for the silicon itself, probably $2,000 instead of $25,000. That’s an OOM more cost-effective, and it only gets faster from here.
The model does take into account the cost of runtime compute and how that affects demand, but costs are mostly dominated by training compute, not runtime compute. In the model's default settings, it costs about 13 orders of magnitude more to train an AGI than to run one for 8 hours a day for 250 days (a typical human work year).
You can see and adjust those settings here: https://takeoffspeeds.com/
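To make the 13-OOM claim concrete: under the model's default ~1e36 FLOP training requirement, that ratio implies one work-year of inference costs on the order of 1e23 FLOP. Both input numbers here are taken from the comment above, not recomputed from the model.

```python
# Illustrating the claimed ~13-OOM gap between training cost and one
# work-year (8 h/day for 250 days) of inference. Inputs are the figures
# stated in the comment, not outputs recomputed from the model itself.
training_flop = 1e36        # model's default AGI training requirement
train_to_run_ratio = 1e13   # ~13 orders of magnitude
work_year_inference_flop = training_flop / train_to_run_ratio
print(f"{work_year_inference_flop:.0e}")  # ~1e+23 FLOP per work-year
```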
Note that the model is mostly modeling the takeoff to AGI, i.e. how pre-human-level AI might affect things (especially the speed of that takeoff), rather than the effects of AGI itself.
Training the AGI may not be the expensive part. If we think current model architectures are flawed (not robust enough, or lacking the right topology to solve “AGI”-grade cognitive tasks), then we need to search the spaces of:
1. Network subcomponents: activation functions, and larger blocks like transformers.
2. Network architectures, e.g. “n x m dimension of <architecture type X>, feeding into n x m dimension of <architecture type Y>”.
3. Cognitive architectures, e.g. “system-1 output from a network of architecture type C feeds into a task meta-controller that, based on confidence, either routes the output to the robotics estimation module or...”. These are collections of modules, some of which may not even use neural networks, that form the machine’s cognitive architecture.
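The three-level search described above can be sketched as a nested enumeration. Everything here is a placeholder (component names, the candidate spaces, and especially the scoring function, which stands in for the astronomically expensive step of fully training and benchmarking each candidate):

```python
# Purely illustrative sketch of the three-level architecture search.
# All names are placeholders; nothing here reflects a real AGI recipe.
subcomponents = ["relu_block", "transformer_block", "ssm_block"]
architectures = ["shallow_stack", "deep_stack", "mixture"]
cognitive_layouts = ["monolithic", "system1_system2", "modular_controller"]

def benchmark_score(candidate):
    # Stand-in for the real (astronomically expensive) step: fully train
    # the candidate and evaluate it on an "AGI-level task" benchmark.
    return sum(len(name) for name in candidate) % 7

candidates = [
    (sub, arch, layout)
    for sub in subcomponents
    for arch in architectures
    for layout in cognitive_layouts
]
best = max(candidates, key=benchmark_score)
print(len(candidates), best)  # 27 candidates evaluated
```

Each level multiplies the size of the space, which is why the comment below worries about the total compute bill rather than any single training run.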
Technically, an AGI is the combination of architectures that achieves “AGI-level performance” (by whatever heuristic we use) on a large and diverse benchmark of “AGI-level tasks” (tasks hard enough that 50% of humans will fail them, or 99.9%, depending on your definition of AGI).
A superintelligence would be a machine that both passes a large AGI benchmark and scores better than all living humans on most tasks (in the statistical sense: enough standard deviations above the human mean that fewer than 1 in 8 billion humans would be expected to score that well).
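To make "fewer than 1 in 8 billion" concrete: under a (simplifying, probably wrong for tails) normal model of ability, the threshold works out to roughly 6.3 standard deviations above the mean.

```python
from statistics import NormalDist

# How many standard deviations above the human mean is the level that
# fewer than 1 in ~8 billion people are expected to exceed, assuming
# (unrealistically) normally distributed ability?
tail_probability = 1 / 8e9
z = NormalDist().inv_cdf(1 - tail_probability)
print(f"{z:.1f}")  # ~6.3 standard deviations
```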
So if you think about it, searching this space (making many failed AGI candidates to gain information about the possibility space) could eat up many more OOMs of compute, e.g. if we have to train 1e6 full AGI-scale models, each one failing the benchmark but some doing better than others.
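The 1e6-candidates scenario is easy to put in OOM terms (taking the model's default ~1e36 FLOP training requirement as the per-candidate cost, which is this comment's assumption):

```python
import math

# If the search requires ~1e6 fully trained AGI-scale candidates, each at
# ~1e36 FLOP, the search dominates any single training run.
candidates_trained = 1e6
flop_per_candidate = 1e36  # model's default AGI training requirement
search_flop = candidates_trained * flop_per_candidate
extra_ooms = math.log10(search_flop / flop_per_candidate)
print(f"{extra_ooms:.0f}")  # 6 extra orders of magnitude over one run
```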
It may not be this difficult, though; it’s possible that any architecture that has the minimum essential parts, given sufficient compute to train it, will pass the benchmark.