Here is my Elicit Snapshot.
I’ll follow the definition of AGI given in this Metaculus challenge, which roughly amounts to a single model that can “see, talk, act, and reason.” My predicted distribution is a weighted sum of two component distributions described below:
Prosaic AGI (25% probability). Timeline: 2024-2037 (Median: 2029): We develop AGI by scaling and combining existing techniques. The most probable path I can foresee loosely involves 3 stages: (1) developing a language model with human-level language ability, then (2) giving it visual capabilities (e.g., talking about pictures and videos, solving SAT math problems with figures), and then (3) giving it capabilities to intelligently act in the world (e.g., trading stocks or navigating webpages). Below are my timelines for the above stages:
Human-level Language Model: 1.5-4.5 years (Median: 2.5 years). We can predictably improve our language models by increasing model size (parameter count; see the scaling-law sketch after this list), which we can do in the following two ways:
Scaling Language Model Size by 1000x relative to GPT-3. 1000x is pretty feasible, but as I understand it, we’ll hit difficult hardware/communication-bandwidth constraints beyond 1000x.
Increasing Effective Parameter Count by 100x using modeling tricks (Mixture of Experts, Sparse Transformers, etc.)
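To make “predictably improve” concrete, here is a minimal sketch of the power-law relationship between loss and parameter count, using the published fits from Kaplan et al. (2020); the constants are illustrative, and real scaling also depends on data and compute:

```python
# Minimal sketch: LM loss falls as a power law in parameter count N.
# Constants are the Kaplan et al. (2020) fits; treat them as illustrative.
N_C = 8.8e13      # fitted constant (parameters)
ALPHA_N = 0.076   # fitted power-law exponent

def predicted_loss(n_params: float) -> float:
    """Predicted LM loss (nats/token) at parameter count n_params."""
    return (N_C / n_params) ** ALPHA_N

gpt3 = 175e9
for scale in (1, 10, 100, 1000):
    n = scale * gpt3
    print(f"{scale:>5}x GPT-3: N = {n:.1e}, predicted loss ~ {predicted_loss(n):.3f}")
```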
+Visual Capabilities: 2-6 extra years (Median: 4 years). We’ll need good representation learning techniques for learning from visual input (which I think we mostly have). We’ll also need to combine vision and language models; there are many existing techniques to try here, and they generally work pretty well. The main potential bottleneck, time-wise, is that the language+vision components will likely need to be pretrained together, which slows the iteration time and reduces the number of research groups that can contribute (especially for learning from video, which is expensive). For reference, language+image pretrained models like ViLBERT came out 10 months after BERT did.
+Action Capabilities: 0-6 extra years (Median: 2 years). GPT-3-style zero-shot or few-shot instruction following seems like the most feasible/promising approach to me here; it could work as soon as we have a strong, pretrained vision+language model. Alternatively, we could use that model within a larger system, e.g., a policy trained with reinforcement learning, but that approach could take a while to get to work.
Breakthrough AGI (75% probability). Timeline: Uniform probability over the next century: We need several fundamental breakthroughs to achieve AGI. Breakthroughs are hard to predict, so I’ll assume a uniform distribution over which year (before 2100) we hit upon the necessary breakthroughs, with 15% total probability mass after 2100 (a rough estimate). I’m estimating that 15% roughly as: a 5% probability that we won’t find the right insights by 2100, a 5% probability that we have the right insights but not enough compute by 2100, and a 5% probability to account for the planning fallacy, unknown unknowns, and the fact that a number of top AI researchers believe that we are very far from AGI.
My probability for Prosaic AGI is based on an estimated probability of each of the 3 stages of development working (described above):
P(Prosaic AGI) = P(Stage 1) × P(Stage 2) × P(Stage 3) = 3/4 × 2/3 × 1/2 = 1/4
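For concreteness, here is a minimal sketch of the full forecast as a two-component mixture. The shape of the prosaic component is my assumption (the post only gives a range and a median; I use a triangular distribution), as is the 2021 start year for the breakthrough component:

```python
import random

def sample_agi_year() -> float:
    """Sample one AGI arrival year from the two-component mixture.

    Assumptions beyond the post: the prosaic component is modeled as a
    triangular distribution on 2024-2037 peaking at the 2029 median, and
    the 15% post-2100 tail is represented by the sentinel float('inf').
    """
    if random.random() < 0.25:                  # Prosaic AGI component
        return random.triangular(2024, 2037, 2029)
    if random.random() < 0.15:                  # breakthrough tail after 2100
        return float('inf')
    return random.uniform(2021, 2100)           # uniform breakthrough years

years = [sample_agi_year() for _ in range(100_000)]
print("P(AGI by 2100) ~", sum(y <= 2100 for y in years) / len(years))
# Expected: 0.25 + 0.75 * 0.85 = 0.8875
```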
------------------
Updates/Clarification after some feedback from Adam Gleave:
Updated from 5% → 15% probability that AGI won’t happen by 2100 (see reasoning above). I’ve updated my Elicit snapshot appropriately.
There are other concrete paths to AGI, but I consider them fairly unlikely (<5%) to work first, and experimental enough that it’s hard to predict when they will work. For example, I can’t think of a good way to predict when we’ll get AGI from training agents in a simulated, multi-agent environment (e.g., in the style of OpenAI’s Emergent Tool Use paper). Thus, I think it’s reasonable to group such other paths to AGI into the “Breakthrough AGI” category and model these paths with a uniform distribution.
I think you can do better than a uniform distribution for the “Breakthrough AGI” category, by incorporating the following information:
Breakthroughs will be less frequent as time goes on, as the low-hanging fruit/insights are picked first. Adam suggested an exponential decay over time / Laplacian prior, which sounds reasonable (a sketch comparing the two priors follows this list).
Growth of AI research community: Estimate the size of the AI research community at various points in time, and estimate the pace of research progress given that community size. It seems reasonable to assume that the pace of progress increases logarithmically with the size of the research community, but I can also see arguments for why we’d benefit more (or less) from a larger community, or even have slower progress.
Growth of funding/compute for AI research: As AI becomes increasingly monetizable, there will be more incentives for companies and governments to support AI research, e.g., in terms of growing industry labs, offering grants to academic labs to support researchers, and funding compute resources—each of these will speed up AI development.
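As an illustration of the first suggestion, here is a minimal sketch comparing the uniform prior to an exponentially decaying one, both normalized so that 85% of the probability mass falls before 2100. The decay rate is an arbitrary illustrative choice, not a number from this thread:

```python
import math

START, END, MASS_BY_2100 = 2021, 2100, 0.85

def uniform_cdf(year: float) -> float:
    """P(breakthrough year <= year), uniform over START..END."""
    return MASS_BY_2100 * (year - START) / (END - START)

def exp_decay_cdf(year: float, rate: float = 0.03) -> float:
    """Same, but with an exponentially decaying density (illustrative rate)."""
    z = 1 - math.exp(-rate * (END - START))   # normalizer over the window
    return MASS_BY_2100 * (1 - math.exp(-rate * (year - START))) / z

# The decaying prior front-loads probability into the near-term decades:
for y in (2030, 2050, 2100):
    print(y, round(uniform_cdf(y), 3), round(exp_decay_cdf(y), 3))
```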
Scaling Language Model Size by 1000x relative to GPT-3. 1000x is pretty feasible, but as I understand it, we’ll hit difficult hardware/communication-bandwidth constraints beyond 1000x.

I think people are hugely underestimating how much room there is to scale.
The difficulty, as you mention, is bandwidth and communication, rather than cost per bit in isolation. An A100 manages 1.6 TB/s of bandwidth to its 40 GB of memory. We can handle sacrificing some of this speed, but SSDs aren’t fast enough: 350 TB of SSD memory would cost just $40k, but would only manage 1-2 TB/s over the whole array, and couldn’t push it to a single GPU. More DRAM on the GPU does hit physical scaling issues, and scaling out to larger clusters of GPUs does start to hit difficulties after a point.
This problem is due not to physical law, but to the technologies in question. DRAM is fast, but has hit a scaling limit, whereas NAND scales well, but is much slower. And the larger the cluster of machines, the more bandwidth you have to sacrifice for signal integrity and routing.
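As an aside, it’s worth checking where a figure like 350 TB comes from: it’s roughly the weight footprint of a 1000x GPT-3 at 2 bytes per parameter (my inference; the comment doesn’t spell this out). A quick back-of-the-envelope check:

```python
GPT3_PARAMS = 175e9     # GPT-3 parameter count
SCALE = 1000            # the 1000x scaling discussed above
BYTES_PER_PARAM = 2     # fp16/bf16 weights; optimizer state would add more

weights_tb = GPT3_PARAMS * SCALE * BYTES_PER_PARAM / 1e12
print(f"Weights alone: {weights_tb:.0f} TB")   # -> 350 TB

# At A100-like HBM bandwidth (~1.6 TB/s per GPU), just streaming those
# weights once costs 350 / 1.6 ~ 220 GPU-seconds of memory traffic.
print(f"One full pass of weight reads: {weights_tb / 1.6:.0f} GPU-seconds")
```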
The thing is, these bandwidth and memory limits are fixable if you allow for technology to shift. For example:
Various sorts of persistent memory allow for fast, dense storage, like NRAM. There’s also 3D XPoint and other ReRAMs, various sorts of MRAM, etc.
Multiple technologies allow for connecting hardware significantly more densely than we currently do, primarily things like chiplets and memory stacking. Intel’s Ponte Vecchio intends to tie 96 (or 192?) compute dies together, across 6 interconnected GPUs, each made of 2 (or 4?) groups of 8 compute dies.
Neural networks are amenable to ‘spatial computing’ (visualization), and with appropriate algorithms the end-to-end latency can largely be ignored as long as the block-to-block latency and throughput are sufficiently high. This means there’s no clear limit to this sort of scaling, since the individual latencies are invariant to scale (see the pipelining sketch after this list).
The switches between the machines are not at a limit yet either, because of silicon photonics, which can even be integrated alongside compute dies. The example I have in mind is in a switch, but photonics can also be integrated alongside GPUs.
You mention this, but to complete the list: sparse training makes scale-out vastly easier, at the cost of reducing the effectiveness of scaling. GShard showed effectiveness at >99.9% sparsity for mixture-of-experts models, and it seems natural to imagine that a more flexible scheme with only, say, 90% training sparsity and support for full-density inference would allow for 10x scaling without meaningful downsides.
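Two quick sketches to make the last two points concrete. First, why end-to-end latency washes out under pipelined ‘spatial computing’ (my illustration, not from the comment): steady-state throughput depends only on per-stage time, while extra stages only add a one-time fill cost.

```python
def pipeline_time(n_stages: int, stage_ms: float, hop_ms: float, n_items: int) -> float:
    """Total time (ms) to push n_items through a pipeline of n_stages blocks.

    stage_ms: compute time per block; hop_ms: block-to-block link latency.
    """
    fill = n_stages * (stage_ms + hop_ms)   # one-time latency for the first item
    steady = (n_items - 1) * stage_ms       # then one item completes per stage_ms
    return fill + steady

# Scaling 100x in depth barely changes total time for a long stream of items:
for n_stages in (10, 100, 1000):
    total = pipeline_time(n_stages, stage_ms=1.0, hop_ms=0.1, n_items=100_000)
    print(f"{n_stages:>4} stages: {total / 1000:.1f} s for 100k items")
```

Second, a minimal sketch of the top-k expert routing behind mixture-of-experts models like GShard (shapes and expert count here are illustrative; GShard routes each token to 2 of up to 2048 experts):

```python
import numpy as np

def top_k_routing(tokens: np.ndarray, gate_w: np.ndarray, k: int = 2):
    """Route each token to its top-k experts; all other experts stay idle.

    tokens: (n_tokens, d_model); gate_w: (d_model, n_experts).
    Returns per-token expert indices and softmax weights over just those k.
    """
    logits = tokens @ gate_w                        # (n_tokens, n_experts)
    top = np.argsort(logits, axis=1)[:, -k:]        # indices of the k largest
    chosen = np.take_along_axis(logits, top, axis=1)
    chosen -= chosen.max(axis=1, keepdims=True)     # stable softmax
    weights = np.exp(chosen) / np.exp(chosen).sum(axis=1, keepdims=True)
    return top, weights

rng = np.random.default_rng(0)
n_experts = 64
ids, w = top_k_routing(rng.normal(size=(8, 16)), rng.normal(size=(16, n_experts)))
print(f"active experts per token: 2/{n_experts} -> {1 - 2 / n_experts:.1%} sparse")
```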
It seems plausible to me that a Manhattan Project could scale to models with a quintillion parameters, i.e., roughly 10,000,000x GPT-3, within 15 years, using only lightweight training sparsity. That’s not to say it’s necessarily feasible, but I can’t rule out technology allowing that level of scaling.
When I cite scaling-limit numbers, I’m mostly deferring to my personal discussions with Tim Dettmers (whose research is on hardware, sparsity, and language models), so check out his comment on this post for more details on his view of why we’ll hit scaling limits soon!
I disagree with that post and its first two links so thoroughly that any direct reply or commentary on it would be more negative than I’d like to be on this site. (I do appreciate your comment, though; don’t take this as discouragement from clarifying your position.) I don’t want to leave it at that, so instead let me give a quick thought experiment.
A neuron’s signal-hop latency is about 5 ms, and in that time light can travel about 1500 km, roughly the radius of the moon. You could build a machine literally the size of the moon, floating in deep space, before the speed of light between the neurons became a problem relative to the chemical signals in biology, as long as no single hop spanned more than halfway through. Unlike today’s silicon chips, a system like this would be restricted by the same latency-propagation limits that the brain is, but still: it’s the size of the moon. You could hook this moon-sized computer to a human-shaped shell on Earth, and as long as the computer was directly overhead, the human body could be as responsive and fully updatable as a real human.
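The arithmetic behind that thought experiment, as a quick check (lunar radius from standard references):

```python
C_KM_S = 299_792.458     # speed of light in vacuum, km/s
NEURON_HOP_S = 0.005     # ~5 ms per neural signal hop
MOON_RADIUS_KM = 1_737   # mean lunar radius

hop_km = C_KM_S * NEURON_HOP_S
print(f"Light travels {hop_km:.0f} km per neuron-hop interval")    # ~1499 km
print(f"That's {hop_km / MOON_RADIUS_KM:.0%} of the moon's radius")
```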
While such a computer is obviously impractical on many levels, I find it a good frame of reference for thinking about how computers scale upwards, much like Feynman’s There’s Plenty of Room at the Bottom was a good frame of reference for scaling down, written back when transistors were still wired by hand. In particular, the speed of light is not a problem, and will never become one, except where it’s a resource we use inefficiently.
That sharp peak feels really suspicious.
Yes, the peak comes from (1) a relatively high (25%) confidence that current methods will lead to AGI and (2) my view that we’ll achieve Prosaic AGI in a pretty small (~13-year) window if it’s possible, after which it will be quite unlikely that scaling current methods will result in AGI (e.g., due to hitting scaling limits or a fundamental technical problem).