> Overall, the additional learning efficiency gains from these sources suggest that effective limits are 4–12 OOMs above the human brain. The high end seems extremely high, and we think there’s some risk of double counting some of the gains here in the different buckets, so we will bring down our high end to 10 OOMs.
When I count I get the lower bound 4 OOMs but the upper bound as 5+1+2.5+2+1+1.5+1=14 OOMs, rather than 12 OOMs. (On “Low fraction of data”, you say “at least 3–10”, so maybe the upper bound should really be higher there? 14 is assuming that it’s 1.)
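For transparency, here’s the sum, with the per-bucket upper bounds as I read them off the post:

```python
# Per-bucket upper bounds from the post, in OOMs
# (taking "Low fraction of data" as 1, per the caveat above)
buckets = [5, 1, 2.5, 2, 1, 1.5, 1]
print(sum(buckets))  # 14.0
```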
FWIW, I’m also not sure that I find the high end to be so extremely high. We’re talking about the limits of what’s possible with arbitrarily powerful technology. We don’t really have reference points for that kind of thing. And it’s not totally unheard of to make this kind of progress — I think we made ~8 OOMs of progress in the cost of transistors between 1969 and 2005 (based on the appendix of this paper), though obviously this is somewhat cherry-picked.
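For scale, that example implies an average rate of (using the ~8 OOMs figure from my comment, so only as good as that estimate):

```python
# Implied average rate of progress in the transistor-cost example
ooms = 8
years = 2005 - 1969  # 36 years
print(round(ooms / years, 2))  # 0.22 OOMs/year on average
```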
> One reason for scepticism here is that these gains in training efficiency would be much bigger than anything we’ve seen historically. Epoch reports the training efficiency for GPT-2 increasing by 2 OOMs in a three year period, but doesn’t find examples of much bigger gains over any time period.
Are you referring to GPT-2-level performance, here? If so, that would be an example of “downwards” progress rather than “upwards” progress, right? Where we expect less “downwards” progress to be possible. I guess it’s harder to measure the “upwards” ones.
> When I count I get the lower bound 4 OOMs but the upper bound as 5+1+2.5+2+1+1.5+1=14 OOMs, rather than 12 OOMs
Oops, you’re completely right. Great catch! For now I’m going to just leave the mainline results at 10 OOMs and edit the initial calc to land on 14 OOMs.
But I do think it’s worth exploring how sensitive the results are to this. I used the online tool to rerun the analysis increasing the upper bound from 10 OOMs to 16 OOMs. (That’s adding in 2 OOMs extra for the possibility some of my upper-bound ranges were too low, like the example you flagged.)
Let’s compare the key results. First, the old results with 10 OOMs:
Second, the new results with 16 OOMs:
So the probability of >3 years of progress in <1 year or <4 months doesn’t change much.
But the probability of >10 years of progress in <1 year or <4 months goes up a decent amount, a ~10% increase from a baseline of ~15%.
And the probability of >15 years in <4 months rises from ~0% to ~10%.
I think using 16 OOMs instead of 10 OOMs here is defensible (though it seems too high to me), so it would be reasonable to bump up your numbers here somewhat.
> One reason for scepticism here is that these gains in training efficiency would be much bigger than anything we’ve seen historically.
> Are you referring to GPT-2-level performance, here? If so, that would be an example of “downwards” progress rather than “upwards” progress, right? Where we expect less “downwards” progress to be possible. I guess it’s harder to measure the “upwards” ones.
This is a great point that I hadn’t appreciated. Epoch looked for algorithmic progress at a fixed capability level and never found improvements much bigger than 2 OOMs that they were confident in. (I think that holds both for GPT-2-level performance and for other levels, though the paper focuses on GPT-2 level.)
I’m not aware of data on upwards improvements. It should be possible in principle to look at this: how good a model can you train with GPT-2-level/GPT-3-level compute today? How much compute would you have needed for that when GPT-2/GPT-3 were first developed, extrapolating the scaling curves at the time?
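A sketch of the measurement I have in mind, assuming the scaling law fitted at release time had the usual power-law form L(C) = a·C^(−b). The coefficients and the “modern run” numbers below are made up purely for illustration, not taken from any paper:

```python
import math

# Hypothetical release-era scaling law L(C) = a * C**(-b)
# (coefficients invented for illustration only)
a, b = 100.0, 0.05

def old_compute_needed(loss):
    """Compute the old scaling law said you'd need to reach `loss`."""
    return (a / loss) ** (1 / b)

# Hypothetical modern run: loss 2.8 reached with 1e21 FLOP
modern_compute, modern_loss = 1e21, 2.8

equiv_old = old_compute_needed(modern_loss)
gain_ooms = math.log10(equiv_old / modern_compute)
print(f"upwards efficiency gain: {gain_ooms:.1f} OOMs")
```

The ratio between the compute the old curve says you’d have needed and the compute actually used is the “upwards” efficiency gain.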
So this does weaken the argument against a large upper bound here, thanks.
> FWIW, I’m also not sure that I find the high end to be so extremely high. We’re talking about the limits of what’s possible with arbitrarily powerful technology. We don’t really have reference points for that kind of thing. And it’s not totally unheard of to make this kind of progress — I think we made ~8 OOMs of progress in cost of transistors between 1969 and 2005
That is a great example, though I think it starts from a place where the technology is much worse than that produced by evolution. My recollection is that the brain is still more efficient at FLOP per Joule than our best chips today. It would be interesting to estimate how far you could go beyond brain efficiency before hitting a limit.
Here I roughly estimate you could get 3e19 FLOP/J within Landauer’s limit. (You could go further with reversible computing.) Compare the brain, which does ~1e15 FLOP each second on ~20 Joules --> 5e13 FLOP/J. (Which is a bit better than today’s chips, yes.) So that leaves ~6 OOMs of progress above the brain before hitting limits. That’s less than 14 OOMs, but only ~2x less in log space.
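Double-checking the headroom arithmetic, using the numbers above:

```python
import math

# Numbers from the comment above
landauer_flop_per_j = 3e19     # rough Landauer-limited estimate, irreversible computing
brain_flop_per_j = 1e15 / 20   # ~1e15 FLOP/s on ~20 W --> 5e13 FLOP/J

headroom_ooms = math.log10(landauer_flop_per_j / brain_flop_per_j)
print(round(headroom_ooms, 1))  # 5.8, i.e. ~6 OOMs above the brain
```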