I think fitting the METR scaling law with effective compute on the x-axis is slightly wrong. I agree that if people completely stopped investing in AI, or if you got to a point where AI massively sped up progress, the trend would break, but I think that before then, a straight line is a better model than modeling that tries to take into account compute investment slowing down, or doublings getting easier, significantly before automated coders.
My best guess is that if you had done the same exercise with semiconductors in 1990, you would have made bad predictions. For example, Moore’s law doesn’t hold that well with semiconductor log(revenue + investments) on the x-axis (according to this plot generated by Claude).
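To make the comparison concrete, here is a rough sketch of the kind of fit I have in mind; the data file and column names are placeholders, not the actual series behind that plot. It fits log(transistors per chip) against calendar year and against log(cumulative revenue + investment), and reports which gives the better straight line.

```python
import numpy as np
import pandas as pd

# Hypothetical data: columns year, transistors, annual_investment
# (annual_investment standing in for an annual revenue + investment series).
df = pd.read_csv("semiconductor_trend.csv")

y = np.log(df["transistors"].to_numpy(dtype=float))

def r_squared(x, y):
    """R^2 of an ordinary least-squares line y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    return 1.0 - resid.var() / y.var()

x_year = df["year"].to_numpy(dtype=float)
x_cum_invest = np.log(df["annual_investment"].cumsum().to_numpy(dtype=float))

print("R^2 with calendar year on the x-axis:            ", r_squared(x_year, y))
print("R^2 with log cumulative investment on the x-axis:", r_squared(x_cum_invest, y))
```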
(I think METR time horizons might not be the right y-axis, maybe more akin to clock speed than to the number of transistors on a chip, and AI-speeding-up-AI slightly before automated coders is important when forecasting AGI, so I don’t claim that just saying “the line will be straight” removes the need for some other forms of modeling.)
The fit is notably better with “cumulative investment over time” on the x-axis, though calendar years still produces a slightly better fit.
I’ve cut off the fit at 2010, about when the original version of Moore’s law stops. If you try to project out after 2010, then I think cumulative investment would do better, but I think only because investment slowed in response to Moore’s law dying.
(Doing the fit with an investment series lagged by 3 years doesn’t make any important difference.)
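(Concretely, those two variants amount to something like the sketch below; the file and column names are placeholders, not the actual series I used.)

```python
import numpy as np
import pandas as pd

# Hypothetical columns: year, transistors, annual_investment.
df = pd.read_csv("semiconductor_trend.csv")

# (1) cut the fit off at 2010, roughly where the original form of Moore's law stops
pre2010 = df[df["year"] <= 2010].reset_index(drop=True)

# (2) lag the investment series by 3 years before taking the cumulative sum
lagged_cum = pre2010["annual_investment"].shift(3).cumsum()
mask = lagged_cum.notna()

x = np.log(lagged_cum[mask].to_numpy())
y = np.log(pre2010.loc[mask, "transistors"].to_numpy())

slope, _ = np.polyfit(x, y, 1)
# log-log slope = doublings of transistor count per doubling of lagged cumulative investment
print("log-log slope:", slope)
```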
Interesting, thanks! Why though? Like, if a massive increase or decrease in investment would break the trend, shouldn’t a moderate increase or decrease in investment bend the trend?
The first graph you share is fascinating to me, because normally I’d assume that the Wright’s law / experience curve for a technology gets harder over time, i.e. you start out with some sort of “N doublings of performance for every doubling of cumulative investment” number that gradually gets smaller over time as you approach limits. But here it seems that N has actually been increasing over time!
My guess about what’s going on is that something like “serial progress” (maybe within the industry, maybe also tied to progress in the rest of the world) matters a lot, and so first-order predictions with calendar time on the x-axis are often surprisingly good. There are effects in both directions fighting against the straight line (positive and negative feedback loops, some things getting harder over time, and some things getting easier over time), but they usually roughly cancel out unless they are very big.
In the case of semiconductors, one effect that could push progress up is that better semiconductors might help you build better semiconductors (e.g. the design process uses compute-heavy computer-aided design tools, if I understand correctly)?
Although superficially similar, I think these are asking different kinds of question.
Chips are a classic production-efficiency ‘industrial learning curve’. Wright’s law is the generic pattern that often holds there: a roughly constant efficiency gain for each order of magnitude of cumulative ‘units produced’. As it happens, cumulative chip production has grown exponentially over time, so you also get a smooth line if you plot against time: that’s Moore’s law.
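Spelled out, the link is roughly the following (a sketch: C is whatever cost/efficiency metric is improving, Q is cumulative units produced, b is the experience-curve exponent, g is the production growth rate; none of these are fitted to any particular dataset):

```latex
% needs amsmath (for align and \text); symbols are illustrative only
\begin{align}
  C(Q) &\propto Q^{-b}  &&\text{(Wright's law: power law in cumulative units)}\\
  Q(t) &\propto e^{g t} &&\text{(cumulative production grows exponentially in time)}\\
  \Rightarrow\ C(t) &\propto e^{-b g t} &&\text{(exponential improvement in calendar time, i.e.\ a Moore's-law-style trend)}
\end{align}
```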
We might expect similar learning curve patterns to hold for something like ‘cost per token’ vs ‘tokens served’. I’m not aware of definitive public data on those, but superficially the pricing looks to support that. (Incidentally this is another reason I think ‘experimental throughput’, including serving at scale, is crucial to compute efficiency gains.)
In contrast, time horizons are more of a ‘scaling law’ question, where something like (‘effective’) input scale is the right kind of independent variable to track.
I read an article about the history of extreme ultraviolet lithography (http://dx.doi.org/10.1116/1.2127950; the full PDF is on Sci-Hub) which says that soft X-ray reduction lithography using multilayer-coated Schwarzschild optics was demonstrated in 1986.
3 nm process nodes have a contacted gate pitch of 48 nanometers and a tightest metal pitch of 24 nanometers, so light with a wavelength near 13.5 nm is needed to pattern the circuits onto the chip dies with sufficient precision.
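As a rough sanity check on that, the standard Rayleigh-criterion estimate of the printable half-pitch is k1 × λ / NA; plugging in commonly quoted values for current 0.33-NA EUV scanners (assumed for illustration, not taken from the linked article) lands close to that 24 nm metal pitch:

```python
# Back-of-the-envelope Rayleigh-criterion check: printable half-pitch ~= k1 * wavelength / NA.
# The NA and k1 values are commonly quoted figures for current 0.33-NA EUV scanners,
# assumed here for illustration (not taken from the linked article).
wavelength_nm = 13.5        # EUV wavelength
numerical_aperture = 0.33   # NA of current production EUV scanners
k1 = 0.30                   # near the practical single-exposure limit

half_pitch_nm = k1 * wavelength_nm / numerical_aperture
print(f"minimum half-pitch ~ {half_pitch_nm:.1f} nm, "
      f"i.e. minimum pitch ~ {2 * half_pitch_nm:.0f} nm")
# ~12.3 nm half-pitch, ~25 nm pitch: in the right ballpark for a ~24 nm tightest metal pitch.
```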
Of course, there were many practical engineering challenges in getting this concept to work at scale (there is a video by Veritasium which discusses this in more detail), and I think very few people making compute forecasts in 1990 would have accurately predicted the trajectory of this technology.