One reason I put a bit more weight on short / medium horizons was that even if transformative tasks are long-horizon, you could use self-supervised pretraining to do most of the learning, thus reducing the long-horizon data requirements. Now that Scaling Laws for Transfer is out, we can use it to estimate how much this might help. So let’s do some bogus back-of-the-envelope calculations:
We’ll make the very questionable assumption that the law relating our short-horizon pretraining task and our long-horizon transformative task will still be $D_T = k \cdot (D_F)^{\alpha} \cdot N^{\beta}$ with $k = 1.9 \times 10^4$, $\alpha = 0.18$, and $\beta = 0.38$.
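To make the arithmetic below easy to check, here’s that assumed law as a minimal Python sketch (the constants are the paper’s fitted values for text-to-code transfer; carrying them over to a transformative task is exactly the questionable assumption):

```python
# Assumed transfer scaling law: D_T = k * D_F**alpha * N**beta, with
# constants from Scaling Laws for Transfer (pretrain on text, finetune
# on code). Reusing them for a transformative task is the questionable
# assumption, not an established fact.
def transferred_data(d_f, n, k=1.9e4, alpha=0.18, beta=0.38):
    """Effective data D_T transferred from pretraining, given D_F
    finetuning data points and N model parameters."""
    return k * d_f**alpha * n**beta
```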
Let’s assume that the long-horizon transformative task has a horizon that is 7 orders of magnitude larger than the short-horizon pretraining task. (The full range is 9 orders of magnitude.) Let the from-scratch compute of a short-horizon transformative task be $M_{\text{short}}$. Then the from-scratch compute for our long-horizon task would be $M_{\text{short}} \cdot 10^7$, if we had to train on all $10^{13}$ data points.
Our key trick is to make the model larger and pretrain it on a short-horizon task, reducing the amount of long-horizon data we need. Suppose we multiply the model size by a factor of $c$. We’ll estimate total compute as a function of $c$, and then find the value that minimizes it.
Making the model bigger increases the necessary (short-horizon) pretraining data by a factor of $c^{0.8}$, so pretraining compute goes up by a factor of $c^{1.8}$ (compute scales as model size times data). For transfer, $D_T$ goes up by a factor of $c^{0.38}$, since $D_T \propto N^{0.38}$.
We still want the effective data for the long-horizon transformative task to be $D_E = D_T + D_F = 10^{13}$ data points. To actually get computational savings, we need to get this primarily from transfer, i.e. we want $D_T = 1.9 \times 10^4 \cdot (D_F)^{0.18} \cdot (3 \times 10^{14} \cdot c)^{0.38} \approx 10^{13}$, which we can solve to get $D_F \approx 7.74 \times 10^{17} \cdot c^{-2.1}$.
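As a sanity check on that algebra, we can invert the law for $D_F$ (the baseline model size $N = 3 \times 10^{14} \cdot c$ parameters is carried over from the estimate above):

```python
# Invert D_T = k * D_F**alpha * (3e14 * c)**beta, setting D_T = 1e13,
# to get the finetuning data D_F needed at model-size multiplier c.
def finetuning_data_needed(c, d_t=1e13, k=1.9e4, alpha=0.18, beta=0.38):
    n = 3e14 * c  # model size after scaling up by a factor of c
    return (d_t / (k * n**beta)) ** (1 / alpha)

print(finetuning_data_needed(1.0))   # ~7.7e17, matching D_F ~ 7.74e17 * c**-2.1
print(finetuning_data_needed(10.0))  # ~6e15, i.e. down by a factor of ~10**2.1
```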
Then the total compute is given by $M_{\text{short}} \cdot c^{1.8} + M_{\text{short}} \cdot 10^7 \cdot \frac{D_F}{10^{13}} = M_{\text{short}} \cdot \left( c^{1.8} + 7.74 \times 10^{11} \cdot c^{-2.1} \right)$.
Minimizing this gives us $c \approx 1163$, in which case total compute is $M_{\text{short}} \cdot 6 \times 10^5$.
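The minimization is one line of calculus: setting the derivative of $c^{1.8} + 7.74 \times 10^{11} \cdot c^{-2.1}$ to zero gives $c = \left( \frac{2.1}{1.8} \cdot 7.74 \times 10^{11} \right)^{1/3.9}$. A quick numerical check:

```python
# Minimize f(c) = c**1.8 + A * c**-2.1 via f'(c) = 0:
# 1.8 * c**0.8 = 2.1 * A * c**-3.1  =>  c = (2.1 * A / 1.8) ** (1 / 3.9)
A = 7.74e11
c_star = (2.1 * A / 1.8) ** (1 / 3.9)
total = c_star**1.8 + A * c_star**-2.1
print(c_star, total)  # c ~ 1163, total ~ 6.1e5 (in units of M_short)
```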
Thus, we’ve taken a baseline compute of $M_{\text{short}} \cdot 10^7$, and reduced it down to $M_{\text{short}} \cdot 6 \times 10^5$, a little over an order of magnitude speedup. This is solidly within my previous expectations (looking at my notes, I said “a couple of orders of magnitude”), so my timelines don’t change much.
Some major caveats:
The calculation is super bogus; there’s no reason to expect $k$, $\alpha$, and $\beta$ to be the same for TAI as for finetuning on code after pretraining on text; different values could wildly change the conclusion.
It’s not clear to me that this particular scaling law should be expected to hold for such large models.
There are lots of other reasons to prefer the short / medium horizon hypotheses (see my opinion above).