## [Question] How do scaling laws work for fine-tuning?

The scaling laws, at least as interpreted in Ajeya’s framework (an interpretation that seems to be basically endorsed by many people I respect on this matter), say that if you increase parameter count by an order of magnitude, you also need to increase training steps/data points by about an order of magnitude; otherwise you are wasting compute and could get the same performance with a smaller parameter count. For example, a 10^14-parameter model (roughly the size of the human brain) would need about 10^13 training steps/data points.

Now we have papers like this one claiming that pre-trained transformers can be fine-tuned to do well at completely different tasks (including different modalities!) by modifying only 0.1% of the parameters.

Does this mean the fine-tuning process can be thought of as training an NN that is 3 OOMs smaller, which would therefore need 3 OOMs fewer training steps according to the scaling laws? I’m guessing the answer is no, but I don’t know why, so I’m asking.
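
To make the naive reading concrete, here is a minimal sketch of the arithmetic. It assumes data requirements scale as a power law in parameter count, pinned to the 10^14 parameters / 10^13 data points example above, with an exponent of 1 ("one OOM of data per OOM of parameters"). The function name, the anchor constants, and the exponent are my illustrative assumptions, not something taken from Ajeya’s report or the paper, and the sketch says nothing about whether the naive reading is actually right.

```python
import math

# Anchor point taken from the example in the post:
# a 1e14-parameter model needs about 1e13 training steps/data points.
ANCHOR_PARAMS = 1e14
ANCHOR_DATA = 1e13


def naive_data_needed(params, exponent=1.0):
    """Data points needed if data scales as params**exponent through the anchor.

    exponent=1.0 is an assumption standing in for "about one order of
    magnitude of data per order of magnitude of parameters".
    """
    return ANCHOR_DATA * (params / ANCHOR_PARAMS) ** exponent


full_params = 1e14
trainable_fraction = 1e-3  # 0.1% of the parameters are modified in fine-tuning
tuned_params = full_params * trainable_fraction

full_data = naive_data_needed(full_params)
tuned_data = naive_data_needed(tuned_params)

# Orders of magnitude of data "saved" under the naive reading.
ooms_saved = math.log10(full_data / tuned_data)

print(f"trainable parameters: {tuned_params:.0e}")
print(f"naive data requirement: {tuned_data:.0e}")
print(f"OOMs fewer data points: {ooms_saved:.1f}")
```

Under these assumptions the fine-tuned model is treated as a 10^11-parameter network needing about 10^10 data points, i.e. 3 OOMs fewer. Passing a different `exponent` shows how much the conclusion depends on the assumed scaling relationship.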

(If the answer is yes, how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?)