[Question] How do scaling laws work for fine-tuning?

The scaling laws, at least on the interpretation used in Ajeya’s framework (which seems to be endorsed by lots of people I respect on this matter), say roughly that if you increase parameter count by an order of magnitude, you also need to increase training steps/data points by about an order of magnitude; otherwise you are wasting compute and could get the same performance from a smaller model. For example, a 10^14-parameter model (roughly the size of the human brain) would need about 10^13 training steps/data points.
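To make the numbers concrete, here's a toy sketch (mine, not anything from Ajeya's report) of the assumed roughly-linear relationship between parameter count and required data points, with the constant implied by the example above:

```python
# Toy illustration of the scaling-law interpretation described above.
# Assumes data requirements scale roughly linearly with parameter count,
# with the ratio inferred from the 10^14-params / 10^13-data-points example.

def data_points_needed(params: float, ratio: float = 0.1) -> float:
    """Data points needed to train a model of `params` parameters without
    'wasting' compute, under the assumed roughly linear scaling."""
    return ratio * params

print(f"{data_points_needed(1e14):.0e}")  # ~1e13, the figure quoted above
print(f"{data_points_needed(1e15):.0e}")  # one more OOM of params -> one more OOM of data
```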

Now we have papers like this one claiming that pre-trained transformers can be fine-tuned to do well at completely different tasks (including different modalities!) by modifying only 0.1% of the parameters.

Does this mean that the fine-tuning process can be thought of as training an NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws? I’m guessing the answer is no, but I don’t know why, so I’m asking.
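To spell out the arithmetic I have in mind (purely illustrative; whether treating the fine-tuned 0.1% as the "effective" parameter count is legitimate is exactly what I'm unsure about):

```python
# Arithmetic behind the question, under the same assumed linear scaling as above.

full_params = 1e14        # brain-sized model from the earlier example
tuned_fraction = 1e-3     # the ~0.1% of parameters modified during fine-tuning
effective_params = full_params * tuned_fraction  # 1e11, i.e. 3 OOMs smaller

data_per_param = 0.1      # ratio implied by the 10^14 / 10^13 example above
print(f"{data_per_param * full_params:.0e}")       # ~1e13 data points for pre-training
print(f"{data_per_param * effective_params:.0e}")  # ~1e10 if fine-tuning behaves like a 3-OOM-smaller model
```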

(If the answer is yes, how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?)
