Jones (2021) and EpochAI both estimate that you need to scale up inference by roughly 1,000x to reach the same capability you’d get from a 100x scale-up of training.
This is also confusing to me. Suppose we scaled up training a hundredfold. Then we are either overtraining the model (which does not increase performance beyond a level determined by the model’s size!) or working with a different model that has about a hundred times more parameters. What, then, does scaling up inference mean: a tenfold increase in the number of tokens used?
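For what it’s worth, the arithmetic can be made explicit with the usual back-of-envelope approximations (training costs roughly 6ND FLOPs for N parameters and D training tokens; inference costs roughly 2N FLOPs per generated token). A minimal sketch, assuming those formulas and an illustrative baseline model (the specific sizes below are my assumptions, not figures from Jones or Epoch AI):

```python
# Back-of-envelope compute accounting with the standard approximations
# (illustrative assumptions, not figures from either source):
#   training FLOPs  ~= 6 * N * D   (N params, D training tokens)
#   inference FLOPs ~= 2 * N * T   (T tokens generated at inference)

def train_flops(n_params, n_train_tokens):
    return 6 * n_params * n_train_tokens

def infer_flops(n_params, n_gen_tokens):
    return 2 * n_params * n_gen_tokens

# Hypothetical baseline: 10B params, 200B training tokens, 1k output tokens.
N, D, T = 10_000_000_000, 200_000_000_000, 1_000

# Reading the "100x training scale-up" as a 100x-larger model (same data):
N_big = 100 * N
print(train_flops(N_big, D) / train_flops(N, D))  # -> 100.0

# A 1,000x inference budget, measured in the small model's per-token cost,
# buys the 100x-larger model only 10x as many output tokens -- the
# "tenfold increase in the number of tokens" reading:
print(infer_flops(N_big, 10 * T) == 1000 * infer_flops(N, T))  # -> True

# Spent on the small model itself, the same budget buys 1,000x the tokens
# (repeated sampling, longer chains of thought, etc.):
print(infer_flops(N, 1000 * T) == 1000 * infer_flops(N, T))  # -> True
```

So under these approximations, both readings are consistent: “1,000x inference” is 1,000x more tokens from the original model, or equivalently about 10x more tokens from a 100x-larger one.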