This paper (https://arxiv.org/abs/2010.14701) shows that constant terms also appear in other generative modelling settings, and relates them to the entropy of the dataset, which is the floor you can’t compress beyond. It also gives empirical evidence that downstream performance, in things like “finetuning a generative model to be a classifier”, continues to improve as the loss asymptotes to the constant. From a physics perspective, the constant term and the coefficients on the power-law pieces are “non-universal data”, while the exponent tells you more about the model, training scheme, problem, etc.
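To make the "power law plus constant" picture concrete, here is a rough sketch of fitting a curve of the form L(x) = L_inf + (x0 / x)^alpha, which is the parameterisation I believe that paper uses. The data below is synthetic and the function name `fit_power_law_plus_constant` is made up for illustration; the grid scan over L_inf is just one simple way to do the fit, not the paper's procedure.

```python
import numpy as np

def fit_power_law_plus_constant(x, loss, n_grid=2000):
    """Fit loss ~ L_inf + (x0 / x) ** alpha by scanning candidate L_inf values.

    For each candidate constant, the reducible part (loss - L_inf) should be a
    straight line in log-log space, so we fit a line and keep the best one.
    """
    x = np.asarray(x, dtype=float)
    loss = np.asarray(loss, dtype=float)
    log_x = np.log(x)
    best = None
    # L_inf must sit strictly below every observed loss so the log is defined.
    for l_inf in np.linspace(0.0, loss.min(), n_grid, endpoint=False):
        log_reducible = np.log(loss - l_inf)
        slope, intercept = np.polyfit(log_x, log_reducible, 1)
        resid = log_reducible - (slope * log_x + intercept)
        sse = float(np.sum(resid ** 2))
        if best is None or sse < best[0]:
            alpha = -slope                    # log(L - L_inf) = alpha*log(x0) - alpha*log(x)
            x0 = np.exp(intercept / alpha)    # assumes alpha != 0
            best = (sse, l_inf, alpha, x0)
    _, l_inf, alpha, x0 = best
    return l_inf, alpha, x0

# Synthetic, noise-free example (assumed values, not taken from the paper).
true_l_inf, true_alpha, true_x0 = 2.0, 0.3, 1e5
x = np.logspace(3, 9, 13)  # e.g. parameter count or compute
loss = true_l_inf + (true_x0 / x) ** true_alpha

l_inf, alpha, x0 = fit_power_law_plus_constant(x, loss)
print(f"L_inf ~ {l_inf:.3f}, alpha ~ {alpha:.3f}, x0 ~ {x0:.3g}")
```

In the language of the comment above, `l_inf` and `x0` are the non-universal pieces, and `alpha` is the exponent you would compare across models or training setups.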
Thanks, I hadn’t seen that! Added it to the post