One simple example: pretraining an LLM[1] will converge to an approximation of Bayesian inference over the training distribution[2]. Any part of the output distribution that is clearly off the Pareto frontier of approximation (given the architecture, training scheme, and data) is akin to an anomalous, persistent region of high pressure within a gas: it's an unstable, high-energy state with respect to optimization.
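A minimal numerical sketch of that claim, assuming the pretraining objective is expected cross-entropy (a proper scoring rule) over a toy four-token distribution: the expected loss is minimized exactly at the training distribution itself, so any persistent deviation from it sits strictly above the floor (Gibbs' inequality). The distribution and numbers here are illustrative only.

```python
import numpy as np

# Toy "training distribution" over 4 tokens (illustrative values).
p = np.array([0.5, 0.25, 0.15, 0.10])

def expected_cross_entropy(q):
    """Expected next-token loss E_{x~p}[-log q(x)] for a model distribution q."""
    return -(p * np.log(q)).sum()

# Loss when the model matches the training distribution: equals the entropy of p,
# which is the floor of the objective.
floor = expected_cross_entropy(p)
print(floor)

# Any other distribution pays a strictly positive penalty (Gibbs' inequality),
# i.e. it sits in a "high energy" state relative to the optimization pressure.
rng = np.random.default_rng(0)
for _ in range(3):
    q = rng.dirichlet(np.ones(4))
    print(expected_cross_entropy(q) - floor)  # always > 0
```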
I can't tell you with certainty how the weights will move during training, but I can tell you where the process is headed at a higher level of description.
There are, however, degrees of freedom outside the training distribution that aren't bound by the same dynamics: adversarial intervention to build and maintain circuitry outside the training distribution could survive the pressure, for example.
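A toy illustration of such an unbound degree of freedom (a deliberately simplified sketch, not a claim about real transformer circuits): assume a feature that never activates on the training distribution; the gradient flowing through it is identically zero, so whatever weight sits behind it never feels the optimization pressure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model: y_hat = w_on * x_on + w_off * x_off.
# x_off is a feature that is always 0 on the training distribution,
# so w_off is a degree of freedom the training pressure never touches.
w_on, w_off = 0.0, 5.0        # pretend w_off was set "adversarially" before training

for step in range(1000):
    x_on = rng.normal()
    x_off = 0.0               # never active in the training data
    y = 2.0 * x_on            # training target
    y_hat = w_on * x_on + w_off * x_off
    err = y_hat - y
    # SGD on squared-error loss; gradient w.r.t. each weight is err * input:
    w_on  -= 0.05 * err * x_on    # pushed toward 2.0 by the data
    w_off -= 0.05 * err * x_off   # gradient is identically 0, so it never moves

print(w_on, w_off)  # ~2.0, still 5.0
```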
[1] Assuming a proper scoring rule, etc.
[2] Relevant xkcd: link.