One simple example: pretraining an LLM[1] will converge to an approximation of Bayesian inference over the training distribution[2]. Any part of the output distribution that sits clearly off the Pareto frontier of approximation (given the architecture, training scheme, and data) would be akin to an anomalous, persistent region of high pressure in a gas: an unstable, high-energy state with respect to the optimization.
I can’t tell you with certainty how the weights will move during training, but I can tell you, at a higher level, where the process is headed.
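A rough sketch of why (my framing, and notation I'm introducing here: $q_\theta$ for the model's distribution, $p_{\text{data}}$ for the training distribution): the population pretraining objective is cross-entropy against the training distribution, which decomposes as

$$\mathbb{E}_{x \sim p_{\text{data}}}\big[-\log q_\theta(x)\big] = H(p_{\text{data}}) + D_{\mathrm{KL}}\big(p_{\text{data}} \,\|\, q_\theta\big),$$

so the loss bottoms out exactly when $q_\theta$ matches $p_{\text{data}}$, up to what the architecture and data can support. Any systematic, persistent deviation keeps the KL term strictly positive, and that residual is the "high pressure" the optimizer keeps pushing against.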
Throwing some weight behind this: I’ve been impressed by Alex. He’s got a rare combination of thoughtfulness, orientation to important causes, and savvy deliberative pragmatism in politics.