Reward is not the optimization target.
The optimization target is the Helmholtz free energy functional in the conductance-corrected Wasserstein metric for the step-size effective loss potential in the critical batch size regime for the weight-initialization distribution as prior up to solenoidal flux corrections
Nah, I buy that they’re up to some wild stuff in the gradient descent dynamics / singular learning theory subfield, but solenoidal flux correction has to be a bit. The emperor has no clothes!
Cycling in GANs/self-play?
That may be true[1]. But it doesn’t seem like a particularly useful answer?
“The optimization target is the optimization target.”
For the outer optimiser that builds the AI
I think having all of this in mind as you train is actually pretty important. That way, when something doesn’t work, you know where to look:
Am I exploring enough, or stuck always pulling the first lever? (free energy)
Is it biased for some reason? (probably the metric)
Is it stuck not improving? (step or batch size)
Weight-initialization isn’t too helpful to think about yet (other than avoiding explosions at the very beginning of training, and maybe a little for transfer learning), but we’ll probably get hyper neural networks within a few years.
Would you like a zesty vinaigrette or just a sprinkling of more jargon on that word salad?
Reward is not the optimization target (during pretraining).
The optimization target (during pretraining) is the minimization of the empirical cross-entropy loss L = -∑log p(xᵢ|x₁,...,xᵢ₋₁), approximating the negative log-likelihood of the next-token prediction task under the autoregressive factorization p(x₁,...,xₙ)=∏p(xᵢ|x₁,...,xᵢ₋₁). The loss is computed over discrete tokens from subword vocabularies, averaged across sequences and batches, with gradient-based updates minimizing this singular objective. The optimization proceeds through multi-stage curricula: initial pretraining minimizing perplexity, followed by context-extension phases maintaining the same cross-entropy objective over longer sequences, and quality-annealing stages that reweight the loss toward higher-quality subsets while preserving the fundamental next-token prediction target.
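A minimal sketch of that pretraining objective in PyTorch, assuming the usual shift-by-one setup; the embedding-plus-linear stand-in for the model, the vocabulary size, and the random token batch are placeholders I'm introducing, not anything from the thread:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch_size, d_model = 100, 16, 4, 32

embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

# Random token ids stand in for a tokenized text batch.
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# Predict token i from what comes before it: shift inputs and targets by one.
hidden = embed(tokens[:, :-1])           # placeholder for a transformer stack
logits = lm_head(hidden)                 # (batch, seq_len - 1, vocab)
targets = tokens[:, 1:]

# L = -mean over tokens and batch of log p(x_i | x_1, ..., x_{i-1}).
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                          # a gradient update would follow
```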
The post-training optimization target is maximizing expected reward (under distributional constraints). Supervised fine-tuning first minimizes cross-entropy loss on target completions from instruction-response pairs, with optional prompt-masking excluding input tokens from the loss computation. Subsequent alignment introduces the constrained objective max_π E_x~π[R(x)] - βD_KL[π(x)||π_ref(x)], balancing reward maximization against divergence from the reference policy. This manifests through varied algorithmic realizations: Proximal Policy Optimization maximizes the clipped surrogate objective L^CLIP(θ) = E[min(rₜ(θ)Âₜ, clip(rₜ(θ), 1-ε, 1+ε)Âₜ)]; Direct Preference Optimization reformulates to minimize -E_(x_w,x_l)~D[log σ(β log π(x_w)/π_ref(x_w) - β log π(x_l)/π_ref(x_l))]; best-of-N sampling maximizes E[R(x*)] where x* = argmax_{x∈{x₁,...,xₙ}} R(x); Rejection Sampling Fine-tuning minimizes cross-entropy on the subset {x : R(x) > τ}; Kahneman-Tversky Optimization targets E[w(R(x))log π(x)] with prospect-theoretic weighting; Odds Ratio Preference Optimization combines -log π(x_w) - λ log[π(x_w)/(π(x_w) + π(x_l))]. The reward functions R(x) themselves are learned objectives, typically parameterized by neural networks minimizing E_(x_w,x_l)~D[-log σ(r(x_w) - r(x_l))] under the Bradley-Terry preference model, with rewards sourced from human annotations, AI-generated preferences, or constitutional specifications encoded as differentiable objectives.
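Of that post-training zoo, the DPO term is compact enough to sketch on its own. This assumes we already have the summed log-probabilities of each chosen/rejected completion under the policy and a frozen reference model; every name and value below is a hypothetical placeholder:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-E[log σ(β(log π(x_w)/π_ref(x_w) - log π(x_l)/π_ref(x_l)))]."""
    chosen_logratio = policy_logp_w - ref_logp_w      # log π(x_w)/π_ref(x_w)
    rejected_logratio = policy_logp_l - ref_logp_l    # log π(x_l)/π_ref(x_l)
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()               # average over pairs

# Toy usage: made-up summed log-probs for a batch of 8 preference pairs.
fake = {name: torch.randn(8) for name in ("pw", "pl", "rw", "rl")}
loss = dpo_loss(fake["pw"], fake["pl"], fake["rw"], fake["rl"])
```

The Bradley-Terry reward-model loss quoted at the end of the comment has the same -log σ(margin) shape, just with r(x_w) - r(x_l) as the margin.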
What are solenoidal flux corrections in this context?
for the weight-initialization distribution as prior
The bits of that I understand seem accurate, but it is also not possible in the general case to predict (without doing the training run) how a given random initialization will affect what the final model looks like.
Which might have been the point you were trying to make, not sure.
I like this take, especially its precision, though I disagree in a few places.
conductance-corrected Wasserstein metric
This is the wrong metric, but I won’t help you find the right one.
the step-size effective loss potential
critical batch size regime
You can lower the step-size and increase the batch-size as you train to keep the perturbation bounded (a toy version of such a schedule is sketched after this comment). Like, sure, you could claim an ODE solver doesn’t give you the exact solution, but adaptive methods let you get within any desired tolerance.
for the weight-initialization distribution
This is another “hyper”parameter to feed into the model. I agree that, at some point, the turtles have to stop, and we can call that the initial weight distribution, though I’d prefer the term ‘interpreter’.
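The step-size/batch-size schedule referenced above might look something like this toy sketch; the decay law, doubling interval, and batch cap are arbitrary illustrative choices, not anything the commenter specifies:

```python
# Toy "anneal the step-size, grow the batch-size" schedule: the per-update
# noise scales roughly like lr / sqrt(batch), so decaying one while growing
# the other keeps the perturbation shrinking over training.
def schedule(step, lr0=3e-4, batch0=64, decay=1e-4, double_every=10_000,
             max_batch=4096):
    lr = lr0 / (1.0 + decay * step)                          # hyperbolic decay
    batch = min(max_batch, batch0 * 2 ** (step // double_every))
    return lr, batch

for step in (0, 10_000, 50_000, 100_000):
    lr, batch = schedule(step)
    print(f"step {step:>7,}: lr={lr:.2e}  batch={batch}")
```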
Hmm… you sure you’re using the right flux? Not all boundaries of boundaries are zero, and GANs (and self-play) probably use a 6-complex.