Reward is not the optimization target.
The optimization target is the Helmholtz free energy functional in the conductance-corrected Wasserstein metric for the step-size effective loss potential in the critical batch size regime for the weight-initialization distribution as prior up to solenoidal flux corrections
Nah, I buy that they’re up to some wild stuff in the gradient descent dynamics / singular learning theory subfield, but solenoidal flux correction has to be a bit. The emperor has no clothes!
Cycling in GANs/self-play?
That may be true[1]. But it doesn’t seem like a particularly useful answer?
“The optimization target is the optimization target.”
For the outer optimiser that builds the AI
I think having all of this in mind as you train is actually pretty important. That way, when something doesn’t work, you know where to look:
Am I exploring enough, or stuck always pulling the first lever? (free energy)
Is it biased for some reason? (probably the metric)
Is it stuck not improving? (step or batch size)
Weight-initialization isn’t too helpful to think about yet (other than avoiding explosions at the very beginning of training, and maybe a little for transfer learning), but we’ll probably get hyper neural networks within a few years.
Would you like a zesty vinaigrette or just a sprinkling of more jargon on that word salad?
Reward is not the optimization target (during pretraining).
The optimization target (during pretraining) is the minimization of the empirical cross-entropy loss L = -∑log p(xᵢ|x₁,...,xᵢ₋₁), approximating the negative log-likelihood of the next-token prediction task under the autoregressive factorization p(x₁,...,xₙ)=∏p(xᵢ|x₁,...,xᵢ₋₁). The loss is computed over discrete tokens from subword vocabularies, averaged across sequences and batches, with gradient-based updates minimizing this singular objective. The optimization proceeds through multi-stage curricula: initial pretraining minimizing perplexity, followed by context-extension phases maintaining the same cross-entropy objective over longer sequences, and quality-annealing stages that reweight the loss toward higher-quality subsets while preserving the fundamental next-token prediction target.
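A minimal sketch of that pretraining objective in PyTorch, assuming the usual shift-by-one setup; the embedding-plus-linear stand-in for the model, the vocabulary size, and the random token batch are placeholders I'm introducing, not anything from the thread:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch_size, d_model = 100, 16, 4, 32

embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

# Random token ids stand in for a tokenized text batch.
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# Predict token i from what comes before it: shift inputs and targets by one.
hidden = embed(tokens[:, :-1])           # placeholder for a transformer stack
logits = lm_head(hidden)                 # (batch, seq_len - 1, vocab)
targets = tokens[:, 1:]

# L = -mean over tokens and batch of log p(x_i | x_1, ..., x_{i-1}).
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                          # a gradient update would follow
```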
The post-training optimization target is maximizing expected reward (under distributional constraints). Supervised fine-tuning first minimizes cross-entropy loss on target completions from instruction-response pairs, with optional prompt-masking excluding input tokens from the loss computation. Subsequent alignment introduces the constrained objective max_π E_x~π[R(x)] - βD_KL[π(x)||π_ref(x)], balancing reward maximization against divergence from the reference policy. This manifests through varied algorithmic realizations: Proximal Policy Optimization maximizes the clipped surrogate objective L^CLIP(θ) = E[min(rₜ(θ)Âₜ, clip(rₜ(θ), 1-ε, 1+ε)Âₜ)]; Direct Preference Optimization reformulates to minimize -E_(x_w,x_l)~D[log σ(β log π(x_w)/π_ref(x_w) - β log π(x_l)/π_ref(x_l))]; best-of-N sampling maximizes E[R(x*)] where x* = argmax_{x∈{x₁,...,xₙ}} R(x); Rejection Sampling Fine-tuning minimizes cross-entropy on the subset {x : R(x) > τ}; Kahneman-Tversky Optimization targets E[w(R(x))log π(x)] with prospect-theoretic weighting; Odds Ratio Preference Optimization combines -log π(x_w) - λ log[π(x_w)/(π(x_w) + π(x_l))]. The reward functions R(x) themselves are learned objectives, typically parameterized by neural networks minimizing E_(x_w,x_l)~D[-log σ(r(x_w) - r(x_l))] under the Bradley-Terry preference model, with rewards sourced from human annotations, AI-generated preferences, or constitutional specifications encoded as differentiable objectives.
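Of that post-training zoo, the DPO term is compact enough to sketch on its own. This assumes we already have the summed log-probabilities of each chosen/rejected completion under the policy and a frozen reference model; every name and value below is a hypothetical placeholder:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-E[log σ(β(log π(x_w)/π_ref(x_w) - log π(x_l)/π_ref(x_l)))]."""
    chosen_logratio = policy_logp_w - ref_logp_w      # log π(x_w)/π_ref(x_w)
    rejected_logratio = policy_logp_l - ref_logp_l    # log π(x_l)/π_ref(x_l)
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()               # average over pairs

# Toy usage: made-up summed log-probs for a batch of 8 preference pairs.
fake = {name: torch.randn(8) for name in ("pw", "pl", "rw", "rl")}
loss = dpo_loss(fake["pw"], fake["pl"], fake["rw"], fake["rl"])
```

The Bradley-Terry reward-model loss quoted at the end of the comment has the same -log σ(margin) shape, just with r(x_w) - r(x_l) as the margin.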
What are solenoidal flux corrections in this context?
for the weight-initialization distribution as prior
The bits of that I understand seem accurate, but it is also not possible in the general case to predict (without doing the training run) how a given random initialization will affect what the final model looks like.
Which might have been the point you were trying to make, not sure.
I like this take, especially its precision, though I disagree in a few places.
conductance-corrected Wasserstein metric
This is the wrong metric, but I won’t help you find the right one.
the step-size effective loss potential
critical batch size regime
You can lower the step-size and increase the batch-size as you train to keep the perturbation bounded (a toy version of such a schedule is sketched after this comment). Like, sure, you could claim an ODE solver doesn’t give you the exact solution, but adaptive methods let you get within any desired tolerance.
for the weight-initialization distribution
This is another “hyper”parameter to feed into the model. I agree that, at some point, the turtles have to stop, and we can call that the initial weight distribution, though I’d prefer the term ‘interpreter’.
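The step-size/batch-size schedule referenced above might look something like this toy sketch; the decay law, doubling interval, and batch cap are arbitrary illustrative choices, not anything the commenter specifies:

```python
# Toy "anneal the step-size, grow the batch-size" schedule: the per-update
# noise scales roughly like lr / sqrt(batch), so decaying one while growing
# the other keeps the perturbation shrinking over training.
def schedule(step, lr0=3e-4, batch0=64, decay=1e-4, double_every=10_000,
             max_batch=4096):
    lr = lr0 / (1.0 + decay * step)                          # hyperbolic decay
    batch = min(max_batch, batch0 * 2 ** (step // double_every))
    return lr, batch

for step in (0, 10_000, 50_000, 100_000):
    lr, batch = schedule(step)
    print(f"step {step:>7,}: lr={lr:.2e}  batch={batch}")
```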
Hmm… you sure you’re using the right flux? Not all boundaries of boundaries are zero, and GANs (and self-play) probably use a 6-complex.