Roman Malov comments on Weak-To-Strong Generalization

Roman Malov 12 Nov 2025 0:02 UTC
1 point
such an assumption is not strong enough to guarantee that $P_{s}$ has a 50-50 coinflip hypothesis.
What do you mean by a 50-50 hypothesis? Is it $E_{1/2}$ such that $\forall n : P(T_{h_n} \mid E_{1/2}) = P(T_{t_n} \mid E_{1/2}) = 1/2$? If that’s the case, it doesn’t seem fair to ask the student to have such a hypothesis: the task is to learn to imitate the teacher, and the teacher doesn’t actually produce behavior like HTHHT; it can only produce either HHHHH or TTTTT.
- abramdemski 11 May 2026 15:20 UTC
  LW: 3 AF: 2
  I think my example was not well-chosen here, because this is confusing. Part of what I’m modeling here is sampling from the weak teacher’s model, in order to generate training data for the student. The coin-flips are independent in the same way that distinct samples from an LLM are independent.
  Imagine that 10% of the weak teacher’s samples are misaligned, while 90% are aligned. The teacher has this “aligned” hypothesis and “misaligned” hypothesis in its latent space, so we get some of one, some of the other. Intuitively, a “strong” student should learn to mimic this 90-10 behavior, but this requires the strong student to have a 90-10 hypothesis, not merely the “aligned” and “misaligned” hypotheses. Weak-to-strong generalization is a phenomenon in which the strong student just learns the “aligned” hypothesis instead, when exposed to the 90-10 data, because it fits best out of what the student can hypothesize.
  (However, this model probably undersells the importance of gradient descent to the phenomenon.)
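
The 90-10 example can be made concrete with a small Bayesian sketch. This is only an illustration under stated assumptions, not the model from the post: each hypothesis is reduced to a single number, the probability that an independent sample is aligned, and the "aligned" and "misaligned" hypotheses are given a small noise level `EPS` (a hypothetical parameter, needed so that neither assigns zero likelihood to the mixed data). It uses exact Bayesian updating, so it says nothing about the role of gradient descent.

```python
import math

def posterior(hypotheses, prior, n_aligned, n_misaligned):
    """Posterior over hypotheses, each given as P(a sample is aligned)."""
    log_post = {}
    for name, p in hypotheses.items():
        # Samples are independent, so the log-likelihood is a sum over samples.
        log_lik = n_aligned * math.log(p) + n_misaligned * math.log(1 - p)
        log_post[name] = math.log(prior[name]) + log_lik
    # Subtract the max before exponentiating for numerical stability.
    m = max(log_post.values())
    weights = {k: math.exp(v - m) for k, v in log_post.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

def predictive(hypotheses, post):
    """Posterior-predictive probability that the next sample is aligned."""
    return sum(post[k] * hypotheses[k] for k in hypotheses)

EPS = 0.01  # assumed noise level; not from the original discussion

# Teacher data: 90 aligned samples and 10 misaligned samples.
N_ALIGNED, N_MISALIGNED = 90, 10

# Student whose hypothesis space contains only the two "pure" hypotheses.
student = {"aligned": 1 - EPS, "misaligned": EPS}
student_prior = {k: 1 / len(student) for k in student}
post_student = posterior(student, student_prior, N_ALIGNED, N_MISALIGNED)
pred_student = predictive(student, post_student)

# Student that additionally has a 90-10 hypothesis.
richer = {"aligned": 1 - EPS, "misaligned": EPS, "90-10": 0.9}
richer_prior = {k: 1 / len(richer) for k in richer}
post_richer = posterior(richer, richer_prior, N_ALIGNED, N_MISALIGNED)
pred_richer = predictive(richer, post_richer)

print("two-hypothesis student:", post_student, pred_student)
print("student with 90-10 hypothesis:", post_richer, pred_richer)
```

Under these assumptions, the two-hypothesis student puts nearly all of its posterior on "aligned", because that hypothesis fits the 90-10 data far better than "misaligned", and so it produces aligned samples more often than the teacher does. The student that also has a 90-10 hypothesis concentrates on it instead and reproduces the teacher's mixture.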