such an assumption is not strong enough to guarantee that $P_s$ has a 50-50 coinflip hypothesis.
What do you mean by 50-50 hypothesis? Is it $E_{1/2}$ such that $\forall n: P(T^h_n \mid E_{1/2}) = P(T^t_n \mid E_{1/2}) = 1/2$? If that’s the case, it doesn’t seem fair to ask a student to have such a hypothesis: the task is to learn to imitate the teacher, and the teacher doesn’t actually perform behavior like HTHHT; it can perform either HHHHH or TTTTTT.
I think my example was not well-chosen here, because this is confusing. Part of what I’m modeling here is sampling from the weak teacher’s model, in order to generate training data for the student. The coin-flips are independent in the same way that distinct samples from an LLM are independent.
Imagine that 10% of the weak teacher’s samples are misaligned, while 90% are aligned. The teacher has this “aligned” hypothesis and “misaligned” hypothesis in its latent space, so we get some of one, some of the other. Intuitively, a “strong” student should learn to mimic this 90-10 behavior, but this requires the strong student to have a 90-10 hypothesis, not merely the “aligned” and “misaligned” hypotheses. Weak-to-strong generalization is a phenomenon in which the strong student just learns the “aligned” hypothesis instead, when exposed to the 90-10 data, because it fits best out of what the student can hypothesize.
(However, this model probably undersells the importance of gradient descent to the phenomenon.)
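To make the 90-10 point concrete, here is a toy Bayesian sketch. It is only an illustration under assumptions that are mine, not part of the original setup: samples are i.i.d., the prior over hypotheses is uniform, each hypothesis is just a fixed probability of producing an aligned sample, and the "aligned" and "misaligned" hypotheses carry a small noise rate `EPS` so that a single off-sample does not rule them out entirely. The names `posterior`, `EPS`, and the hypothesis labels are hypothetical.

```python
import math

EPS = 0.01  # assumed noise rate; keeps likelihoods nonzero for both pure hypotheses


def log_likelihood(p_aligned, n_aligned, n_misaligned):
    """Log-probability of the observed counts under an i.i.d. hypothesis."""
    return n_aligned * math.log(p_aligned) + n_misaligned * math.log(1.0 - p_aligned)


def posterior(hypotheses, n_aligned, n_misaligned):
    """Posterior over hypotheses, assuming a uniform prior.

    `hypotheses` maps a label to that hypothesis's probability of an aligned sample.
    """
    logs = {
        name: log_likelihood(p, n_aligned, n_misaligned)
        for name, p in hypotheses.items()
    }
    top = max(logs.values())  # subtract the max for numerical stability
    weights = {name: math.exp(value - top) for name, value in logs.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Training data drawn from the weak teacher: 90% aligned, 10% misaligned.
n_aligned, n_misaligned = 900, 100

# A student that can only represent the two pure hypotheses.
without_mixture = {"aligned": 1.0 - EPS, "misaligned": EPS}

# A student that can also represent the 90-10 mixture itself.
with_mixture = {"aligned": 1.0 - EPS, "misaligned": EPS, "90-10": 0.9}

print(posterior(without_mixture, n_aligned, n_misaligned))
print(posterior(with_mixture, n_aligned, n_misaligned))
```

In this sketch, the student without a 90-10 hypothesis ends up with essentially all of its posterior on "aligned", since that is the best fit among the options it has. Once the 90-10 hypothesis is available, it takes essentially all of the posterior instead, and the student mimics the teacher's mixture.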