This idea appears very similar to the paper “Reinforcement Learning Teachers of Test Time Scaling”: https://arxiv.org/abs/2506.08388
just read through it, i agree there are clear similarities in the data generation and preparation process, but how it's then actually used to train the model seems quite different imo. they employ RL, specifically GRPO, which I want to avoid at all costs. in particular, because they don't use any kind of logit aggregation, they run into this issue of "impossible knowledge", or in other words a terrible teacher. they mitigate this with their KL and log-prob reward function, but since at that point they're already working with sampled traces, they've lost the rich logit representation.
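to make the logit-vs-trace distinction concrete, here's a minimal sketch (PyTorch-style; the function names, tensor shapes, and the second "reward-like" quantity are my own illustrative assumptions, not anything from the paper): full-distribution KL at every token position versus the single scalar per sampled trace that a GRPO-style signal has to work with.

```python
import torch
import torch.nn.functional as F

def logit_level_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Per-token KL(teacher || student) over the full vocabulary.

    Shapes assumed: [batch, seq_len, vocab_size]. This keeps the teacher's
    entire next-token distribution at every position, i.e. the "rich logit
    representation" that is gone once only sampled trace text is kept.
    """
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div expects the approximating distribution (student) as `input`
    # in log space and the reference (teacher) as `target`; log_target=True
    # because teacher log-probs are passed rather than probs.
    return F.kl_div(s_log_probs, t_log_probs, log_target=True,
                    reduction="batchmean") * (temperature ** 2)

def trace_level_signal(trace_token_log_probs):
    """Once only a sampled trace is kept, the per-trace information collapses
    to something like the summed log-probability of that one sequence: a
    single scalar per trace, the kind of quantity a GRPO-style reward works
    with (illustrative only, not the paper's exact reward).

    trace_token_log_probs: [batch, seq_len] log p(token_t | prefix).
    """
    return trace_token_log_probs.sum(dim=-1)  # one scalar per trace
```

the point being: once you only keep the sampled text, something like the second function is all the per-trace information that remains, while the first has the whole distribution at every step.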
also, they choose teacher-forcing over student-forcing. teacher-forcing is definitely more popular in the literature, but I think works like R1 showed that staying on-policy is really key. here even more so: the teacher has what we could call a "teaching bias", since it has already seen the solution, and if that bias is allowed to carry over throughout generation, the traces will be terrible and the reward function won't particularly fix that.
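for clarity, a rough sketch of the two rollout styles (assumptions on my part: a Hugging Face style causal LM with a `.logits` output, and these function names are hypothetical, not the paper's code): teacher-forcing always conditions on the reference continuation, student-forcing samples from the model itself and feeds the samples back in, staying on-policy.

```python
import torch

def teacher_forced_rollout(model, prompt_ids, reference_ids):
    """Teacher-forcing: every step conditions on the reference continuation,
    so the "teaching bias" (the model has effectively already seen the
    solution) is baked into every prefix it scores.
    """
    input_ids = torch.cat([prompt_ids, reference_ids], dim=-1)
    # `.logits` assumes a Hugging Face style causal LM output; scoring is
    # done against reference_ids, never against the model's own samples.
    return model(input_ids).logits

@torch.no_grad()
def student_forced_rollout(model, prompt_ids, max_new_tokens=256):
    """Student-forcing (on-policy): each next token is sampled from the
    model itself and fed back in, so the trace stays on the model's own
    distribution, which is the property R1-style training leans on.
    """
    input_ids = prompt_ids
    for _ in range(max_new_tokens):
        next_logits = model(input_ids).logits[:, -1, :]
        next_token = torch.multinomial(torch.softmax(next_logits, dim=-1),
                                       num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```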
still definitely nice work, thanks for the link, decent read.