They test the perplexity (below: Figure 6, left) of the RLed model’s generations (pink bar), as scored by the base model conditioned on the task prompt, relative to the perplexity of the base model’s own generations (turquoise bar). They find it is lower, which “suggests that the responses from RL-trained models are highly likely to be generated by the base model” conditioned on the task prompt. (Perplexity higher than the base model’s would imply the RLed model had either new capabilities or higher diversity than the base model.)
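For concreteness, here is a minimal sketch of what I understand that measurement to be. This is my reconstruction, not the paper’s code: the model name is a placeholder, `response_perplexity` is a hypothetical helper, and tokenization boundary effects between prompt and response are ignored for simplicity. The idea is to score each response token under the base model given the prompt plus the preceding response tokens, then exponentiate the mean loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # placeholder base model, not necessarily the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def response_perplexity(prompt: str, response: str) -> float:
    """Perplexity of `response` under the base model, conditioned on `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Score only the response tokens; mask out the prompt positions.
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    # Shift so each token is predicted from the preceding context.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    loss = torch.nn.functional.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
    return torch.exp(loss).item()
```

Comparing the average of this quantity over RL-model responses against the average over base-model responses is, as far as I can tell, the comparison behind the pink and turquoise bars.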
Does “lower-than-base-model perplexity” suggest “likely to be generated by the base model conditioned on the task prompt”? Naively, I would expect that lower perplexity according to the base model just means less information per response token, which could happen if the RL-trained model took more words to say the same thing. For example, if the reasoning models have a tendency to restate substrings of the original question verbatim, the base model’s per-token surprisal on those substrings will be very close to zero, and so the average perplexity of the RL’d model’s outputs would be expected to be lower than the base model’s outputs even if the load-bearing outputs of the RL’d model contained surprising-to-the-base-model reasoning.
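To make the dilution worry concrete, here is a toy numeric example. The surprisal values are made up for illustration, not taken from the paper: the point is only that padding a response with verbatim-copied, near-zero-surprisal tokens drags the per-token perplexity way down even though the “load-bearing” tokens are exactly as surprising to the base model as before.

```python
import math

# Hypothetical per-token surprisals (-log p under the base model), in nats.
# Suppose the load-bearing reasoning tokens are equally surprising in both responses.
reasoning_surprisals = [4.0, 5.5, 3.8, 6.2, 4.5]

# A concise response: just the reasoning tokens.
concise = reasoning_surprisals

# A verbose response: restates 20 tokens of the question verbatim first.
# Copied tokens are near-certain under the base model, so surprisal ~0.05.
verbose = [0.05] * 20 + reasoning_surprisals

def perplexity(surprisals):
    """Per-token perplexity = exp(mean surprisal)."""
    return math.exp(sum(surprisals) / len(surprisals))

print(f"concise response perplexity: {perplexity(concise):.2f}")  # ~121.5
print(f"verbose response perplexity: {perplexity(verbose):.2f}")  # ~2.7
```

Same reasoning tokens, wildly different per-token perplexity, purely because of how much low-surprisal filler surrounds them.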
Still, this is research out of Tsinghua and I am a hobbyist, so I’m probably misunderstanding something.