I think that’s probably a mistake: the sentence you quoted seems to be a hypothetical, and the actual experimental results do seem to point against the effectiveness of current RL (?).
I am not confident, though. It’s certainly true that if RL can increase the probability of a behavior/ability enough, it is not necessarily helpful to frame it as having already been in the base model’s distribution “for practical purposes.” I would have to look into this more carefully to judge whether the paper actually does a convincing job of demonstrating that this is a good frame.