Another reason this could be misleading is that sampling from the model pre-mode-collapse may perform better at high k even if RL did teach the model new capabilities. In general, taking a large number of samples from a slightly worse model before mode collapse often outperforms taking a large number of samples from a better model after mode collapse, even if the RL model is genuinely more powerful (in the sense that its best reasoning pathways are better than the original model's best). In other words, the RL model can be much smarter and still perform worse on these high-k evaluations purely because of diversity loss. Conversely, if you're trying to train a model to accomplish a difficult, novel reasoning task, it may still be much better to start from the smarter RL model than from the more diverse original model.
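The effect is easy to see in a toy pass@k calculation. The numbers below are illustrative assumptions, not measurements from real models: the diverse base model solves each problem with a low per-sample probability but its samples are independent, while the mode-collapsed RL model has a higher per-sample success rate but produces (near-)identical samples, so extra draws add nothing.

```python
def pass_at_k_diverse(p: float, k: int) -> float:
    """Pass@k when the k samples are independent draws (pre-collapse model)."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k_collapsed(p: float, k: int) -> float:
    """Pass@k when mode collapse makes every sample essentially identical."""
    return p  # k copies of the same answer are no better than one

# Assumed per-sample success rates (hypothetical, for illustration only)
p_diverse, p_rl = 0.05, 0.30

for k in (1, 8, 64, 256):
    print(f"k={k:>3}  diverse={pass_at_k_diverse(p_diverse, k):.3f}  "
          f"collapsed={pass_at_k_collapsed(p_rl, k):.3f}")
```

At k=1 the RL model looks far stronger (0.30 vs 0.05), but by k=64 the diverse model's pass@k exceeds 0.96 while the collapsed model is stuck at 0.30, even though the RL model is "better" sample-for-sample.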