Vladimir_Nesov comments on Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Vladimir_Nesov 6 May 2025 1:16 UTC
4 points
1

If you’re referring to the ARC-AGI results, it was just pass@1024

It sampled 1024 reasoning traces, but I don't expect it was scored as pass@1024. The appropriate thing to do there would be best-of-k or majority voting over the samples, not running the reasoning 1024 times and checking whether at least one result happens to be correct.
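To make the distinction concrete, here is a minimal sketch of the three ways of using k sampled answers. It is only an illustration: the function names, the toy samples, and the scorer are hypothetical, and nothing here is taken from how the ARC-AGI evaluation was actually run. The point is that pass@k needs the ground truth to decide success, while majority voting and best-of-k commit to a single answer without seeing it.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def pass_at_k(answers: Sequence[Hashable], correct: Hashable) -> bool:
    """pass@k: success if ANY of the k samples matches the ground truth.
    Requires knowing the correct answer, so it is an evaluation metric,
    not a way of producing a final answer."""
    return any(a == correct for a in answers)


def majority_vote(answers: Sequence[Hashable]) -> Hashable:
    """Majority voting: submit the most frequent answer among the samples.
    Does not look at the ground truth."""
    return Counter(answers).most_common(1)[0][0]


def best_of_k(answers: Sequence[Hashable],
              score: Callable[[Hashable], float]) -> Hashable:
    """Best-of-k: submit the sample that a scorer ranks highest.
    The scorer (a reward model or verifier, say) is an assumption here."""
    return max(answers, key=score)


# Toy example: six sampled answers, ground truth "B" (illustrative values only).
samples = ["A", "B", "A", "C", "A", "B"]
truth = "B"

# Hypothetical scorer standing in for a reward model.
toy_scores = {"A": 0.2, "B": 0.9, "C": 0.1}

passed = pass_at_k(samples, truth)             # at least one sample is "B"
voted = majority_vote(samples)                 # "A" is the most common answer
chosen = best_of_k(samples, toy_scores.get)    # scorer prefers "B"
```

In this toy case pass@k counts the problem as solved, while majority voting submits the wrong answer, so the same 1024 samples can give very different reported accuracy depending on which scheme is used.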