Hey, thanks for checking. The Qwen2.5-MATH results are on the full MATH dataset, so they are not comparable here; the Spurious Rewards paper uses MATH500. The Hochlehnert et al. paper reports results on MATH500, which is why we took the numbers from there.
I do agree that ideally we should re-evaluate all models under the same, more reliable evaluation setup. However, to the best of our knowledge, the papers have not released open-weight checkpoints. The most transparent fix for all these issues is for papers to release sample-level outputs going forward, so it's easy for people to figure out what's going on.
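A sample-level dump can be very lightweight. Here is a minimal sketch of what that could look like; the file name, field names, and exact-match scoring rule below are illustrative assumptions on our part, not a schema any of these papers has released:

```python
import json

# Hypothetical sample-level record: one JSON object per MATH500 problem.
# Every field name here is an illustrative assumption, not a released schema.
record = {
    "problem_id": "math500/0042",
    "prompt": "Find x such that ...",
    "completion": "... so the answer is 3/4.",
    "extracted_answer": "3/4",
    "gold_answer": "3/4",
    "sampling": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 2048},
}

def rescore(path: str) -> float:
    """Recompute accuracy from a JSONL dump of sample-level outputs."""
    hits = total = 0
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            hits += r["extracted_answer"] == r["gold_answer"]
            total += 1
    return hits / total

# Tiny end-to-end check on a one-record dump.
with open("samples.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
print(rescore("samples.jsonl"))  # 1.0
```

With dumps like this, anyone can swap in their own answer extractor or matching rule and see exactly where two evaluations diverge.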
All this said, our main point in the end is only this: if changing inference hyperparameters can give higher accuracy, is RL really improving “reasoning abilities”?
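To make that concrete with a toy example: everything below is synthetic (the mock model is not any real checkpoint, and this is not any paper's setup), but it shows how the same "model" scored under different decoding settings reports different headline numbers:

```python
import random

# Purely synthetic stand-in for a model: deterministic at temperature 0,
# increasingly error-prone as temperature rises. It exists only to show
# that the reported accuracy moves with decoding settings alone.
def mock_generate(prompt: str, temperature: float) -> str:
    if temperature == 0.0 or random.random() > temperature * 0.3:
        return "correct"
    return "wrong"

def eval_accuracy(problems, temperature: float) -> float:
    hits = sum(mock_generate(p, temperature) == "correct" for p in problems)
    return hits / len(problems)

random.seed(0)
problems = [f"q{i}" for i in range(500)]  # stand-in for the 500 MATH500 prompts
for t in (0.0, 0.6, 1.0):
    print(f"temperature={t}: accuracy={eval_accuracy(problems, t):.3f}")
```

With real checkpoints the shift comes from things like temperature, top-p, and max generation length rather than this toy failure model, but the headline number moves the same way, which is why comparing accuracies across papers with different inference setups is fraught.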
Gotcha, thanks for clarifying. Yeah, that point seems reasonable. I initially read this as a takedown along the lines of “those other works are likely BS”; I now read it as “the other works provide insufficient evidence to be confident in their conclusions”. Does that seem right to you?
Yes, that is a good takeaway!
Just to add: quite a few other papers that report on MATH500, such as Absolute Zero and SimpleRL-Zoo, also show Qwen2.5-MATH 7B at ~64% accuracy:
From Absolute Zero (M500 column): 64.8
From SimpleRL-Zoo: 63.6
We reported numbers from Hochlehnert et al. as their paper was explicitly focused on reproducing model performance on various datasets.