I did a spot check on your claims about the Qwen2.5-Math results on MATH in the Spurious Rewards paper. They claim 49% while you claim 64.3%. As I understand it, you’re taking your number from Hochlehnert et al. But the original Qwen2.5-Math paper only seems to give a base result of 55.4 with 4-shot CoT, so I find 49% totally believable. Though they get 80%+ with the instruct model.
Is your claim that the original Qwen paper screwed up evals too? Or have I misread something?
More generally, it seems like the key thing to check to support your claims is whether applying your favourite eval method to both the base model and the model weights produced by these papers yields such a diff. Unless I missed this part, your results seem equally consistent with a better eval harness adding a flat 15% to everything, or with it only doing so before spurious training.
Though I do agree that, given that the performance delta from base to instruct is so much larger, there is a lot of room for spurious explanations.
Hey, thanks for checking. The Qwen2.5-Math paper’s results are on the full MATH dataset, so they are not comparable here; the Spurious Rewards paper uses MATH500. The Hochlehnert et al. paper reports results on MATH500, which is why we took the number from there.
I do agree that ideally we should re-evaluate all models under the same, more reliable evaluation setup. However, to the best of our knowledge, these papers have not released open-weight checkpoints. The most transparent way to fix all these issues is for papers to release sample-level outputs going forward, so it’s easy for people to figure out what’s going on.
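Since we keep asking for sample-level outputs, here is a minimal sketch of what a per-problem dump could look like. This is our own illustration, not a format any of these papers actually uses; all field names and the dump_samples helper are assumptions.

```python
import json

def dump_samples(path, records):
    """Write one JSON record per evaluated problem (illustrative format only)."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

# Hypothetical record: raw completion plus the exact decoding settings, so anyone
# can re-grade the outputs with their own answer extractor.
example = {
    "benchmark": "MATH500",
    "problem_id": "algebra_123",
    "prompt": "...",                                   # exact prompt sent to the model
    "sampling": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 3072},
    "completion": "...",                               # untruncated model output
    "extracted_answer": "42",
    "gold_answer": "42",
    "correct": True,
}
dump_samples("qwen2.5-math-7b_math500_samples.jsonl", [example])
```

With raw completions and the decoding settings on record, a disagreement like 49% vs ~64% can be traced to prompting, answer extraction, or decoding rather than argued from aggregate numbers alone.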
All this said, in the end, our main point is only: If changing inference hyperparameters can give higher accuracy, is RL improving “reasoning abilities”?
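To make that concrete, here is a minimal sketch of what we mean by inference hyperparameters moving measured accuracy: the same checkpoint scored under two decoding setups. This is not anyone’s actual harness; the toy problems, the string-match grader, and the specific temperature/length settings are illustrative assumptions (it assumes vLLM is installed).

```python
from vllm import LLM, SamplingParams

# Toy stand-ins for a real MATH500 loader and answer extractor.
PROBLEMS = [
    ("What is 3 + 4? Put your final answer after 'Answer:'.", "7"),
    ("What is 6 * 7? Put your final answer after 'Answer:'.", "42"),
]

def is_correct(completion: str, gold: str) -> bool:
    # Crude grader: gold answer appears after the last 'Answer:' marker.
    return gold in completion.rsplit("Answer:", 1)[-1]

def accuracy(llm: LLM, params: SamplingParams) -> float:
    prompts = [p for p, _ in PROBLEMS]
    texts = [o.outputs[0].text for o in llm.generate(prompts, params)]
    return sum(is_correct(t, gold) for t, (_, gold) in zip(texts, PROBLEMS)) / len(PROBLEMS)

llm = LLM(model="Qwen/Qwen2.5-Math-7B")  # the base model discussed above

greedy_long   = SamplingParams(temperature=0.0, max_tokens=2048)
sampled_short = SamplingParams(temperature=1.0, top_p=0.9, max_tokens=256)

print("greedy, long budget:  ", accuracy(llm, greedy_long))
print("sampled, short budget:", accuracy(llm, sampled_short))
```

The same weights can score quite differently under the two settings, which is why comparisons across papers need the decoding configuration pinned down.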
Gotcha, thanks for clarifying. Yeah that point seems reasonable. I initially thought this was a takedown re “those other works are likely BS”, I now think “the other works provide insufficient evidence to be confident in their conclusions”, does that seem right to you?
Yes, that is a good takeaway!
Just to add, quite a few other papers that report on MATH500, such as Absolute Zero and SimpleRL-Zoo, also show Qwen2.5-Math-7B at ~64% accuracy:
From Absolute Zero (MATH500 column of their results table): 64.8
From SimpleRL-Zoo: 63.6
We reported numbers from Hochlehnert et al. as their paper was explicitly focused on reproducing model performance on various datasets.