Hey, thanks for checking. The Qwen2.5-MATH results are on the full MATH dataset, so they are not comparable here; the Spurious Rewards paper uses MATH500. The Hochlehnert et al. paper reports results on MATH500, which is why we took the numbers from there.
I do agree that ideally we should re-evaluate all models under the same, more reliable evaluation setup. However, to the best of our knowledge, the papers have not released open-weight checkpoints. The most transparent fix for all these issues is for papers to release sample-level outputs going forward, so it's easy for people to figure out what's going on.
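A sample-level dump can be very lightweight. Here is a minimal sketch of what that could look like; the file name, field names, and exact-match scoring rule below are illustrative assumptions on our part, not a schema any of these papers has released:

```python
import json

# Hypothetical sample-level record: one JSON object per MATH500 problem.
# Every field name here is an illustrative assumption, not a released schema.
record = {
    "problem_id": "math500/0042",
    "prompt": "Find x such that ...",
    "completion": "... so the answer is 3/4.",
    "extracted_answer": "3/4",
    "gold_answer": "3/4",
    "sampling": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 2048},
}

def rescore(path: str) -> float:
    """Recompute accuracy from a JSONL dump of sample-level outputs."""
    hits = total = 0
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            hits += r["extracted_answer"] == r["gold_answer"]
            total += 1
    return hits / total

# Tiny end-to-end check on a one-record dump.
with open("samples.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
print(rescore("samples.jsonl"))  # 1.0
```

With dumps like this, anyone can swap in their own answer extractor or matching rule and see exactly where two evaluations diverge.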
All this said, our main point in the end is only this: if changing inference hyperparameters can give higher accuracy, is RL really improving “reasoning abilities”?
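To make that concrete with a toy example: everything below is synthetic (the mock model is not any real checkpoint, and this is not any paper's setup), but it shows how the same "model" scored under different decoding settings reports different headline numbers:

```python
import random

# Purely synthetic stand-in for a model: deterministic at temperature 0,
# increasingly error-prone as temperature rises. It exists only to show
# that the reported accuracy moves with decoding settings alone.
def mock_generate(prompt: str, temperature: float) -> str:
    if temperature == 0.0 or random.random() > temperature * 0.3:
        return "correct"
    return "wrong"

def eval_accuracy(problems, temperature: float) -> float:
    hits = sum(mock_generate(p, temperature) == "correct" for p in problems)
    return hits / len(problems)

random.seed(0)
problems = [f"q{i}" for i in range(500)]  # stand-in for the 500 MATH500 prompts
for t in (0.0, 0.6, 1.0):
    print(f"temperature={t}: accuracy={eval_accuracy(problems, t):.3f}")
```

With real checkpoints the shift comes from things like temperature, top-p, and max generation length rather than this toy failure model, but the headline number moves the same way, which is why comparing accuracies across papers with different inference setups is fraught.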
Gotcha, thanks for clarifying. Yeah, that point seems reasonable. I initially read this as a takedown along the lines of “those other works are likely BS”; I now read it as “the other works provide insufficient evidence to be confident in their conclusions”. Does that seem right to you?
Yes, that is a good takeaway!
Just to add: quite a few other papers that report on MATH500, such as Absolute Zero and SimpleRL-Zoo, also show Qwen2.5-MATH 7B at ~64% accuracy:
From Absolute Zero (M500 column): 64.8
From SimpleRL-Zoo: 63.6
We reported numbers from Hochlehnert et al. as their paper was explicitly focused on reproducing model performance on various datasets.