FYI: this post was included in the Last Week in AI Podcast—nice work!
My team at Arcadia Impact is currently working on standardising evaluations.
I noticed that you cite reporting instance-level results (Burnell et al., 2023) as a recommendation. Is there anything else you think the open-source ecosystem should be doing?
For example, we’re thinking about:
Standardised implementations of evaluations (i.e. Inspect Evals)
Aggregating and reporting results for evaluations (see beta release of the Inspect Evals Dashboard)
Tools for automated prompt variation and elicitation*
*This statement in particular got me thinking about it: “These include using correct formats and better ways to parse answers from responses, using recommended sampling temperatures, using the same max_output_tokens, and using few-shot prompting to improve format-following.” A rough sketch of what pinning those settings down might look like is below.
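As a minimal sketch (hypothetical names, not the Inspect Evals API), here is roughly what I imagine "pinning down elicitation" could mean in practice: fixing the sampling temperature, max_output_tokens, and few-shot examples per evaluation, and parsing answers more robustly than grabbing the first character.

```python
# Hypothetical sketch: hold elicitation settings constant across models and
# parse answers leniently, per the quoted recommendations.
import re
from dataclasses import dataclass, field


@dataclass
class GenerationConfig:
    temperature: float = 0.0        # recommended sampling temperature
    max_output_tokens: int = 1024   # same max_output_tokens for every model
    few_shot_examples: list[str] = field(default_factory=list)  # to improve format-following


def build_prompt(question: str, config: GenerationConfig) -> str:
    """Prepend few-shot examples and a fixed answer-format instruction."""
    shots = "\n\n".join(config.few_shot_examples)
    instruction = 'End your reply with "Answer: <letter>".'
    return f"{shots}\n\n{question}\n{instruction}".strip()


def parse_choice(response: str) -> str | None:
    """Parse an MCQ letter from anywhere in the response, not just the first token."""
    match = re.search(r"Answer:\s*\(?([A-D])\)?", response, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None
```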
Thanks, these are some great ideas. Another thing you might want to look into is shifting away from MCQs towards answer-matching evaluations: https://www.lesswrong.com/posts/Qss7pWyPwCaxa3CvG/new-paper-it-is-time-to-move-on-from-mcqs-for-llm
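Rough sketch of the difference (hypothetical helper names): instead of scoring a multiple-choice letter, the model answers free-form and a matcher compares it to the reference answer. Here that matcher is a simple normalised string comparison; the linked paper argues for an LLM-based matcher to handle paraphrases.

```python
# Hypothetical answer-matching scorer: free-form answer vs. reference answer.
import re


def normalise(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def answers_match(model_answer: str, reference_answer: str) -> bool:
    """Exact match after normalisation; an LLM judge could replace this for paraphrases."""
    return normalise(model_answer) == normalise(reference_answer)


# answers_match("The capital is Paris.", "paris") -> False under exact matching,
# which is why the paper recommends an LLM matcher for free-form responses.
```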