FYI: this post was included in the Last Week in AI Podcast—nice work!
My team at Arcadia Impact is currently working on standardising evaluations.
I noticed that you cite reporting instance-level results (Burnell et al., 2023) as a recommendation. Is there anything else you think the open-source ecosystem should be doing?
For example, we’re thinking about:
Standardised implementations of evaluations (i.e. Inspect Evals)
Aggregating and reporting results for evaluations (see beta release of the Inspect Evals Dashboard)
Tools for automated prompt variation and elicitation*
*This statement in particular got me thinking about it: “These include using correct formats and better ways to parse answers from responses, using recommended sampling temperatures, using the same max_output_tokens, and using few-shot prompting to improve format-following.” A rough sketch of what pinning those settings down might look like is below.
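As a minimal sketch (hypothetical names, not the Inspect Evals API), here is roughly what I imagine "pinning down elicitation" could mean in practice: fixing the sampling temperature, max_output_tokens, and few-shot examples per evaluation, and parsing answers more robustly than grabbing the first character.

```python
# Hypothetical sketch: hold elicitation settings constant across models and
# parse answers leniently, per the quoted recommendations.
import re
from dataclasses import dataclass, field


@dataclass
class GenerationConfig:
    temperature: float = 0.0        # recommended sampling temperature
    max_output_tokens: int = 1024   # same max_output_tokens for every model
    few_shot_examples: list[str] = field(default_factory=list)  # to improve format-following


def build_prompt(question: str, config: GenerationConfig) -> str:
    """Prepend few-shot examples and a fixed answer-format instruction."""
    shots = "\n\n".join(config.few_shot_examples)
    instruction = 'End your reply with "Answer: <letter>".'
    return f"{shots}\n\n{question}\n{instruction}".strip()


def parse_choice(response: str) -> str | None:
    """Parse an MCQ letter from anywhere in the response, not just the first token."""
    match = re.search(r"Answer:\s*\(?([A-D])\)?", response, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None
```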
Thanks, these are some great ideas. Another thing you might want to look into is shifting away from MCQs towards answer-matching evaluations: https://www.lesswrong.com/posts/Qss7pWyPwCaxa3CvG/new-paper-it-is-time-to-move-on-from-mcqs-for-llm
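Rough sketch of the difference (hypothetical helper names): instead of scoring a multiple-choice letter, the model answers free-form and a matcher compares it to the reference answer. Here that matcher is a simple normalised string comparison; the linked paper argues for an LLM-based matcher to handle paraphrases.

```python
# Hypothetical answer-matching scorer: free-form answer vs. reference answer.
import re


def normalise(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def answers_match(model_answer: str, reference_answer: str) -> bool:
    """Exact match after normalisation; an LLM judge could replace this for paraphrases."""
    return normalise(model_answer) == normalise(reference_answer)


# answers_match("The capital is Paris.", "paris") -> False under exact matching,
# which is why the paper recommends an LLM matcher for free-form responses.
```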