FYI: this post was included in the Last Week in AI Podcast—nice work!
My team at Arcadia Impact is currently working on standardising evaluations.
I notice that you cite instance-level results (Burnell et al., 2023) as a recommendation. Is there anything else you think the open-source ecosystem should be doing?
For example, we’re thinking about:
Standardised implementations of evaluations (e.g. Inspect Evals)
Aggregating and reporting results for evaluations (see beta release of the Inspect Evals Dashboard)
Tools for automated prompt variation and elicitation*
*This statement especially had me thinking about this: “These include using correct formats and better ways to parse answers from responses, using recommended sampling temperatures, using the same max_output_tokens, and using few-shot prompting to improve format-following.”
Here’s the feedback I gave to Jay, which he encouraged me to add as a comment:
(this is mostly just me playing devil's advocate for the "wall" perspective)