Yep. For empirical work I’m in favor of experiments with more informed + well-trained human judges who engage deeply etc, and having a high standard for efficacy (e.g. “did it get the correct answer with very high reliability”) as opposed to “did it outperform a baseline by a statistically significant margin” where you then end up needing high n and therefore each example needs to be cheap / shallow
I would love the two of you (Beth and @Jacob Pfau) to talk about this in detail, if you’re up for it! Getting the experimental design right is key if we want to get more human participant experiments going and learn from them. The specific point of “have a high standard for efficacy” was something I was emphasising to Jacob a few weeks ago as having distinguished your experiments from some of the follow-ons.
Yep, happy to chat!