Am I correctly understanding that the effect size shown in the graph is very small? It seems like the mean harmfulness score is not much higher for any of the evals, even if the difference is technically statistically significant.