Thanks, that’s a good point I should’ve made clearer. To clarify, the blue data points represent the average result per model (so each shown data point represents many participant scores). There are then two reasons the line of best fit through these points may differ slightly from the drawn line, which is the linear regression slope: (I) not all models had exactly the same number of data points, so the regression weights the points differently, and (II) the regression slope also controls for the fact that some models randomly received a larger share of easier/harder task difficulties.
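As an aside, here’s a minimal sketch (with entirely made-up numbers, not your data) of why (I) matters: a regression run on all individual scores weights each model by its sample size, so it generally gives a different slope than a line fit to the per-model means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three models with unequal numbers of participant scores.
# x is some model-level predictor; the true underlying slope is 2.
model_x = np.array([0.0, 1.0, 2.0])
counts = np.array([5, 50, 5])  # middle model has 10x the data points

x_all, y_all = [], []
for mx, n in zip(model_x, counts):
    x_all.append(np.full(n, mx))
    y_all.append(2.0 * mx + rng.normal(0.0, 1.0, size=n))  # noisy scores
x_all = np.concatenate(x_all)
y_all = np.concatenate(y_all)

# Regression on all individual scores (models weighted by sample size):
slope_all = np.polyfit(x_all, y_all, 1)[0]

# Line of best fit through the per-model means (each model weighted equally):
means = np.array([y_all[x_all == mx].mean() for mx in model_x])
slope_means = np.polyfit(model_x, means, 1)[0]

print(slope_all, slope_means)  # close, but not identical
```

Both estimates hover around the true slope, but they disagree exactly to the extent that sample sizes (and noise) are uneven across models.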
One way to help clarify the effect from (I) would be to add error bars to the individual data points. Presumably models with fewer data points would have wider error bars, and then it would make sense that they pull less hard on the regression.
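Concretely (again with made-up numbers), the standard error of each per-model mean scales as 1/√n, so models with fewer participant scores naturally get wider bars — and a weighted fit could down-weight them accordingly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-model participant scores with unequal sample sizes.
scores_by_model = {
    "model_a": rng.normal(0.6, 0.2, size=200),
    "model_b": rng.normal(0.5, 0.2, size=20),  # 10x fewer scores
}

# Standard error of the mean: wider error bars for smaller samples.
for name, scores in scores_by_model.items():
    sem = scores.std(ddof=1) / np.sqrt(len(scores))
    print(f"{name}: mean={scores.mean():.3f} ± {sem:.3f}")

# A weighted least-squares fit on the per-model means could then use
# weights of 1/sem**2, so wide-error-bar points pull less on the slope.
```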
Error bars would also be generally great for understanding how much weight to give the results. In cases where you get a low p-value, I have some sense of this. But in cases like Figure 13, where it’s a null result, it’s hard to tell whether that’s strong evidence of an absence of effect, or whether there’s just too little data and not much evidence either way.
Thanks, that makes sense.