Yes, this is point #1 from my recent Quick Take. Another interesting point is that the paper gives no confidence intervals on the accuracy numbers. It looks like they ran each question only once per model, so we don’t know how much of the difference between the accuracy numbers could be due to random variation. [Note added 2-3-25: I’m not sure why they didn’t make it into the paper, but Scale AI does report confidence intervals on their website.]
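To give a rough sense of how wide such an interval could be, here is a minimal sketch of a Wilson score interval for a single accuracy number. The function name `wilson_interval` and the counts below are hypothetical, chosen only for illustration; they are not taken from the paper, and I am not claiming this is how Scale AI computes the intervals on their website.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly a 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical counts for illustration only (not from the paper).
low, high = wilson_interval(correct=90, total=1000)
print(f"accuracy 0.090, approx. 95% interval: {low:.3f} to {high:.3f}")
```

Note that this only captures uncertainty from having a finite set of questions. It treats each question as a single pass/fail trial, so it says nothing about run-to-run variation within a model; estimating that would take repeated runs of the same questions.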