They don’t need to use this kind of subterfuge; they can just directly hire people to do that. Hiring experts to design benchmark questions is standard practice; this would be no different.
Yeah, my comment was mostly being silly. The grain of validity, I think, is that you probably get a much wider, weirder set of testing by inviting in a larger and more diverse set of people. And for something like ‘finding examples of strange failure cases that you yourself wouldn’t have thought of’, I think diversity of testers matters quite a bit.
The current FrontierMath fracas is a case in point. Did OpenAI have to keep its sponsorship and privileged access secret? No. Surely there was some amount of money that would have paid mathematicians to make hard problems openly, and that amount was probably not much different from what they did pay Epoch AI. Did the secrecy make life easier? Given the number of mathematician-participants saying they would’ve had second thoughts about participating had they known OA was involved, almost surely.