Yup, I expected that OP would generally agree with my comment.
“First off, you just posted them online”
They only posted three questions, out of at least 62 (= 1/(0.2258 - 0.2097)), perhaps many more than 62. For all I know, they removed those three from the pool when they shared them. That’s what I would do—probably some human will publicly post the answers soon enough. I dunno. But even if they didn’t remove those three questions from the pool, it’s a small fraction of the total.
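Spelling out that lower bound: assuming the two reported scores, 0.2258 and 0.2097, come from the same pool of N equally weighted questions, any nonzero difference between two scores must be at least 1/N, so

$$N \ge \frac{1}{0.2258 - 0.2097} = \frac{1}{0.0161} \approx 62.1,$$

i.e. at least ~62 questions (the reported scores are presumably rounded, so I wouldn’t read any more precision into it than that).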
You point out that all the questions would end up in LLM companies’ user data after Kagi has run the benchmark once (unless Kagi changes out all their questions each time, which I don’t think they do, although they do periodically replace easier questions with harder ones). Well:
- If an LLM company is training on user data, they’ll get the questions without the answers, which probably wouldn’t make any appreciable difference to the LLM’s ability to answer them;
- If an LLM company is sending user data to humans as part of RLHF or SFT or whatever, then yes, there’s a chance for ground-truth answers to sneak in that way—but that’s extremely unlikely to happen, because companies can only afford to send an extraordinarily small fraction of user data to actual humans.