Are we sure that these questions aren’t in their datasets? I don’t think we can be. First off, you just posted them online.
Questions being online is not a bad thing. Pretraining on the datapoints is very useful, and does not introduce any bias; it is free performance, and everyone should be training models on the questions/datapoints before running the benchmarks (though they aren’t). After all, when a real-world user asks you a new question (regardless of whether anyone knows the answer/label!), you can… still train on the new question then and there just like when you did the benchmark. So it’s good to do so.
It’s the answers or labels being online which is the bad thing. But Byrnes’s comment and the linked Kagi page do not contain the answers to those 3 questions, as far as I can see.
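To make the distinction concrete, here is a minimal sketch (not anyone’s actual training pipeline; the model name and example question are just illustrative assumptions): “training on the question” is an ordinary language-modelling step on the prompt alone, which you could equally do for a brand-new user query at inference time, while contamination only enters when the gradient step also sees the answer tokens.

```python
# Hypothetical sketch of the question-vs-answer distinction; "gpt2" and the
# example question are placeholders, not the benchmark being discussed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

question = "Q: What is the smallest prime greater than 100?\nA:"
answer = " 101"

# "Training on the question": a plain LM step on the prompt text only.
q_ids = tok(question, return_tensors="pt").input_ids
loss = model(q_ids, labels=q_ids).loss
loss.backward()
optim.step()
optim.zero_grad()

# "Training on the answer" would mean also computing loss on the label tokens;
# that is the step that actually leaks the benchmark.
full_ids = tok(question + answer, return_tensors="pt").input_ids
labels = full_ids.clone()
labels[:, : q_ids.shape[1]] = -100  # loss restricted to the answer span
leaky_loss = model(full_ids, labels=labels).loss  # skipping this step keeps the label out of the weights
```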
I expect it matters to the extent we care about whether the generalization to the new question is taking place in the expensive pretraining phase, or in the active in-context phase.
Sure, fair point! But generally, people gossiping online about missed benchmark questions, and then likely spoiling the answers, means that a question is now ~ruined for all future training runs. How much of these modest benchmark improvements over time can be attributed to this?
The fact that frontier AIs can basically see and regurgitate everything ever written on the entire internet is hard to fathom!
I could be really petty here and spoil these answers for all future training runs (and make all future models look modestly better), but I just joined this site so I’ll resist lmao …