It sounds like they don’t filter out the canary string from general web text even though it would be cheap and filter out a really tiny proportion of the training data. Maybe that tiny proportion yields a disproportionate boost in performance. Hmm, wonder why that could be.
Well, of course it yields a disproportionate boost: it contains information about the benchmarks, even if it doesn’t contain the exact questions. Unless that was the intended subtext and you were being sarcastic?
Yep, sarcastic. Sorry, someday I’ll learn not to do that on the Internet, but I’m not holding my breath.