Once upon a time I worked on language models, and we trained on data that was correctly split from tuning data that was correctly split from test data.
And then we sent our results to the QA team, who had their own data, and if our results were not good enough, we tried again. “Good enough” meant “enough lift over previous benchmarks”. So back and forth we went until QA reported success. On their dataset. Their unchanging test dataset.
But clearly, since we correctly split all of our data, and since we could not see the contents of QA’s test dataset, no leakage could be occurring.
But of course you were engaged in meta-overfitting by the constant attack on the test dataset… How did you wind up detecting the leakage? Bad results when deployed to the real world?
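To make “meta-overfitting” concrete: even though nobody ever reads QA’s test set, every accept/reject verdict leaks a little information about it back into model selection. A minimal sketch of that dynamic, with entirely hypothetical sizes and a deliberately worthless “model” family (pure random guessing), showing how repeated selection against one unchanging test set manufactures apparent lift:

```python
# A minimal sketch (hypothetical data sizes, deliberately worthless
# "models") of how repeatedly selecting against one fixed test set
# manufactures apparent lift. Nothing here is the team's real pipeline.
import numpy as np

rng = np.random.default_rng(0)
n_test = 500      # size of QA's unchanging test set (assumed)
n_rounds = 200    # number of "try again" submissions (assumed)

# Labels are a fair coin flip, so no model can truly beat 50% accuracy.
qa_labels = rng.integers(0, 2, n_test)

best_score, best_preds = 0.0, None
for _ in range(n_rounds):
    # Each "retrained model" is pure noise: random guesses on QA's set.
    preds = rng.integers(0, 2, n_test)
    score = (preds == qa_labels).mean()
    if score > best_score:  # keep whichever run QA liked best
        best_score, best_preds = score, preds

# Fresh labels stand in for real-world deployment data; because the
# "model" is noise, scoring its old predictions against new labels
# approximates its true accuracy.
fresh_labels = rng.integers(0, 2, n_test)
print(f"best score on QA's fixed set: {best_score:.3f}")  # ≈ 0.56
print(f"same model on fresh data: {(best_preds == fresh_labels).mean():.3f}")  # ≈ 0.50
```

On fresh data the “winning” model is back at chance, and that gap never shows up as long as QA’s dataset stays fixed.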
Not to toot my own horn,* but we detected it when I was given the project of adapting some of our visualizations to accept QA’s data format, so they could look at their results using those visualizations. That was when I went, ”… so how does QA work here, exactly? Like, what’s the process?”
I do not know the real-world impact of fixing the overfitting.
*tooting one’s own horn always follows this phrase