Re 3: Yup, this seems like a plausibly important training improvement. FWIW, when training GPT-3, OpenAI did filter Common Crawl using a classifier trained to recognise high-quality data (with Wikipedia, WebText, and some books as positive examples), but unfortunately they don’t say how big a difference it made.
I’ve been assuming (without much thought) that doing this better could make training up to ~10x cheaper, but probably not a lot more than that. I’d be curious whether this sounds right to you, or whether you think it could make a substantially bigger difference.
10x seems reasonable on its face, but honestly I have no idea. We haven’t really dealt with scales and feature learners like this before. I assume a big part of what the model is doing is learning good representations that allow it to learn more/better from each example as training goes on. Given that, I can imagine arguments either way. On one hand, good representations could mean the model is discerning on its own what’s important (so maybe data cleaning doesn’t matter much). On the other, maybe noisy data (say, text with lots of irreducible entropy, though that’s not necessarily what “garbage text” looks like; it’s often the opposite, and I guess it depends on how you filter in practice) could take up a disproportionately large share of model capacity & training signal as representations of “good” (i.e. compressible) data get better, thereby adding a bunch of noise to training and slowing it down. These are just random intuitive guesses though. Seems like an empirical question, and the answer might depend a lot on the details.
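For concreteness, here is a minimal sketch of the general idea of classifier-based quality filtering mentioned above: score each crawled document by how much it resembles a curated "positive" set, then keep only the high scorers. To be clear, this is not the GPT-3 pipeline. The scorer is a toy smoothed unigram log-odds model I swapped in for illustration, and every name, example document, and the 0.5 threshold are made up.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer; a real pipeline would use something better.
    return text.lower().split()


def train_log_odds(positive_docs, negative_docs, smoothing=1.0):
    """Per-token log-odds of appearing in curated vs. raw text (add-k smoothed)."""
    pos = Counter(t for d in positive_docs for t in tokenize(d))
    neg = Counter(t for d in negative_docs for t in tokenize(d))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + smoothing * len(vocab)
    neg_total = sum(neg.values()) + smoothing * len(vocab)
    return {
        t: math.log((pos[t] + smoothing) / pos_total)
        - math.log((neg[t] + smoothing) / neg_total)
        for t in vocab
    }


def quality_score(text, log_odds):
    """Mean token log-odds squashed into (0, 1); unseen tokens count as neutral."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    mean = sum(log_odds.get(t, 0.0) for t in tokens) / len(tokens)
    return 1.0 / (1.0 + math.exp(-mean))


def filter_corpus(docs, log_odds, threshold=0.5):
    # Hypothetical hard threshold; a softer, stochastic keep rule is also possible.
    return [d for d in docs if quality_score(d, log_odds) > threshold]


# Toy stand-ins for the curated (positive) and unfiltered (negative) sets.
positive = [
    "the history of the roman empire spans many centuries",
    "the theory of evolution explains the diversity of life",
]
negative = [
    "click here buy now cheap cheap cheap",
    "win free prizes click here now",
]
weights = train_log_odds(positive, negative)

crawl = ["the history of life on earth", "buy cheap prizes now click here"]
kept = filter_corpus(crawl, weights)
print(kept)
```

Note that a filter like this only keeps text that looks like the positive set. Whether that also removes the high-entropy or capacity-hogging text discussed above is exactly the empirical question.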