We train LLMs not only on the artifacts from our best thinkers but, in 99.95% of cases, on web crawls, social media, and code.
Concluding from a paper whose abstract says "commercial models rarely detail their data" that you know what makes up 99.95% of training data is a huge reasoning mistake.
Given public communication, it's also pretty clear that synthetic data makes up more than 0.05% of the data. Elon Musk has already talked about training a model on 100% synthetic data.
I agree that commercial models don't detail their data; the point is to have an estimate. I'd guess Soldaini et al. ('Dolma') did their best to collect the data, and we can assume commercial models have similar sources.
I agree that commercial models don't detail their data; the point is to have an estimate.
That's "I searched for the keys under the streetlight." The keys are not under the streetlight.
I'd guess Soldaini et al. ('Dolma') did their best to collect the data, and we can assume commercial models have similar sources.
Soldaini et al. have far less capital to collect data than the big companies building models. The big model companies, on the other hand, can pay billions for their data. This means they can license data sources that Soldaini et al. can't, and it means they can spend a lot of capital on synthetic data.
Soldaini et al. does not include libgen/Anna's Archive, but it's likely that all of the big companies use it, except Google, which has its own scans of all the books it uses. Anthropic paid out over a billion in the settlement for that copyright violation.
Even apart from paying for data or using pirated data, the big companies have a lot of usage data. The most common explanation for sycophancy in AI models is that it comes from the models optimizing for users clicking thumbs-up.