Once AGI works, everything else is largely moot. Absent AGI, synthetic data is a likely next step. It’s not currently used for pre-training at scale; there are still more straightforward things to do, like better data curation, augmentation of natural data, multimodality, and synthetic datasets for fine-tuning (rather than for the bulk of pre-training). It’s plausible, though not obvious, that even absent AGI it’s relatively straightforward to generate useful synthetic data with sufficiently good models trained on natural data, which in turn leads to better models that generate better synthetic data.
This is not about making progress on ideas beyond current natural data (human culture), but about making models smarter despite horrible sample efficiency. If this is enough to get AGI, it’s unnecessary for synthetic data to make any progress on actual ideas until that point.
Results like Galactica (see Table 2 therein) illustrate how the content of a dataset can influence the outcome; that’s the kind of thing I mean by higher quality datasets. You won’t find 20T natural tokens like that for training a 1T LLM, but it might be possible to generate them, and it might turn out that the results improve even though those tokens largely rehash the same stuff that was in the original 100B tokens on similar topics. AFAIK the experiments to test this with better models (or scaling laws for this effect) haven’t been done/published yet. It’s possible that this doesn’t work beyond some modest asymptote, and ends up no better than any of the other tricks currently being stacked.
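To make the scale concrete, here is a quick back-of-the-envelope sketch using only the round figures from the paragraph above (1T parameters, 20T training tokens, about 100B original high-quality tokens). Treating those as exact numbers is an assumption, and the variable names are just for illustration.

```python
# Round figures from the paragraph above; treating them as exact is an assumption.
params = 1 * 10**12          # hypothetical 1T-parameter LLM
target_tokens = 20 * 10**12  # 20T tokens wanted for training it
seed_tokens = 100 * 10**9    # ~100B original high-quality natural tokens

# How many training tokens each parameter would see.
tokens_per_param = target_tokens / params

# How much larger the generated dataset is than the natural seed corpus.
expansion_factor = target_tokens / seed_tokens

print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"expansion over the seed corpus: {expansion_factor:.0f}x")
```

So the generated data would have to stretch the original material by a factor of about 200, which is why it matters whether rehashing the same content keeps helping or hits an asymptote early.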