There’s AGI, with autonomous agency and a wide variety of open-ended objectives, and there’s generation of synthetic data, which keeps natural tokens from running out, both in quantity and in quality. My impression is that the latter is likely to start happening by the time GPT-5 rolls out.
It appears this situation could be more accurately attributed to Human constraints rather than AI limitations? Upon reaching a stage where AI systems, such as GPT models, have absorbed all human-generated information, conversations, images, videos, discoveries, and insights, these systems should begin to pioneer their own discoveries and understandings?
While we can expect Humans to persist (hopefully) and continue generating more conversations, viewpoints, and data for AI to learn from, AI’s growth and learning shouldn’t necessarily be confined to the pace or scale of Human discoveries and data. AI systems should be capable of progressing beyond the point where Human contribution slows, continuing to create their own discoveries, dialogues, reflections, and more to foster continuous learning and training?
Quality training data might be even more terrifying than scaling. Leela Zero plays superhuman Go at only 50M parameters, so who knows what happens when 100B-parameter LLMs start getting increasingly higher-quality datasets for pre-training.
Where would these “higher quality datasets” come from? Do they already exist? And, if so, why are they not being used already?
Once AGI works, everything else is largely moot. Synthetic data is a likely next step absent AGI. It’s not currently used for pre-training at scale, since there are still more straightforward things to be done, like better data curation, augmentation of natural data, multimodality, and synthetic datasets for fine-tuning (rather than for the bulk of pre-training). It’s not obvious, but it is plausible, that even absent AGI it’s relatively straightforward to generate useful synthetic data with sufficiently good models trained on natural data, which leads to better models that generate better synthetic data.
This is not about making progress on ideas beyond current natural data (human culture), but about making models smarter despite horrible sample efficiency. If this is enough to get AGI, it’s unnecessary for synthetic data to make any progress on actual ideas until that point.
Results like Galactica (see Table 2 therein) illustrate how the content of the dataset can influence the outcome; that’s the kind of thing I mean by higher quality datasets. You won’t find 20T natural tokens for training a 1T LLM that are like that, but it might be possible to generate them, and it might turn out that the results improve despite those tokens largely rehashing the same stuff that was in the original 100B tokens on similar topics. AFAIK the experiments to test this with better models (or scaling laws for this effect) haven’t been done/published yet. It’s possible that this doesn’t work at all, or only up to some modest asymptote, no better than any of the other tricks currently being stacked.
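To give a sense of the scale involved, here is a back-of-the-envelope sketch using only the round figures quoted above (100B natural tokens, 20T generated tokens, a 1T-parameter model). The variable names are made up for illustration, and the numbers are the comment's hypotheticals, not measurements.

```python
# Round figures taken from the comment above; purely illustrative.
natural_tokens = 100 * 10**9   # "the original 100B tokens"
target_tokens = 20 * 10**12    # "20T ... tokens"
model_params = 1 * 10**12      # "a 1T LLM"

# How many training tokens each parameter would see.
tokens_per_param = target_tokens / model_params

# How much larger the generated corpus is than the natural one it rehashes.
expansion_factor = target_tokens / natural_tokens

print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"expansion over the natural corpus: {expansion_factor:.0f}x")
```

So the hypothetical amounts to asking whether a corpus expanded a couple of hundred times over from the same underlying material still helps, which is exactly the part that hasn’t been tested.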