As much as I agree that things are about to get really weird, that first diagram is a bit too optimistic. There is a limit to how much data humanity has available to train AI, and it seems doubtful we can learn to use data 1000x more effectively in such a short span of time. For all we know, there could be yet another AI winter coming—I don’t think we will get that lucky, though.
While there is a limit to the current text datasets, and expanding them with high-quality human-generated text would be expensive, I’m afraid that’s not going to be a blocker.
Multimodal training already bypasses text-only limitations. Beyond just extracting text tokens from YouTube, the video and audio themselves could be used as training data, and their informational richness relative to text seems very high.
Further, as Gato demonstrates, there’s nothing stopping one model from spanning hundreds of distinct tasks, and many of those tasks can draw on effectively infinite data fountains, like simulations. Learning rigid-body physics in isolation isn’t going to teach a model English, but as one of a few thousand other tasks, it could push the internal model toward something more general. (There’s a paper, whose name I have unfortunately forgotten, that created a set of permuted tasks large enough that the model could not learn to perform each task individually, and instead had to infer what the task was from within the context window. It worked. Despite these being toy task permutations, I suspect something like this generalizes at sufficient scale.)
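To make the permuted-tasks idea concrete, here is a minimal toy sketch (my own construction, not taken from the forgotten paper) of how such a task distribution can be generated: each "task" relabels a fixed symbol set with a random permutation, so with enough tasks a learner cannot memorize the mappings and is forced to read the mapping off the examples in its context window instead.

```python
# Toy sketch of permuted tasks forcing in-context task inference.
# Every name here is hypothetical; this is an illustration of the idea,
# not code from any specific paper.
import random

SYMBOLS = list("abcdefgh")

def make_task(rng):
    """A task is a random bijection from symbols to class labels."""
    labels = list(range(len(SYMBOLS)))
    rng.shuffle(labels)
    return dict(zip(SYMBOLS, labels))

def make_episode(task, rng, n_context=16):
    """A context of (symbol, label) pairs plus one query symbol.
    Only the context determines the answer: the symbol's identity
    alone says nothing about which permutation is in play."""
    context = [(s, task[s]) for s in rng.choices(SYMBOLS, k=n_context)]
    query = rng.choice(SYMBOLS)
    return context, query, task[query]

rng = random.Random(0)
# Far more tasks than a model could memorize individually.
tasks = [make_task(rng) for _ in range(10_000)]
context, query, answer = make_episode(tasks[0], rng)

# A trivial context-reading baseline: answer the query by looking it
# up in the context. A model that learns this lookup behavior has
# learned *task inference* rather than any single task.
seen = dict(context)
if query in seen:
    assert seen[query] == answer
```

The point of the construction is that the number of distinct permutations (8! here, astronomically more for larger symbol sets) swamps the model's capacity to treat each one as a separate memorized task, so the only strategy that generalizes is inferring the task from the context window.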
And it appears that sufficiently capable models can refine themselves in various ways. At the moment, that refinement doesn’t produce runaway capability gains, but there’s no guarantee that holds as models improve.
This suggests that without much more data we don’t get much better token prediction, but arguably the modern quality of token prediction is already more than sufficient for AGI: we’ve reached a token-prediction overhang. What’s missing is something else, and it won’t be resolved by better token prediction. (And it seems there are still ways to improve token prediction a fair bit, but again that’s possibly irrelevant for timelines.)
Very insightful, thanks for the clarification, as dooming as it is.