There is enough natural text data until 2026-2028, as I describe in the Peak Data section of the linked post. It's not very good data, but at 2,500x the raw compute of the original GPT-4 (and possibly 10,000x-25,000x in effective compute, due to algorithmic improvements in pretraining), that's a lot of headroom that doesn't depend on inventing new things (such as synthetic data suitable for improving general intelligence through pretraining the way natural text data does).
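As a rough sanity check on these multipliers, here is a minimal back-of-the-envelope sketch in Python. The ~2e25 FLOPs baseline for the original GPT-4 and the 4x-10x algorithmic efficiency range are assumptions I'm supplying for illustration, not figures from the post itself:

```python
# Back-of-the-envelope check of the compute multipliers above.
# Assumed baseline: ~2e25 FLOPs for the original GPT-4 (a commonly cited public estimate).
GPT4_FLOPS = 2e25
RAW_MULTIPLIER = 2_500  # raw compute scaling cited above

raw_flops = GPT4_FLOPS * RAW_MULTIPLIER
print(f"raw compute: {raw_flops:.0e} FLOPs")  # ~5e28 FLOPs

# Assumed 4x-10x gain from pretraining algorithmic improvements,
# which reproduces the 10,000x-25,000x effective-compute range.
for algo_gain in (4, 10):
    print(f"{algo_gain}x algorithmic gain -> {RAW_MULTIPLIER * algo_gain:,}x effective compute")
```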
Insufficient data could in principle be an issue for making good use of 5e28 FLOPs, but actually getting 5e28 FLOPs by 2028 (from a single training system) only requires funding. The decisions about this don't need to be taken based on the AIs that exist today; they'll be taken based on the AIs of 2026-2027, trained on the 1 GW training systems being built this year. With o3-like post-training, an LLM becomes more useful and more impressive, so the chances of getting that project funded improve (compared to the absence of such techniques).
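To make the "1 GW training system" framing concrete, here is a hedged sketch of how system power translates into training FLOPs. Every constant in it (per-accelerator power and throughput, utilization, run length) is an illustrative assumption, not a figure from this comment:

```python
# Illustrative conversion from training-system power to training FLOPs.
# All constants below are assumptions for the sketch, not claims.
POWER_W = 1e9                # 1 GW training system (from the text)
WATTS_PER_ACCEL = 1_500      # assumed all-in power per accelerator, incl. cooling/networking
FLOPS_PER_ACCEL = 2.5e15     # assumed dense BF16 throughput per accelerator
UTILIZATION = 0.4            # assumed model FLOPs utilization
RUN_SECONDS = 100 * 86_400   # assumed ~100-day training run

n_accels = POWER_W / WATTS_PER_ACCEL
run_flops = n_accels * FLOPS_PER_ACCEL * UTILIZATION * RUN_SECONDS
print(f"~{n_accels:.0f} accelerators, ~{run_flops:.1e} FLOPs per run")
```

Under these assumptions a 1 GW system yields roughly 6e27 FLOPs per run, so getting to 5e28 calls for a substantially larger (or more efficient) 2028-era system, which is a question of funding and construction rather than of new techniques.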