The two broad paths to general intelligence—RL and LLMs—both had started to stall by the beginning of 2023.
As Chinchilla had shown, data is just as important as compute for training smarter models. The massive jump in LLM performance over the prior years occurred because of a one-time increase in data, namely training on nearly everything interesting that humans have ever written. Unless the amount of high-quality human text could be increased by 10x, that leap would never happen again. Attempts to improve models by pulling text from YouTube with Whisper made the simulacra within the models much better YouTubers, but only marginally better agents. Given that even the largest language models struggle to learn long-tail knowledge reflected in fewer than 100 documents, the inability to 10x or 100x high-quality data proved an immense blocker.
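A rough back-of-the-envelope sketch of why the data wall bites, using the approximate Chinchilla rule of thumb of ~20 training tokens per parameter and C ≈ 6·N·D; the corpus size below is an illustrative assumption, not a figure from this post:

```python
# Rough Chinchilla-style arithmetic: compute-optimal training wants tokens
# roughly proportional to parameters (~20 tokens/parameter), so scaling
# compute without more data quickly runs past the available human text.

def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count for a model with `params` parameters."""
    return params * tokens_per_param

def training_flops(params: float, tokens: float) -> float:
    """Standard C ~= 6 * N * D estimate of training compute."""
    return 6.0 * params * tokens

if __name__ == "__main__":
    available_text_tokens = 10e12  # ~10T tokens: an illustrative guess at usable human text
    for params in (70e9, 500e9, 2e12):
        need = chinchilla_optimal_tokens(params)
        print(
            f"{params / 1e9:>6.0f}B params: wants ~{need / 1e12:.0f}T tokens "
            f"({need / available_text_tokens:.1f}x the assumed corpus), "
            f"~{training_flops(params, need):.1e} FLOPs"
        )
```

Even under generous assumptions about corpus size, the token requirement crosses the assumed supply well before the trillion-parameter regime, which is the sense in which performance gains from "more text" were a one-time windfall.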
Furthermore, text ultimately failed to capture many relevant human skills, because much of what humans do is never written down. So in early 2024, GPT-4 was again an improvement over GPT-3, but was no closer to being a junior software developer than GPT-3: it could not read a Jira ticket, glance at the Figma files, ping the designer to resolve an ambiguity, make the changes, and screenshot them for the PR, and so on. Massive bureaucracies of prompts and agglomerations of other models tried to accomplish all these tasks, but immediately introduced massive amounts of human engineering that didn't work very well. So it remained good at copywriting and similar tasks but just wasn't that useful broadly.
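For concreteness, a minimal sketch of the kind of "junior developer" loop described above; every object here (jira, figma, chat, repo, llm and their methods) is hypothetical glue code, not a real API, and the point is how many distinct tools and judgment calls the workflow requires:

```python
# Hypothetical sketch of the junior-developer workflow from the paragraph above.
# None of these clients are real libraries; they stand in for the hand-engineered
# scaffolding the post says had to be built around the model.

def handle_ticket(ticket_id: str, jira, figma, chat, repo, llm) -> None:
    ticket = jira.get_ticket(ticket_id)                 # read the Jira ticket
    design = figma.get_file(ticket.design_link)         # glance at the Figma file

    # Ask the model whether the spec is ambiguous; if so, ping the designer.
    question = llm.ask(f"Is anything ambiguous?\n{ticket.text}\n{design.summary}")
    if question:
        answer = chat.ask(ticket.designer, question)     # resolve the ambiguity
        ticket.text += f"\nClarification: {answer}"

    patch = llm.ask(f"Write a patch for:\n{ticket.text}\nRepo:\n{repo.relevant_files(ticket)}")
    repo.apply(patch)                                    # make the changes
    screenshot = repo.render_preview()                   # screenshot them for the PR
    repo.open_pr(title=ticket.title, body=ticket.text, images=[screenshot])
```

Each line of this sketch is a separate integration that, in the scenario above, had to be designed, maintained, and error-handled by humans rather than learned by the model.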
RL similarly stalled. Though RL over massive collections of policy-optimized agents could produce behavior of reasonable generality, such methods proved extremely inefficient at producing it. Headline results such as DeepMind's 2019 StarCraft victory had taken literal centuries of in-game training time; and though more efficient algorithms could be applied to toy problems such as Atari, nothing seemed to work in circumstances where essentially infinite data could not be generated.
More to the point, RL remained limited to (1) episodic settings, (2) over a fixed distribution, (3) using models that inevitably had no understanding of the real-world domain, and (4) operating in distinct training and inference modes. All of this is different from what an AGI would require, and very little progress had been made on these missing ingredients.
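To make limitations (1), (2), and (4) concrete, here is the standard episodic training loop in Gymnasium-style code; the agent object and its act/update methods are placeholders, and the environment choice is arbitrary:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")          # (2) one fixed task distribution, chosen up front

def train(agent, episodes: int = 1000) -> None:
    for _ in range(episodes):          # (1) strictly episodic: the world resets every time
        obs, _ = env.reset()
        done = False
        while not done:
            action = agent.act(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            agent.update(obs, action, reward, next_obs)   # learning happens only here
            obs, done = next_obs, terminated or truncated

def deploy(agent, obs):
    return agent.act(obs)              # (4) at inference time the policy is frozen; no further learning
```

An AGI operating in the real world gets none of these conveniences: no resets, no fixed distribution of tasks, and no clean boundary between learning and acting.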
In the following years, DL would of course make a lot of progress. It would take many artists' jobs. DeepMind achieved superhuman mathematical performance in 2026, which had a host of applications and transformed the field of mathematics. A successful and apparently true physics “theory of everything” was developed by a research laboratory in China in 2028. But, as with drug discovery, image diffusion, and text generation, these successes happened in static domains with carefully human-designed loss functions. DL that could deal with the real world remained absent.
“As the nuke began to detonate, an incredible coincidence happened. All the neutrons missed hitting further atoms of plutonium and the core fizzled out, leaving a few glowing masses of plutonium.”
It could happen, but no one has even tried to give LLMs the APIs to access a Jira.