Thanks for the post! I agree on many of the points, but have a few additions:
First, using “LLM” instead of something like “Claude Code” throughout the piece feels it doesn’t capture the current capabilities of AI systems. An “LLM in a loop” a.k.a “AI Agent” e.g. “Claude Code” is a different animal than base LLMs. Through search capabilities, AI Agents have much longer effective context windows. Through coding, they can have much greater reliability and precision in things like math. Through queries to API’s, they can have much more real time data. By writing to output files, they can track progress, thoughts, and ideas much more effectively. And they can make novel discoveries about the world. In my own research work, I’ve seen a coding agent develop an analysis from a vague prompt, execute the code, iterate through error/issues, and make conclusions that no human knows. Getting this knowledge back into the base LLM isn’t easy today like you mentioned. Since the local instance of Claude Code is the only AI who knows about the finding. But you can imagine a future when such findings are automatically added to the dataset for the next pretraining run. Slow and expensive, but technically recursive self improvement? Yes.
Second, I think there are already success cases of RL-trained models today pursuing goals, but not in a completely ruthlessly way. I see it all the time working with coding agents. You can give them a task, and they write a bunch of code, try it, and sometimes succeed at the task. But they don’t always hack the reward function. For example, if they are trying to pass a unit test, they could just modify the test. This does occasionally happen. But the majority of the time, they understand my intent (or at least have been encourage via RL not to modify the success criteria) and thus complete the task how I desire, instead of the quick, easy, wrong solution. There is a common theme here in machine learning. You don’t want the model to cheat or shortcut. And you can get systems to be successful even when they know a cheat exists. As long as the training process sufficiently encourages such behavior. While I think this is reason for optimism, I wholeheartedly agree there are Major problems that yet exist. Like you mentioned, maintaining these desires for the “correct” solution instead of shortcuts across many model updates is likely hard. Also, the “correct” solution gets harder and harder to define as the complexity of the tradeoffs and system get more complicated. You could imagine a future AI Agent falling into a sub-optimal solution that no human has thought of. But the AI agent thinks it’s acceptable since so human ever mentioned ruling it out. The solution space is just so large, a notion of ethics or correctness needs to generalize beyond the training data we can provide. And beyond that there are still doom scenarios where an accident destroys the world. Maybe the AI is conducting some nuclear research to improve clean power, and it makes an unintended mistake that leads to a runway reaction.
The claim “modern LLMs can pursue goals and act like agents” does not contradict the claim “modern LLMs get their capabilities primarily from imitative learning”, right? Because there are examples in the pretraining data (including in the specially-commissioned proprietary expert-created data) where human-created text enacts the pursuit of goals. Right? See also here.
And they can make novel discoveries about the world. In my own research work, I’ve seen a coding agent develop an analysis from a vague prompt, execute the code, iterate through error/issues, and make conclusions that no human knows.
I do think LLMs struggle when they depart from what’s in the pretraining data, but the meaning of that is a bit tricky to pin down. Like, if I ask you to “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”, you can do that in a fraction of a second, and correctly answer follow-up questions about that scenario, even though this specific mental image has never happened before in the history of the world. And LLMs can do that kind of thing too. Is that “departing from what’s in the pretraining data”? My answer is: No, not in the sense that matters. When you say your LLM “make[s] conclusions that no human knows”, I suspect it’s a similar kind of thing: it’s not “departing from the pretraining data” in the sense that matters. Indeed, anything that a third party can simply read and immediately understand is not “departing from the pretraining data” in the sense that matters, even if the person didn’t already know it.
By contrast, if you don’t know linear algebra, you can’t simply read a linear algebra textbook and immediately understand it. You need to spend many days and weeks internalizing these new ideas.
Anyway, in the post, I tried to be maximally clear-cut, by using the example of how billions of humans over thousands of years inventing language, science, technology, industry, culture, and everything else in the $100T global economy entirely from scratch, without angels dropping new training data from the heavens. I very strongly don’t believe that billions of LLMs over thousands of years, in a sealed datacenter without any human intervention or human data, could do that. That would be real departure from the pretraining data. See also here and the first section here.
Thanks for the post! I agree on many of the points, but have a few additions:
First, using “LLM” instead of something like “Claude Code” throughout the piece feels it doesn’t capture the current capabilities of AI systems. An “LLM in a loop” a.k.a “AI Agent” e.g. “Claude Code” is a different animal than base LLMs. Through search capabilities, AI Agents have much longer effective context windows. Through coding, they can have much greater reliability and precision in things like math. Through queries to API’s, they can have much more real time data. By writing to output files, they can track progress, thoughts, and ideas much more effectively. And they can make novel discoveries about the world. In my own research work, I’ve seen a coding agent develop an analysis from a vague prompt, execute the code, iterate through error/issues, and make conclusions that no human knows. Getting this knowledge back into the base LLM isn’t easy today like you mentioned. Since the local instance of Claude Code is the only AI who knows about the finding. But you can imagine a future when such findings are automatically added to the dataset for the next pretraining run. Slow and expensive, but technically recursive self improvement? Yes.
Second, I think there are already success cases of RL-trained models today pursuing goals, but not in a completely ruthlessly way. I see it all the time working with coding agents. You can give them a task, and they write a bunch of code, try it, and sometimes succeed at the task. But they don’t always hack the reward function. For example, if they are trying to pass a unit test, they could just modify the test. This does occasionally happen. But the majority of the time, they understand my intent (or at least have been encourage via RL not to modify the success criteria) and thus complete the task how I desire, instead of the quick, easy, wrong solution. There is a common theme here in machine learning. You don’t want the model to cheat or shortcut. And you can get systems to be successful even when they know a cheat exists. As long as the training process sufficiently encourages such behavior. While I think this is reason for optimism, I wholeheartedly agree there are Major problems that yet exist. Like you mentioned, maintaining these desires for the “correct” solution instead of shortcuts across many model updates is likely hard. Also, the “correct” solution gets harder and harder to define as the complexity of the tradeoffs and system get more complicated. You could imagine a future AI Agent falling into a sub-optimal solution that no human has thought of. But the AI agent thinks it’s acceptable since so human ever mentioned ruling it out. The solution space is just so large, a notion of ethics or correctness needs to generalize beyond the training data we can provide. And beyond that there are still doom scenarios where an accident destroys the world. Maybe the AI is conducting some nuclear research to improve clean power, and it makes an unintended mistake that leads to a runway reaction.
The claim “modern LLMs can pursue goals and act like agents” does not contradict the claim “modern LLMs get their capabilities primarily from imitative learning”, right? Because there are examples in the pretraining data (including in the specially-commissioned proprietary expert-created data) where human-created text enacts the pursuit of goals. Right? See also here.
I do think LLMs struggle when they depart from what’s in the pretraining data, but the meaning of that is a bit tricky to pin down. Like, if I ask you to “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”, you can do that in a fraction of a second, and correctly answer follow-up questions about that scenario, even though this specific mental image has never happened before in the history of the world. And LLMs can do that kind of thing too. Is that “departing from what’s in the pretraining data”? My answer is: No, not in the sense that matters. When you say your LLM “make[s] conclusions that no human knows”, I suspect it’s a similar kind of thing: it’s not “departing from the pretraining data” in the sense that matters. Indeed, anything that a third party can simply read and immediately understand is not “departing from the pretraining data” in the sense that matters, even if the person didn’t already know it.
By contrast, if you don’t know linear algebra, you can’t simply read a linear algebra textbook and immediately understand it. You need to spend many days and weeks internalizing these new ideas.
Anyway, in the post, I tried to be maximally clear-cut, by using the example of how billions of humans over thousands of years inventing language, science, technology, industry, culture, and everything else in the $100T global economy entirely from scratch, without angels dropping new training data from the heavens. I very strongly don’t believe that billions of LLMs over thousands of years, in a sealed datacenter without any human intervention or human data, could do that. That would be real departure from the pretraining data. See also here and the first section here.