Thanks for the post! I agree on many of the points, but have a few additions:
First, using “LLM” instead of something like “Claude Code” throughout the piece feels it doesn’t capture the current capabilities of AI systems. An “LLM in a loop” a.k.a “AI Agent” e.g. “Claude Code” is a different animal than base LLMs. Through search capabilities, AI Agents have much longer effective context windows. Through coding, they can have much greater reliability and precision in things like math. Through queries to API’s, they can have much more real time data. By writing to output files, they can track progress, thoughts, and ideas much more effectively. And they can make novel discoveries about the world. In my own research work, I’ve seen a coding agent develop an analysis from a vague prompt, execute the code, iterate through error/issues, and make conclusions that no human knows. Getting this knowledge back into the base LLM isn’t easy today like you mentioned. Since the local instance of Claude Code is the only AI who knows about the finding. But you can imagine a future when such findings are automatically added to the dataset for the next pretraining run. Slow and expensive, but technically recursive self improvement? Yes.
Second, I think there are already success cases of RL-trained models today pursuing goals, but not in a completely ruthlessly way. I see it all the time working with coding agents. You can give them a task, and they write a bunch of code, try it, and sometimes succeed at the task. But they don’t always hack the reward function. For example, if they are trying to pass a unit test, they could just modify the test. This does occasionally happen. But the majority of the time, they understand my intent (or at least have been encourage via RL not to modify the success criteria) and thus complete the task how I desire, instead of the quick, easy, wrong solution. There is a common theme here in machine learning. You don’t want the model to cheat or shortcut. And you can get systems to be successful even when they know a cheat exists. As long as the training process sufficiently encourages such behavior. While I think this is reason for optimism, I wholeheartedly agree there are Major problems that yet exist. Like you mentioned, maintaining these desires for the “correct” solution instead of shortcuts across many model updates is likely hard. Also, the “correct” solution gets harder and harder to define as the complexity of the tradeoffs and system get more complicated. You could imagine a future AI Agent falling into a sub-optimal solution that no human has thought of. But the AI agent thinks it’s acceptable since so human ever mentioned ruling it out. The solution space is just so large, a notion of ethics or correctness needs to generalize beyond the training data we can provide. And beyond that there are still doom scenarios where an accident destroys the world. Maybe the AI is conducting some nuclear research to improve clean power, and it makes an unintended mistake that leads to a runway reaction.
Thanks for the post! I agree on many of the points, but have a few additions:
First, using “LLM” instead of something like “Claude Code” throughout the piece feels it doesn’t capture the current capabilities of AI systems. An “LLM in a loop” a.k.a “AI Agent” e.g. “Claude Code” is a different animal than base LLMs. Through search capabilities, AI Agents have much longer effective context windows. Through coding, they can have much greater reliability and precision in things like math. Through queries to API’s, they can have much more real time data. By writing to output files, they can track progress, thoughts, and ideas much more effectively. And they can make novel discoveries about the world. In my own research work, I’ve seen a coding agent develop an analysis from a vague prompt, execute the code, iterate through error/issues, and make conclusions that no human knows. Getting this knowledge back into the base LLM isn’t easy today like you mentioned. Since the local instance of Claude Code is the only AI who knows about the finding. But you can imagine a future when such findings are automatically added to the dataset for the next pretraining run. Slow and expensive, but technically recursive self improvement? Yes.
Second, I think there are already success cases of RL-trained models today pursuing goals, but not in a completely ruthlessly way. I see it all the time working with coding agents. You can give them a task, and they write a bunch of code, try it, and sometimes succeed at the task. But they don’t always hack the reward function. For example, if they are trying to pass a unit test, they could just modify the test. This does occasionally happen. But the majority of the time, they understand my intent (or at least have been encourage via RL not to modify the success criteria) and thus complete the task how I desire, instead of the quick, easy, wrong solution. There is a common theme here in machine learning. You don’t want the model to cheat or shortcut. And you can get systems to be successful even when they know a cheat exists. As long as the training process sufficiently encourages such behavior. While I think this is reason for optimism, I wholeheartedly agree there are Major problems that yet exist. Like you mentioned, maintaining these desires for the “correct” solution instead of shortcuts across many model updates is likely hard. Also, the “correct” solution gets harder and harder to define as the complexity of the tradeoffs and system get more complicated. You could imagine a future AI Agent falling into a sub-optimal solution that no human has thought of. But the AI agent thinks it’s acceptable since so human ever mentioned ruling it out. The solution space is just so large, a notion of ethics or correctness needs to generalize beyond the training data we can provide. And beyond that there are still doom scenarios where an accident destroys the world. Maybe the AI is conducting some nuclear research to improve clean power, and it makes an unintended mistake that leads to a runway reaction.