The Term Recursive Self-Improvement Is Often Used Incorrectly
Also on my substack.
The term Recursive Self-Improvement (RSI) now seems to get used for any case where AI automates AI R&D. I believe this is importantly different from its original meaning, and the difference changes some of the key consequences.
OpenAI has stated that their goal is recursive self-improvement, with projections of hundreds of thousands of automated AI R&D researchers by next year and full AI researchers by 2028. This appears to be AI-automated AI research rather than RSI in the narrow sense.
When Eliezer Yudkowsky discussed RSI in 2008, he was referring specifically to an AI instance improving itself by rewriting the cognitive algorithm it is running on—what he described as “rewriting your own source code in RAM.” According to the LessWrong wiki, RSI refers to “making improvements on one’s own ability of making self-improvements.” However, current AI systems have no special insights into their own opaque functioning. Automated R&D might mostly consist of curating data, tuning parameters, and improving RL-environments to try to hill-climb evaluations much like human researchers do.
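To make that concrete, here is a minimal, purely illustrative sketch of what such black-box, extrospective automation could look like: the automated researcher only turns external knobs (learning rate, data curation, RL-environment settings) and keeps whatever moves an evaluation score up. The `train_and_evaluate` function and the specific knobs are invented stand-ins for a real training-and-benchmarking pipeline, not anyone’s actual setup.

```python
import random

# Invented stand-in for a real pipeline: launch a training run with this
# config, then score the resulting checkpoint on held-out evaluations.
def train_and_evaluate(config: dict) -> float:
    return (
        -abs(config["lr"] - 3e-4) * 1e3        # some learning rate is locally optimal
        + 0.1 * config["rl_env_difficulty"]    # harder RL environments help a little
        + 0.05 * config["data_quality"]        # better data curation helps a little
        + random.gauss(0, 0.02)                # evaluation noise
    )

def propose_tweak(config: dict) -> dict:
    """Mutate one knob at random -- no insight into the model's internals."""
    new = dict(config)
    knob = random.choice(list(new))
    new[knob] *= random.uniform(0.8, 1.25)
    return new

config = {"lr": 1e-3, "rl_env_difficulty": 1.0, "data_quality": 1.0}
best_score = train_and_evaluate(config)

for step in range(50):
    candidate = propose_tweak(config)
    score = train_and_evaluate(candidate)
    if score > best_score:                     # keep whatever improves the eval
        config, best_score = candidate, score

print(config, best_score)
```

Nothing in this loop requires the system to understand why a change works, only that the evaluation number went up.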
Eliezer concluded that RSI (in the narrow sense) would almost certainly lead to fast takeoff. The situation is more complex for AI-automated R&D, where the AI does not understand what it is doing. I still expect AI-automated R&D to substantially speed up AI development.
Why This Distinction Matters
Eliezer described the critical transition as when “the AI’s metacognitive level has now collapsed to identity with the AI’s object level.” I believe he was basically imagining something like the human mind and evolution merging into a single process with a single goal: the process that designs the cognitive algorithm and the cognitive algorithm itself becoming one and the same. As an example, imagine the model realizes that its working memory is too small to be very effective at R&D, and it directly edits its working memory.
This appears less likely if the AI researcher is staring at a black box, whether of itself or of another model. The AI agent might understand that its working memory or coherence isn’t good enough, but that doesn’t mean it understands how to increase it. Without this self-transparency, I don’t think the merge Eliezer described would happen. It is also more likely that the process derails, for example if the next generation of AIs being designed starts reward-hacking the RL environments built by the less capable AIs of the previous generation.
The dynamics differ significantly:
True RSI: Direct self-modification with self-transparency and fast feedback loops → fast takeoff very likely
AI-automated research: Systems don’t understand what they are doing, slower feedback loops, potentially operating on other systems rather than directly on themselves
Alignment Preservation
This difference has significant implications:
True RSI: The AI likely understands how its preferences are encoded, potentially making goal preservation more tractable
AI-automated research: The AIs would also face alignment problems when building successors, with each successive generation potentially drifting further from original goals
Loss of Human Control
The basic idea that each new generation of AI will be better at AI research still stands, so we should still expect rapid progress. In both cases, the default outcome of this is eventually loss of human control and the end of the world.
Could We Still Get True RSI?
Probably eventually, e.g. through automated researchers discovering more interpretable architectures.
I think that Eliezer expected AI that was at least somewhat interpretable by default; history played out differently. But he was still right to focus on AI improving AI as a critical concern, even if it’s taking a different form than he anticipated.
See also: Nate Soares has also written about RSI in this narrow sense. Comments between Nate and Paul Christiano touch on this topic.
Taking this into account, it seems important for interpretability researchers to consider the risk that their work enables RSI, particularly if their interpretability methods provide ways to directly edit the AI itself.
It’s always been a concern that interpretability research could accelerate AI R&D, but I think this consideration is more worrying if you take into account RSI. Compared to humans, AI is good at doing simple, repetitive tasks, but it’s very hard for it to make even one big conceptual breakthrough. Interpretability methods lend themselves to the former type of task: if an AI were sufficiently interpretable, you could tell it to look at millions of tiny circuits in its own brain and tweak them to improve performance.
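As a purely hypothetical illustration of that kind of task (not any existing interpretability method): treat the model as a huge number of small components, try a tiny tweak on each one, and keep every tweak that nudges an evaluation upward. Everything here, including the `performance` function, is an invented stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in: the "circuits" are just scalar gains on 1,000 components,
# and "performance" is a made-up function of those gains. A real version would
# re-score the model on evaluations after editing the corresponding weights.
true_importance = rng.normal(size=1000)
gains = np.ones(1000)

def performance(g: np.ndarray) -> float:
    return float(true_importance @ g)

best = performance(gains)
for i in range(len(gains)):            # enumerate every tiny circuit
    for tweak in (0.9, 1.1):           # try damping and boosting it
        trial = gains.copy()
        trial[i] *= tweak
        score = performance(trial)
        if score > best:               # keep the edit if the eval improves
            gains, best = trial, score

print(round(best, 2))
```

The point is that the work is massively parallel and individually trivial, which is exactly the profile current AI is good at.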
Yup, I put a high quality interpretability pipeline that the AI systems can use on themselves as one of the most likely things to be the proximal cause of game over.
(This was sitting in my drafts, but I’ll just comment it here because it makes a very similar point.)
There are two forms of “Recursive Self-Improvement” that people often conflate, but they have very different characteristics.
Introspective RSI: Much like a human, an AI will observe, understand, and modify its own cognitive processes. This ability is privileged: the AI can make these self-observations and self-modifications because the metacognition and mesocognition occur within the same entity. While performing cognitive tasks, the AI simultaneously performs the meta-cognitive task of improving its own cognition.
Extrospective RSI: AIs will automate various R&D tasks that humans currently perform to improve AI, using similar workflows that humans currently use. For example, studying literature, forming hypotheses, writing code, running experiments, analyzing data, drawing conclusions, and publishing results. The object-level cognition and meta-level cognition occur in different entities.
I wish people were more careful about the distinction, because they carelessly generalise cached opinions about the former to the latter. In particular, the former seems more dangerous: there is less opportunity to monitor the metacognition’s observation and modification of the mesocognition if these cognitions occur within the same entity, i.e. in the same activations and chain-of-thought.
[Figure: Introspective RSI (left) vs Extrospective RSI (right)]
I expect the line to blur between introspective and extrospective RSI. For example, you could imagine AIs trained for interp doing interp on themselves, directly interpreting their own activations/internals and then making modifications while running.
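A minimal toy sketch of that blurred case, assuming nothing about how a real system would do it: the same forward pass that performs the task also records its own activations, and a crude rule then edits the running model’s weights based on what was observed. The tiny model, the hook, and the “dead unit” rule are all invented for illustration.

```python
import torch
import torch.nn as nn

# Toy model standing in for the system doing the object-level task.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

captured = {}
def record(module, inputs, output):
    # Save the activations produced during the ordinary forward pass.
    captured["hidden"] = output.detach()

model[1].register_forward_hook(record)   # watch the post-ReLU activations

x = torch.randn(8, 16)
_ = model(x)                              # ordinary "object-level" forward pass

# "Introspective" step (purely illustrative): dampen output weights fed by
# hidden units that were almost never active on this batch.
dead = (captured["hidden"] > 0).float().mean(dim=0) < 0.05
with torch.no_grad():
    model[2].weight[:, dead] *= 0.5
```

Here the observation and the modification happen inside one entity and one run, which is what makes this variant harder to monitor from the outside.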
I also write about this at the very end; I do think we will eventually get RSI, though this might be relatively late.
I think Eliezer meant “self” in a hyper-specific sense here: not just improving a similar instance of itself or preparing new training data, but literally looking into the if statements and loops of its own code while it thinks about how best to upgrade that code. So in that sense I don’t know if Eliezer would approve of the term “Extrospective Recursive Self-Improvement”.
Does AI-automated AI R&D count as “Recursive Self-Improvement”? I’m not sure what Yudkowsky would say, but regardless, enough people would count it that I’m happy to concede some semantic territory. The best thing (imo) is just to distinguish them with an adjective.
I would probably say RSI is a special case of AI-automated R&D. What you are describing is another special case, where the AI only does these non-introspective forms of AI research. This non-introspective research could also be done between totally different models.
I analyzed the internal structure of RSI some time ago and concluded that it will not be as easy as it may seem, because of the need for secret self-testing. But on some levels it may be effective, like learning new principles of thinking. See Levels of AI Self-Improvement.
Right now, AI capability advances are driven by compute scaling, human ML research, and ML experiments. Transparency and direct modification of models do not have good returns to AI capabilities. What reasons are there to think transparency and direct modification would have better returns in the future?