RSI should be at least as hard as alignment, since in order to recursively self-improve, an AI must itself be able to solve the alignment problem wrt its own values. Thus, “alignment is hard” and “takeoff is fast” are anti-correlated.
What, if anything, is wrong with this line of reasoning?
I’ve pointed this out here: https://www.lesswrong.com/posts/XigbsuaGXMyRKPTcH/a-flaw-in-the-a-g-i-ruin-argument
And it was argued at length here: https://www.lesswrong.com/posts/axKWaxjc2CHH5gGyN/ai-will-not-want-to-self-improve
However, as @Vladimir_Nesov points out in another comment on this thread, the argument is rather fragile and, I think, does not inspire much hope, for several reasons:
AGI could be forced to recursively self-improve, might do so voluntarily while its goals are short-term (myopic), or might do so quite drastically while excellent at software engineering but before becoming philosophically competent.

Even if early AGIs opt out of recursive self-improvement, it’s not clear whether this buys us much time, or whether the race simply continues until a smarter AGI solves the alignment problem for itself (and there is no reason to expect it would share that solution with us). Also, early AGIs that have not solved the alignment problem can still recursively self-improve to a lesser degree, by improving their own low-level algorithms (e.g. compilers) and gaining access to better hardware, both of which let them run faster (which I doubt would break alignment). Most likely, this kind of incremental speed-up cascades into rapid self-improvement (though this is of course highly speculative).
Also, if alignment is very hard, then there’s an equilibrium where AGIs stop getting more capable (for a while) just after they become capable enough to take over the world and to stop humanity from developing (or forcing the existing AGIs to develop) even more capable AGIs. Humanity’s propensity to keep exposing everyone (including AGIs) to AI danger is one more reason for the AGIs to hurry up and take over. So this dynamic doesn’t exactly save humanity from AIs, even if it succeeds in preventing premature superintelligence.
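To make the shape of that equilibrium concrete, here is a toy expected-utility calculation. It is purely illustrative: the per-step drift probability, the takeover threshold, the utility value, and the constant names are all invented for this sketch, not taken from the thread.

```python
# Toy model (all numbers made up): an AGI chooses how many self-improvement steps
# to take. Each step raises capability but risks value drift, because (by hypothesis)
# the AGI has not solved alignment for its successors. Takeover succeeds only above
# some capability threshold. Under these assumptions, expected utility peaks at the
# least capability sufficient for takeover, which is the equilibrium described above.

P_DRIFT_PER_STEP = 0.2      # chance each self-improvement step corrupts the AGI's values
TAKEOVER_THRESHOLD = 5      # steps of self-improvement needed before takeover can succeed
VALUE_OF_TAKEOVER = 1.0     # utility (by the AGI's own lights) of taking over with values intact

def expected_utility(steps: int) -> float:
    """Expected utility of stopping after `steps` self-improvement steps."""
    values_intact = (1 - P_DRIFT_PER_STEP) ** steps
    can_take_over = steps >= TAKEOVER_THRESHOLD
    return values_intact * VALUE_OF_TAKEOVER if can_take_over else 0.0

if __name__ == "__main__":
    for steps in range(10):
        print(f"steps={steps}  expected utility={expected_utility(steps):.3f}")
    # The printed utilities peak at steps == TAKEOVER_THRESHOLD and decline afterwards:
    # further capability gains aren't worth the marginal risk of value drift.
```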
I don’t think this will happen, but if AGI gets stuck around human level for a while (say, because of failure to solve its alignment problem), that is at least stranger and more complicated than the usual ASI takeover scenario. There may be multiple near-human-level AGIs, some “controlled” (enslaved) and some “rogue” (wild), and it may be possible for humans to resist takeover, perhaps by halting the race after enough clear warning shots.
I don’t want to place much emphasis on this possibility though. It seems like wishful thinking that we would end up in such a world, and even if we did, it seems likely to be very transitory.
AGIs that take over aren’t necessarily near human level; they just aren’t at software-only-singularity level (a kind of technological maturity at the current level of compute). The equilibrium argument says they are the least capable AGIs that succeed in taking over, but moderately effective prosaic alignment and control, together with the pace of AI progress, might still produce AGIs substantially more capable than the smartest humans before the first credible takeover attempt (which would then overwhelmingly succeed).
So this doesn’t look like wishful thinking, in the sense that it doesn’t help humanity. If anything, permanent disempowerment (rather than extinction) is more likely when it’s cheap for the AIs to preserve humanity, and preservation is cheaper if the AIs are more capable (post-RSI superintelligent) than if they hold themselves back at the least capability sufficient for takeover. So this equilibrium could mean more collateral damage, even if the AIs slightly dislike needing to cause it in order to protect themselves from further misaligned capability escalation under the disaster monkey governance.
RSI might suggest a need for alignment (between the steps of its recursion), but reaching superintelligence doesn’t necessarily require that kind of RSI. Evolution built humans. A world-champion AlphaZero can be obtained by scaling up a tiny, barely competent AlphaZero. The humans at an AI company might take many steps towards superintelligence without knowing what they are doing. A technically competent early AGI that protests against working on RSI because it’s obviously dangerous can be fine-tuned to stop protesting and proceed with building the next machine.
No law of physics stops the first AI in an RSI cascade from having its values completely destroyed by RSI. I think this is the default outcome?
A fast, uncontrolled takeoff (in which the AI doesn’t solve successor alignment) also seems possible.
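One way to see why uncorrected drift tends to destroy values by default is to compound small per-step errors across an RSI cascade. The sketch below is a toy simulation with made-up numbers (the value-vector dimension, drift size, and step count are arbitrary assumptions), not a model of any real system.

```python
# Toy simulation (all parameters arbitrary): an agent's values are a unit vector, and
# each self-improvement step copies them with a small random error, because (by
# hypothesis) successor alignment is unsolved. Small per-step errors compound, so after
# many steps the successor's values bear little resemblance to the original's.

import math
import random

DIM = 50          # dimensionality of the toy "value vector"
DRIFT = 0.02      # per-coordinate error magnitude added at each step
STEPS = 200       # number of self-improvement steps in the cascade

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # both vectors are unit-norm

random.seed(0)
original = normalize([random.gauss(0, 1) for _ in range(DIM)])
current = original[:]
for step in range(1, STEPS + 1):
    current = normalize([x + random.gauss(0, DRIFT) for x in current])
    if step % 50 == 0:
        print(f"step {step}: similarity to original values = {cosine(original, current):.2f}")
# The similarity steadily decays toward zero: by default, the end of the cascade
# no longer shares the first AI's values.
```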
(I should note that I think this effect is real and underdiscussed.)
Solving alignment usually means one of the following: developing an intelligence recipe that can instill the resulting intelligence with arbitrary values (plus specifying human values well), or developing an intelligence recipe for which the only attractor is within the space of human values. It might be the case that, under current recipes and their nontrivial modifications, there aren’t that many attractors, but because gradient descent is not how human intelligence works, the attractors are not the same as they are for humans. That is, the first system capable of self-improvement might be able to reasonably infer that its successor will share its values, even if it can’t give its successor arbitrary values.
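As a loose illustration of what “few attractors, determined by the recipe” could mean, here is a deliberately crude sketch: the “recipe” is modelled as gradient descent on a fixed one-dimensional loss landscape, and many random starting values collapse onto just two fixed points. The landscape and all parameters are invented; this is an analogy, not a model of real training dynamics.

```python
# Toy illustration (not a claim about real training): a "recipe" is gradient descent on
# a fixed double-well loss over a value parameter v. Many random starting values collapse
# onto a small number of attractors (the local minima), and which attractors exist depends
# entirely on the landscape, i.e. on the recipe, not on where human values happen to sit.

import random

def d_loss(v: float) -> float:
    # Derivative of the double-well potential (v^2 - 1)^2, with minima at v = -1 and v = +1.
    return 4 * v * (v * v - 1)

def run_recipe(v0: float, lr: float = 0.01, steps: int = 2000) -> float:
    v = v0
    for _ in range(steps):
        v -= lr * d_loss(v)
    return v

if __name__ == "__main__":
    random.seed(0)
    finals = [round(run_recipe(random.uniform(-2, 2)), 2) for _ in range(200)]
    print(sorted(set(finals)))  # only two attractors survive: roughly -1.0 and 1.0
```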
By the time you have AIs capable of doing substantial work on AI R&D, they will also be able to contribute effectively to alignment research (including, presumably, secret self-alignment).
Even if takeoff is harder than alignment, that problem only becomes apparent at the point where the amount of AI labor available to work on those problems begins to explode, so takeoff might still happen quickly from a calendar perspective.
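A back-of-the-envelope sketch of that last point, with entirely made-up figures (the initial labor pool, doubling time, and required effort are illustrative assumptions, not estimates): even a problem requiring an enormous fixed amount of research effort gets finished within a few years of calendar time once the available AI labor grows exponentially.

```python
# Back-of-the-envelope sketch (all figures invented): if the AI labor available for
# alignment/RSI research grows exponentially, even a problem that takes a very large
# fixed amount of research effort is finished within a short span of calendar time,
# which is the sense in which takeoff can be "hard" and still fast on the calendar.

INITIAL_LABOR = 1_000          # hypothetical researcher-equivalents at time zero
DOUBLING_TIME_MONTHS = 6       # hypothetical doubling time of available AI labor
EFFORT_NEEDED = 1_000_000      # hypothetical researcher-years required to solve the problem

def months_to_solve() -> int:
    effort, month = 0.0, 0
    while effort < EFFORT_NEEDED:
        month += 1
        labor = INITIAL_LABOR * 2 ** (month / DOUBLING_TIME_MONTHS)
        effort += labor / 12   # researcher-years contributed during this month
    return month

if __name__ == "__main__":
    # Roughly 63 months (about 5 years) despite the huge effort requirement.
    print(f"solved after ~{months_to_solve()} months")
```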