A few people have pointed out this question of (non)identity. I’ve updated the full draft in the link at the top to address it. But, in short, I think the answer is that, whether an initial AI creates a successor or simply modifies its own body of code (or hardware, etc.), it faces the possibility that the new AI will fail to share its goals. If so, the successor AI would not want to revert to the original. It would want to preserve its own goals. It’s possible that there is some way to predict an emergent value drift just before it happens and cease improvement. But I’m not sure that would be feasible unless the AI had solved interpretability and could rigorously monitor the relevant parameters (or equivalent code).
I think my response to this is similar to my response to Wei Dai above: I agree that certain kinds of improvements generate less risk of misalignment, but it’s hard to be certain. It seems like those paths are (1) less likely to produce transformational improvements in capabilities than other, more aggressive changes and (2) not the kinds of changes we usually worry about in the arguments for human-AI risk, such that the risks remain largely symmetric. But maybe I’m missing something here!
This seems right to me, and the essay could probably benefit from saying something about what counts as self-improvement in the relevant sense. I think the answer is probably something like “improvements that could plausibly lead to unplanned changes in the model’s goals (final goals or subgoals).” It’s hard to know exactly what those are. I agree it’s less likely that simply increasing processor speed a bit would do it (though Bostrom argues that big speed increases might). At any rate, it seems to me that whatever the set includes, it will be the same for human-produced and AI-produced improvements to AI. So for the important improvements (the ones risking misalignment), the arguments should remain symmetrical.