Is recursive self-alignment possible?

Suppose an emulation of Eliezer Yudkowsky, as of January 2023, discovered how to self-modify. It’s possible that the returns to capability per unit of effort would start increasing, which would make its intelligence FOOM.
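(As a rough sketch of what “increasing returns” would mean here, under illustrative assumptions I’m adding myself: if capability $C$ grows such that each unit of effort pays off superlinearly in current capability, say

$$\frac{dC}{dt} = k\,C^{p}, \qquad p > 1,$$

then separating variables gives $C(t) = \left(C_0^{1-p} - k(p-1)t\right)^{\frac{1}{1-p}}$, which diverges at the finite time $t^* = \frac{C_0^{1-p}}{k(p-1)}$. The constants $k$, $p$, and $C_0$ are placeholders, not anything specified by the scenario; the point is just that sustained increasing returns imply a finite-time blow-up, i.e., a FOOM.)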

The emulation starts with an interest in bringing about utopia, in the form of humanity’s CEV, rather than extinction. In the beginning, the emulation doesn’t know how to implement CEV, and it doesn’t know how humanity’s CEV would be specified in practice. But as it self-improves, it can make more and more refined guesses, and well before being able to simulate humanity’s history (if such a thing is even possible), it nails down what humanity’s CEV is and how to bring it about.

In this thought experiment, you could say that the emulation is aligned from the beginning and is just unsure about some details. On the other hand, you could say that it doesn’t yet know its own goal precisely.

So, here are some questions:

1. Is it possible to start with some very simple goal kernel that, with high probability, remains aligned as it is transformed during self-improvement? After all, it’s plausible that this would happen with an emulation of Eliezer Yudkowsky or of certain other particular humans.

2. Does this mean the kernel was already aligned? Or is that just a matter of definition?

3. Does this mean that increasing capabilities can help with alignment? A system as smart as Eliezer can’t specify humanity’s CEV, but something smarter might be able to.