I agree that optimization pressure can destroy leaky abstractions: that's Goodhart's Law. Under value learning, though, the optimization pressure applies to both sides of the Goodhart problem: it improves the utility function as well as applying it, so it can also identify the leak and patch the abstraction. The question then becomes how well the (possibly super-) intelligence can manage that dynamic, iterated process: does value learning converge to alignment and stay stable faster than the AI or its successors can do drastic harm from the remaining partial misalignment?
What I find promising is that, for any valid argument, problem, or objection we can come up with, there's no a priori reason the AI couldn't also grasp it and attempt to avoid or correct the problem, as long as its capabilities were sufficient and its current near-alignment was good enough that it wanted to do so. So it looks rather clear to me that there is a region of convergence to full alignment from partial alignment: a basin of attraction around alignment. The questions then become how large that region is, whether we can construct a first iteration that lands inside it, and what the process converges to as the AI's intelligence increases and human society evolves.
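To make the basin picture concrete, here's a minimal toy sketch. Everything in it is my assumption for illustration, not a claim about any real system: alignment is compressed to a single number `a` in [0, 1], the `threshold`, `learn_rate`, and `drift` parameters are invented, and the update rule just encodes the qualitative story above (self-correction dominates when the AI is aligned enough to want to correct itself; Goodhart-style drift dominates otherwise).

```python
# Toy basin-of-attraction model for iterated value learning.
# Purely illustrative: state, rates, and threshold are all assumed.

def step(a: float, learn_rate: float = 0.2, drift: float = 0.05,
         threshold: float = 0.5) -> float:
    """One round of value learning plus optimization.

    At or above `threshold`, the AI wants to (and can) correct its own
    value model, so `a` moves toward 1. Below it, Goodhart-style
    optimization pressure erodes alignment toward 0.
    """
    if a >= threshold:
        return min(1.0, a + learn_rate * (1.0 - a))  # self-correction
    return max(0.0, a - drift * a)                   # Goodhart drift

def converges(a0: float, rounds: int = 100, eps: float = 1e-3) -> bool:
    """Does an initial partial alignment a0 end up (near) fully aligned?"""
    a = a0
    for _ in range(rounds):
        a = step(a)
    return a > 1.0 - eps

# In this toy model, the basin of convergence is exactly [threshold, 1]:
for a0 in (0.3, 0.49, 0.5, 0.7, 0.9):
    print(f"a0={a0:.2f} -> converges to alignment: {converges(a0)}")
```

The open questions map directly onto the toy parameters: how large the basin is (the threshold), whether the first iteration starts inside it (a0), and how the dynamics shift as capabilities grow (the rates). None of those values are knowable from the armchair; the sketch only shows that "converges from partial alignment" is a coherent dynamical claim, not that it holds.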