Superintelligences apply a lot of optimization pressure, and pointing that optimization pressure in almost the right direction is not good enough.
What’s your opinion of value learning? If the intelligence (unlike AIXI) understands that its current utility function is imperfect and should be improved, then it can intelligently trade off improving the utility function against optimizing the current version of it, taking Goodhart’s Law and extrapolation outside the currently known distribution into account. Then we have a dynamic situation, and we’re interested in what the utility function converges to under this optimization.
I meant what I said at a higher level of abstraction—optimization pressure may destroy leaky abstractions. I don’t think value learning immediately solves this.
I agree that optimization pressure can destroy leaky abstractions: that’s Goodhart’s Law. Value learning means that the optimization pressure applies on both sides of the Goodhart problem: it improves the utility function as well as applying it, so the same pressure can also identify the leak and improve the abstraction. The question then becomes how well the (possibly super-) intelligence can manage that dynamic, iterated process: does the value learning process converge to alignment and stay stable faster than the AI or its successors can do drastic harm due to partial misalignment?
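As a purely illustrative sketch of that race (every name, rate, and the quadratic utility below are my own invented assumptions, not anything established in this thread): the agent pushes its policy toward the optimum of a proxy utility, while value learning simultaneously pulls the proxy toward the true values, and we can watch whether the proxy error shrinks before the policy locks onto a misaligned target.

```python
# Toy model only: w_true, w_proxy, lr_act, lr_value are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 10
w_true = rng.normal(size=d)                   # the (unknown) true values
w_proxy = w_true + 0.8 * rng.normal(size=d)   # a partially misaligned initial proxy
x = np.zeros(d)                               # the agent's policy/state

lr_act, lr_value = 0.1, 0.05                  # how the optimization pressure is split

def true_utility(x):
    return w_true @ x - 0.5 * x @ x           # concave, maximized at x = w_true

for t in range(300):
    # Pressure applied to the world: gradient ascent step on the *proxy* utility.
    x += lr_act * (w_proxy - x)
    # Pressure applied to the utility itself: noisy value learning,
    # e.g. feedback samples scattered around the true values.
    feedback = w_true + 0.3 * rng.normal(size=d)
    w_proxy += lr_value * (feedback - w_proxy)

print("final proxy error:", np.linalg.norm(w_proxy - w_true))
print("true-utility regret:", 0.5 * w_true @ w_true - true_utility(x))
```

In this toy setup, whether the outcome looks like convergence or like Goodhart failure depends entirely on the relative learning rates and the initial misalignment, which is exactly the stability question.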
What I find promising is that, for any valid argument, problem, or objection we can come up with, there’s no a priori reason why the AI couldn’t also grasp it and attempt to avoid or correct the problem, as long as its capabilities were sufficient and its current near-alignment was good enough that it wanted to do so. So it looks rather clear to me that there is a region of convergence from partial alignment to full alignment. The questions then become how large that region is, whether we can construct a first iteration that lies inside it, and what the process converges to as the AI’s intelligence increases and human society evolves.
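To make “how large that region is” concrete in the same toy model (again, all thresholds and rates below are arbitrary assumptions for illustration), one can sweep the initial misalignment and check whether each run reaches a small proxy error before the Goodhart gap ever crosses some “drastic harm” threshold:

```python
# Same invented toy model, wrapped so the initial misalignment can be swept.
import numpy as np

def run(init_error, harm_threshold=3.0, steps=300, d=10, seed=0):
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    w_proxy = w_true + init_error * rng.normal(size=d) / np.sqrt(d)
    x = np.zeros(d)
    for _ in range(steps):
        x += 0.1 * (w_proxy - x)                                         # optimize the proxy
        w_proxy += 0.05 * (w_true + 0.3 * rng.normal(size=d) - w_proxy)  # value learning
        goodhart_gap = (w_proxy - w_true) @ x   # proxy utility minus true utility of x
        if goodhart_gap > harm_threshold:
            return "drastic harm before convergence"
    aligned = np.linalg.norm(w_proxy - w_true) < 0.5
    return "converged to alignment" if aligned else "still misaligned"

for e in [0.5, 1.0, 2.0, 4.0, 8.0]:
    print(f"initial misalignment {e}: {run(e)}")
```

The set of initial errors (and learning-rate splits) that end in “converged to alignment” is the toy analogue of the basin of convergence; the open questions above are whether the real-world analogue of that basin is large, whether our first attempt lands inside it, and where the process ends up as capabilities grow.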