I think your argument does show that ‘safely aligning’ an AI requires significant engagement with human values. But I’m not convinced that it requires ‘learning human values’ well enough to successfully optimize the world.
In particular, I think it might be easier to recognize when effects are morally neutral than to recognize when they’re improvements. Or at least, I don’t think the argument here convincingly shows otherwise.
My thought is that when deciding whether to take a morally neutral act with tradeoffs, the AI needs to be able to weigh the positives against the negatives to reach a reasonably acceptable tradeoff, and hence needs to know both the positive and negative human values involved.