Ok, so a few things to say here.
First, I think the overall conclusion is right in a narrow sense: we can’t solve AI alignment, in that we can’t develop an airtight proof that one agent’s values are aligned with another’s, due to the epistemic gap between the two agents’ experiences, which is where their values are calculated. Put another way, values are insufficiently observable for us to say whether two agents are aligned. See “Formally stating AI alignment” and “Outline of metarationality” for the models that lead me to this conclusion.
Second, as a consequence of what I’ve just described, I think any impossibility resulting from the irreconcilability of human preferences is irrelevant, because the impossibility of perfect alignment is already dominated by the above claim. I think you are right that we can’t reconcile human preferences so long as they are irrational, and attempts to do so are likely to result in repugnant outcomes if we hand the reconciled utility function to a maximizer. It’s just less of a concern for alignment, because we never reach the point where failing to solve preference reconciliation is what sinks us (although we could generate our own failure by trying to “solve” preference reconciliation without having solved alignment).
Third, I appreciate the attempt to put this in a fictional story that might be more widely read than technical material. My expectation, though, is that right now most of the people you need to convince of this point are more likely to engage with it through technical material than through fiction. I may be wrong about that, so I appreciate the diversity of presentation styles.