I like to consider humanity-AI alignment in light of brain-brain alignment. If the purpose of alignment is self-preservation at the simple scale and fulfilment of individual desires at the complex scale, then brain-brain alignment hasn’t fared well. While we as a species are still around, our track record is severely blemished.
Another scale of alignment to consider is the alignment of a single brain with itself. The brain given to us by natural selection is not perfect, despite being in near-instantaneous communication with itself (as opposed to the limited communication bandwidth between humans). Being a human, you should be familiar with the struggle of aligning the numerous working parts of your brain on a moment-by-moment basis. While we as a species are still around, the rate of failure among humans for preservation and attainment of desire is awfully high (suicide, self-sabotage, etc.).
In light of this, I do find the idea of designing an intelligent agent that does-what-I-mean-not-what-I-say very strange. Where the goal is self-preservation and attainment of desire for both parties, nothing suggests to me that one human can, firstly, decide very well what they mean or, secondly, express what they have decided they mean, through verbal or written communication, well enough to align even a fellow human (with a high success rate).
I am not suggesting that aligning a generally intelligent agent is impossible, just that at a brief glance it would appear more difficult than aligning two human brains or a single brain with itself. I am also not suggesting that this applies to agents that cannot set their own intention or are designed to have their intention modified by human input. I really have no intuition at all about agents that range between AlphaGo Zero and whatever comes just before humans in their capacity to generalise.
At this philosophical glance, to align one generally intelligent artificial entity with all of humanity’s values and desires seems very unlikely. True alignment could only come from an intelligent entity with bandwidth and architecture greater than that of the human brain, and that would still be an alignment with itself.
For me this intuition leads to the conclusion that the crux of the alignment problem is the poor architecture of the human brain and our bandwidth constraints, for even at the easiest point of alignment (single-brain alignment) we see consistent failure. It would seem to me that alignment with artificial entities whose capacity to generalise is at all comparable to that of humans should be forestalled till we can transition ourselves to a highly manipulable non-biological medium (with greater architecture and bandwidth than the human brain).