I’d suggest that long-term corrigibility is a still easier target. If respecting future sentients’ preferences is the goal, why not make that the alignment target?
While boundaries are a coherent idea, imposing them in our alignment solutions would seem to very much be dictating the future rather than letting it unfold with protection from benevolent ASI.
In an easy world, boundaries are neutral, because you can set up corrigibility on the other side to eventually get aligned optimization there. The utility of boundaries is for worlds where we get values alignment or corrigibility wrong, and most of the universe eventually gets optimized in an at least somewhat misaligned way.
The concern about slight misalignment also makes personal boundaries in this sense an important thing to set up first, before any meaningful optimization changes people, since people differ from one another and the initial optimization pressure might be less than maximally nuanced.
So boundaries are complementary, and I suspect they're a shard of human values that's significantly easier to instill in this different-than-values role than either the whole of human values or corrigibility toward it.