It’s probably implied by CEV. The point is that you don’t need the whole of CEV to get it: it’s probably easier to get, being a simpler concept and a larger alignment target, and it might be sufficient to at least notkilleveryone, even if in the end we lose most of the universe. You also keep the opportunity to work on CEV and eventually get there, even with many OOMs fewer resources to work with. It would of course be better to get CEV before building ASIs with different values, or before going on a long value drift trip ourselves.
I’d suggest that long-term corrigibility is a still easier target. If respecting future sentients’ preferences is the goal, why not make that the alignment target?
While boundaries are a coherent idea, baking them into our alignment solutions seems very much like dictating the future rather than letting it unfold under the protection of a benevolent ASI.
In an easy world, boundaries are neutral, because you can set up corrigibility on the other side of them and eventually get aligned optimization there. The utility of boundaries is for worlds where we get values alignment or corrigibility wrong, and most of the universe eventually gets optimized in an at least somewhat misaligned way.
Concern about slight misalignment also makes personal boundaries in this sense an important thing to set up first, before any meaningful optimization changes people: people differ from one another, and the initial optimization pressure might be less than maximally nuanced.
So it’s complementary, and I suspect it’s a shard of human values that’s significantly easier to instill in this different-than-values role than either the whole of human values or corrigibility towards them.