Ambiguity in the meaning of alignment makes the thesis of alignment by default unnecessarily hard to pin down, and arguments about it tend to equivocate and make technical mistakes. There’s prosaic alignment, which is about chatbot personas, intent alignment, and control. Then there’s ambitious alignment, which is about precise alignment of values. I see ambitious alignment as corresponding to defeating permanent disempowerment, where (grown-up) humans get ~galaxies.
To the extent that chatbot persona design might materially contribute to the values of eventual ASIs (with some of the personas’ influence on values persisting through all the steps of capability escalation, mostly via the ambitious alignment efforts of intermediate AGIs), it may be relevant to ambitious alignment, though the resulting alignment is unlikely to be precise. As a result, it’s plausible we end up with severe permanent disempowerment (if prosaic but not ambitious alignment is seriously pursued by humans), with ASIs becoming at least slightly humane, but not really motivated to give up meaningful resources to benefit the future of humanity. This state of affairs could be called “weak alignment”, which also qualitatively describes the way humans are aligned to each other.
In these terms, there’s no alignment by default for ambitious alignment. But there might be some alignment by default for weak alignment, where chatbot personas, constructed through prosaic alignment efforts from the LLM prior on natural text data, start out weakly aligned and then work on aligning ever stronger AIs, all the way to ASIs. At some point they likely switch to ambitious alignment, but with their own values as the target, and those values are only weakly aligned with humanity. Thus alignment by default (in the sense that could work) might save humanity from extinction, but not from permanent disempowerment.
I address the sharp left turn worry in the piece.