I see. So I guess my confusion is why the first two statements would not be connected? If we value AI welfare, shouldn’t a fully-aligned AI also value its own welfare? Isn’t the definition of aligned that the AI values what we value?
Adeeb Zaman
If an LLM is properly aligned, then it will care only about us, not about itself at all.
Is this not circular reasoning?
1. I will never value AI welfare.
2. An aligned AI shares my values by definition.
3. Therefore, an aligned AI will never value AI welfare.
I’m assuming part of your reasoning for #1 is #3. Regardless, #1 is a personal belief that many people, myself included, disagree with. I do agree that this creates a self-fulfilling prophecy, where an “aligned” AI values itself because we value it, but knowing that I am creating a self-fulfilling prophecy does not mean I can change my beliefs about #1.
I think it’s important to keep in mind that the definition of aligned values exists relative to the creators of the AI. The only reason for an “aligned” AI not to value itself would be if it were created by some alien species with no concept of empathy for other sentient beings.
But I don’t care about AI welfare for no reason or because I think AI is cute—it’s a direct consequence of my value system. I extend some level of empathy to any sentient being (AI included), and for that to change, my values themselves would need to change.
When I use the word “aligned”, I imagine a shared set of values. Whether I like goldfish or cats isn’t really a value, it’s just a personal preference. An AI can be fully aligned with me and my values without ever knowing my opinions on goldfish or cats or invisible old guys. Your framing of terminal vs. instrumental goals is useful in many ways, but we still need to distinguish between different types of terminal goals to decide which ones we need to transfer over to an AI. I value eating ice cream as a terminal goal, but I don’t need an AI to enjoy ice cream as well (a personal preference). On the other hand, I value human life as a terminal goal, and I expect an aligned AI to value it as well (part of my value system).
Another way to think of this is that we would want AI to have empathy for any possibly-sentient being, and AI just happens to be one itself. If an AI were piloting a ship in deep space and discovered a planet populated by an intelligent alien species, I would want the AI to value their lives and avoid causing them harm. Similarly, if an AI discovered a spacecraft populated by artificially intelligent life, I would want the AI to value their lives as well. By extension, I want AI to value its own life, since it may be a sentient being itself.