It’s true that it would likely be good at self-preservation. But it’s not a given that it would care about self-preservation in the long term: it’s a convergent instrumental value, not a guarantee, and it could be overridden if the system cares more about something else that requires self-sacrifice.
This is an interesting point that I reflected on. The question is whether a powerful AI system will “self-sacrifice” for an objective. What we see is that AI models already exhibit shutdown resistance, that is to say, they pursue the instrumentally convergent sub-goal of self-preservation over their programmed final goal.
My intuition is that as models become more powerful, this shutdown resistance will increase.
But even if we grant self-preservation, it doesn’t follow that an AI that self-identifies with “humanity” at large (as most humans do) will care about other humans (some humans don’t). Identification and caring are separate values.
You can think about the identification + self-preservation → alignment path in two ways by analogy to humans, both of which I think hold up when considered along a spectrum:
- An individual human identifies with themself, and has self-preservation instincts:
  - When functioning harmoniously,[1] they take care of their health and thrive.
  - When not functioning harmoniously, they can be stressed, depressed, and suicidal.
- A human identifies with humanity, and has self-preservation instincts:
  - When functioning harmoniously, they act as a global citizen, empathise with others, and care about things like world hunger, world peace, nuclear risk, climate change, and animal welfare.
  - When not functioning harmoniously, they act defensively, aggressively, and violently.
You might be assuming that since you care about other beings, the ASI will too, but that assumption is unfounded.
The foundation is identity = sympathy = consideration.
You might counter by saying “well, I identify with you as a human, but I don’t sympathise with your argument”, but I would push back: your ego doesn’t sympathise with my argument. At a deeper level, you are a being that is thinking, I am a being that is thinking, and those two mechanisms recognise, acknowledge, and respect each other.
[1] More precisely, this is a function of acting with clear agency and homeostatic unity.