I think it’s plausible that an AI with this kind of caring could exist, but actually getting that AI, rather than one that doesn’t care at all, seems very unlikely.
IMO “respect the actual preferences of existing intelligent agents” is a very narrow target in mind-space. I.e. if we had any reason to believe the AI has a decent chance of being this kind of mind, the alignment problem would be 90% solved. The hard part is going from “AI that kills everyone” to “AI that doesn’t kill everyone”. Once you’re there, getting to “AI that benefits humanity, or at least leaves for another star system” is comparatively trivial.
My claim is that you’re reasonably likely to get a small preference for something like this by default even in the absence of serious effort to ensure alignment beyond what is commercially expedient. For instance, note that humans put some weight on “respect the actual preferences of existing intelligent agents” despite this being a “narrow target in mind-space”.
The analogy to humans feels like generalizing from one example to me. My prior is that minds evolved under different circumstances will have different desires, so we shouldn’t expect an AI to robustly share any specific human value unless we can explain exactly how it develops that value.
But that aside, would you agree that if this were true, alignment would be fairly easy, because we would just need to amplify the degree of caring?
IMO “respect the actual preferences of existing intelligent agents” is a very narrow target in mind-space
The mindspace of the Orthogonality Thesis is a set of possibilities. The random-potshot version of the OT argument has it that there is an even chance of hitting any mind, and therefore a high chance of hitting an eldritch, alien mind. But equiprobability is only one way of turning possibilities into probabilities, and not a particularly realistic one. A random potshot isn’t analogous to the probability density you get from the action of deliberately building a certain type of AI; we aren’t picking blindly, without knowing anything about what we’re building.
While many of the minds in mindspace are indeed weird and unfriendly to humans, that does not make it likely that the AIs we actually construct will be. We are deliberately seeking to build duplicates of our minds, for one thing, and we have certain limitations, for another.
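As a rough sketch of the underlying point about measures (the notation here is purely illustrative, not anything either side has committed to): let $\mathcal{M}$ be mindspace, let $F \subset \mathcal{M}$ be the narrow set of minds that respect existing agents’ preferences, and let $p(m \mid T)$ be the distribution over minds induced by a particular construction or training process $T$. Then
\[
P_{\text{potshot}}(F) \;=\; \mu_{\text{uniform}}(F) \;\approx\; 0,
\qquad
P_{\text{actual}}(F) \;=\; \int_{\mathcal{M}} \mathbf{1}_F(m)\, p(m \mid T)\, dm .
\]
The narrowness of $F$ fixes the first quantity but not the second; the second depends on how much probability mass $T$ concentrates near $F$, which is an empirical question about the construction process rather than about the size of mindspace.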
We are deliberately seeking to build duplicates of our minds
I think “deliberately seeking to build” is the wrong way to frame the current paradigm—we’re growing the AIs through a process we don’t fully understand, while trying to steer the external behaviour in the hopes that this corresponds to desirable mind structures.
If we were actually building the AIs, I would be much more optimistic about them coming out friendly.
Not fully understanding things is the default … even non-AI software can’t be fully understood if it is complex enough. We already know how to probe systems we don’t understand a priori, through scientific experimentation. You don’t have to get alignment right the first time, at least not without the foom/RRSI or incorrigibility assumptions.
The difference with normal software is that at least somebody understands every individual part, and if you collected all those somebodies and locked them in a room for a while, they could write up a full explanation. Whereas with AI, I think we’re not even 10% of the way to full understanding.
Also, if you’re trying to align a superintelligence, you do have to get it right on the first try, otherwise it kills you with no counterplay.
That has not been demonstrated.
(“Gestures towards IABIED”)
(“Gestures towards critiques thereof”)