My claim is that you’re reasonably likely to get a small preference for something like this by default even in the absence of serious effort to ensure alignment beyond what is commercially expedient. For instance, note that humans put some weight on “respect the actual preferences of existing intelligent agents” despite this being a “narrow target in mind-space”.
The analogy to humans feels like generalizing from one example to me. My prior is that minds evolved under different circumstances will have different desires, so we shouldn't expect an AI to robustly share any specific human value unless we can explain exactly how it develops that value.
But that aside, would you agree that, if this were true, alignment would be fairly easy, since we would just need to amplify the degree of caring?