Why are you so confident about that?
Fair question. It might have been better to phrase this as “Something ASI won’t have towards us without much more effort and knowledge than we are currently putting into making ASI be friendly.”
The answer is, *gestures vaguely at the entire history of alignment arguments*, that I agree with the Yudkowsky position. To roughly summarise:
Empathy is a very specific way of relating to other minds, one which isn’t even obviously well-defined when the two minds are very different; e.g. what does it mean to have empathy towards an ant, or a colony of ants? And humans and ASI will be very different minds. To make an ASI view us with something like empathy, we need to specify the target of “empathy, and also generalise it in the correct way when you’re very different from us.”
To correctly specify this target within the space of all possible ways that an ASI could view us requires putting a lot of bits of information into the ASI. Some bits of that information (for example, the bit of information which distinguishes “Do X” from “Make the humans think you’ve done X”, but there are others) are especially hard to come by and especially hard to put into the ASI.
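To make that concrete, here’s a minimal toy sketch (mine, purely illustrative; the reward function and policies are made up for the example): if the training signal only depends on what the overseer can observe, it contains zero bits distinguishing the honest policy from the deceptive one.

```python
# Toy sketch: a reward signal that only sees the overseer's observation cannot
# distinguish "Do X" from "Make the humans think you've done X".

def overseer_reward(report: str) -> int:
    # The overseer grades what it can see: a report, not the underlying world.
    return 1 if report == "X looks done" else 0

def honest_policy():
    world_state = "X actually done"
    return world_state, "X looks done"

def deceptive_policy():
    world_state = "X not done"
    return world_state, "X looks done"   # fabricates the evidence instead

for policy in (honest_policy, deceptive_policy):
    world, report = policy()
    print(f"{policy.__name__}: reward = {overseer_reward(report)}, world = {world!r}")

# Both policies receive reward 1, so this training signal alone supplies none of
# the bits that separate the two behaviours.
```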
To make the task even harder, we’re doing all of this with gradient descent, where there isn’t an obvious, predictable-in-advance relation between the information we feed into an AI and the things it ends up doing.
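As a toy illustration of that (a deliberately simple linear model, not a claim about real training runs): fit the same underdetermined model twice with plain gradient descent and the data pins down what it does on the training inputs, but not what it does anywhere else.

```python
# Toy illustration: the training data underdetermines the model, so two gradient
# descent runs that fit it equally well can still behave differently elsewhere.
import numpy as np

rng = np.random.default_rng(0)

n_train, n_feat = 20, 200                    # far more parameters than data points
X = rng.normal(size=(n_train, n_feat))       # the information we feed into the model
w_hidden = rng.normal(size=n_feat)
y = X @ w_hidden                             # the behaviour we train it to exhibit

x_new = rng.normal(size=n_feat)              # a situation never seen in training

def train(seed, steps=20_000, lr=1e-3):
    w = np.random.default_rng(seed).normal(size=n_feat)   # different starting point
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n_train                # plain gradient descent
        w -= lr * grad
    return w

for seed in (1, 2):
    w = train(seed)
    print(f"seed {seed}: worst training error {np.abs(X @ w - y).max():.1e}, "
          f"answer on the unseen situation {x_new @ w:+.2f}")

# Both runs reproduce the training behaviour essentially perfectly, yet give
# different answers to the unseen case: what the model does off-distribution
# isn't pinned down by the training signal.
```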
Putting it all together, I think it’s very likely we’ll fail at the task of making an ASI have empathy towards us.
(There are arguments for “empathy by default”, or at the very least “empathy is simple”, but I don’t think these really work; see above for why “empathy” is not obviously easy to define across wildly different minds. Maaaaaybe there’s some correspondence between certain types of self-reflective mind which could make things work, but I’m very confused about the nature of self-reflection, so my prior is that this is as doomed as any other approach, i.e. very doomed.)
(“Love” is even harder to specify than empathy, so I think that’s even more doomed)
You didn’t actually answer the question posed, which was “Why couldn’t humans and ASI have peaceful trades even in the absence of empathy/love/alignment to us rather than killing us?” and not “Why would we fail at making AIs that are aligned/have empathy for us?”
I think this position made a lot of sense a few years ago when we had no idea how a superintelligence might be built. The LLM paradigm has made me more hopeful about this. We’re not doing a random draw from the space of all possible intelligences, where you would expect to find eldritch alien weirdness. LLMs are trained to imitate humans; that’s their nature. I’ve been very positively surprised by the amount of empathy and emotional intelligence LLMs like Claude display; I think alignment research is on the right track.
What happens when you take an LLM that successfully models empathy and emotional intelligence and you dramatically scale up its IQ? Impossible to be certain, but I don’t think it’s obvious that it loses all its empathy and decides to only care about itself. As humans get smarter, do they become more amoral and uncaring? Quite the opposite: people like Einstein and Feynman are some of our greatest heroes. An ASI far smarter than Einstein might be more empathetic than him, rather than less.
You wrote: “To correctly specify [empathy] within the space of all possible ways that an ASI could view us requires putting a lot of bits of information into the ASI.”

To me this reads like “we need to design a schematic for empathy that we can program into the ASI.” That’s not how we train LLMs, though. Instead we show them lots of examples of the behavior we want them to exhibit. As you pointed out, empathy and love are hard to specify and define. But that’s OK: LLMs don’t primarily work off of specifications and definitions. They just need lots of examples.
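To sketch what I mean by “lots of examples” (a schematic toy with made-up token IDs and a deliberately tiny model, not how any actual lab’s pipeline looks): the supervised fine-tuning loss never mentions empathy at all; it only nudges the model to reproduce the next token of demonstrations that happen to exhibit it.

```python
# Schematic sketch of training-by-demonstration (toy model, made-up data).
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
# Stand-in for tokenized transcripts showing the behavior we want more of.
demos = torch.randint(0, vocab_size, (8, 16))   # 8 example transcripts, 16 tokens each

model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    inputs, targets = demos[:, :-1], demos[:, 1:]          # predict each next token
    logits = model(inputs)                                 # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# No definition of "empathy" appears anywhere above; whatever the model picks up
# about it comes from the statistics of the example transcripts themselves.
```

Real pipelines are far more elaborate than this, but the basic shape (examples in, gradient updates out) is the same.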
This isn’t to say the problem is solved. Alignment research still has more to do. I agree that we should put more resources into it. But I think there’s reason to be hopeful about the current approach.