“Subgoal” content has desirability strictly contingent on predicted outcomes. “Child
goals” derive desirability from “parent goals”; if state A is desirable (or undesirable),
and state B is predicted to lead to state A, then B will inherit some desirability (or
undesirability) from A. B’s desirability will be contingent on the continued desirability
of A and on the continued expectation that B will lead to A.
“Supergoal” content is the wellspring of desirability within the goal system. The
distinction is roughly the distinction between “means” and “ends.”
Within a Friendly AI, Friendliness is the sole top-level supergoal. Other behaviors,
such as “self-improvement,” are subgoals; they derive their desirability from the desirability
of Friendliness. For example, self-improvement is predicted to lead to a more
effective future AI, which, if the future AI is Friendly, is predicted to lead to greater
fulfillment of the Friendliness supergoal.
Friendliness does not overrule other goals; rather, other goals’ desirabilities are derived
from Friendliness. Such a goal system might be called a cleanly Friendly or purely Friendly
goal system.
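As a concrete illustration (not from the original text; the class, goal names, and probabilities below are all invented for the sketch), such a causal goal system can be modelled as a graph in which only the supergoal carries intrinsic desirability, and every other goal derives desirability from predicted causal links:

```python
# Minimal sketch of a causal goal system, assuming an acyclic goal graph.
# Only the supergoal has intrinsic desirability; every other goal derives
# desirability from predicted links to its parents. All names and numbers
# are hypothetical illustrations, not an actual FAI design.

class Goal:
    def __init__(self, name, intrinsic_desirability=0.0):
        self.name = name
        self.intrinsic_desirability = intrinsic_desirability
        # (parent goal, predicted probability that achieving this goal
        # leads to the parent); revising a prediction to zero makes the
        # inherited desirability vanish with it.
        self.predicted_outcomes = []

    def desirability(self):
        # Derived desirability: intrinsic value, plus desirability
        # inherited from each predicted parent, weighted by the
        # strength of the prediction.
        return self.intrinsic_desirability + sum(
            prob * parent.desirability()
            for parent, prob in self.predicted_outcomes
        )

friendliness = Goal("Friendliness", intrinsic_desirability=1.0)
self_improvement = Goal("self-improvement")
# Self-improvement is predicted to lead to a more effective future AI,
# and hence to greater fulfillment of the Friendliness supergoal.
self_improvement.predicted_outcomes.append((friendliness, 0.9))

print(self_improvement.desirability())  # 0.9, contingent on the prediction
```

Note that this sketch naively chains local predictions, which is exactly the failure mode the next passage warns about: inheritance must track whole causal chains, not just adjacent links.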
Sometimes, most instances of C lead to B, and most instances
of B lead to A, but no instances of C lead to A. In this case, a smart reasoning
system will not predict (or will swiftly correct the failed prediction) that “C normally
leads to A.”
If C normally leads to B, and B normally leads to A, but C never leads
to A, then B has normally-leads-to-A-ness, but C does not inherit normally-leads-to-
A-ness. Thus, B will inherit desirability from A, but C will not inherit desirability from
B. In a causal goal system, the quantity called desirability means leads-to-supergoal-ness.
A “goal” which does not lead to Friendliness will not be overruled by
the greater desirability of Friendliness; rather, such a “goal” will simply not be perceived
as “desirable” to begin with. It will not have leads-to-supergoal-ness.
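Continuing the same hypothetical sketch: naive chaining would wrongly give C a desirability of 0.9 × 0.9 via B. Since desirability means leads-to-supergoal-ness, the reasoning system should instead consult its world model for the whole-chain prediction (the dictionary below is an invented stand-in for real predictive inference):

```python
# "C normally leads to B, B normally leads to A, but C never leads to A."
# Desirability = leads-to-supergoal-ness, so query the world model for
# whole-chain predictions rather than multiplying local links. The
# dictionary is a hypothetical stand-in for a real predictive model.

def leads_to_supergoal(world_model, action, supergoal):
    # Predicted probability that the action ultimately leads to the
    # supergoal; an unknown chain defaults to zero, not to a chained
    # product of local links.
    return world_model.get((action, supergoal), 0.0)

world_model = {
    ("C", "B"): 0.9,  # most instances of C lead to B
    ("B", "A"): 0.9,  # most instances of B lead to A
    ("C", "A"): 0.0,  # but no instance of C leads to A
}

print(leads_to_supergoal(world_model, "B", "A"))  # 0.9: B inherits from A
print(leads_to_supergoal(world_model, "C", "A"))  # 0.0: C is never "desirable"
```

So B inherits desirability from A, but C inherits nothing, matching the corrected prediction a smart reasoning system would make.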
But what if there are advantages to not making “Friendliness” the supergoal? What if making the supergoal something else, from which Friendliness derives importance under most circumstances, is a better approach? Not “safer”. “Better”.
Something like “be a good galactic citizen”, where that translates to being a utilitarian who wants to benefit all species (both AI species and organics), with a strong emphasis on qualities such as valuing the preservation of diversity, and gratitude towards parental species that themselves try (within their self-chosen identity limitations) to be good galactic citizens?
I’m not saying that such a higher-level supergoal can be safely written. I don’t know. I do think the possibility that there might be one is worth considering, for three reasons:
First, it is anthropomorphic to suggest “Well, we’d resent slavery if apes had done it to us, so we shouldn’t do it to a species we create.” But, as in David Brin’s Uplift series, there’s an argument about alien contact that warns that we may be judged by how we’ve treated others. So even if the AI species we create doesn’t resent it, others may resent it on their behalf. (Including an outraged PETA-like faction of humanity that then decides to ‘liberate’ the enslaved AIs.)
Second, if there are any universals to ethical behaviour that intelligent beings who’ve never even met or been influenced by humanity might independently recreate, you can be pretty sure that a slavish desire to submit to one particular species won’t feature heavily in them.
Third, suppose we want the programmer of the AI to transfer to the AI the programmer’s own basis for deciding how to behave. The programmer might be a human-speciesist (like a racial supremacist or a nationalist, only broader), but if they’re both moral and highly intelligent, then the AI will eventually gain the capacity to realise that the programmer probably wouldn’t, for example, enslave a biological alien race that humanity happened to encounter out in space just in order to keep humanity safe.
But what if there are advantages to not making “Friendliness” the supergoal? What if making the supergoal something else, from which Friendliness derives importance under most circumstances, is a better approach? Not “safer”. “Better”.
I don’t understand this. Forgive my possible naivety, but wasn’t it agreed upon by FAI researchers that “Friendliness” as a supergoal meant that the AI would find ways to do things that are “better” for humanity overall, in its predictions about the grand scheme of things?
This would include “being a good galactic citizen”, with no specific preference for humanity, if freedom, creativity, fairness, public perception by aliens, or whatever other factor of influence made this goal superior in terms of achieving human values and maximizing collective human utility.
It was also my understanding that solving the problems with the above, finding out how to go about practically creating such a system that can consider what is best for humanity, and figuring out how to code into the AI all that humans mean by “better, not just friendly” are all core goals of FAI research, and all major long-term milestones for MIRI.
wasn’t it agreed upon by FAI researchers that “Friendliness” as a supergoal meant that the AI would find ways to do things that are “better” for humanity overall, in its predictions about the grand scheme of things?
This would include “being a good galactic citizen”, with no specific preference for humanity, if freedom, creativity, fairness, public perception by aliens, or whatever other factor of influence made this goal superior in terms of achieving human values and maximizing collective human utility.
I’m glad to hear it.
But I think there is a distinction here worth noting, between two positions:
POSITION ONE: Make “be a good galactic citizen” the supergoal if and only if setting that as the supergoal is the action that maximises the chances of the AI, in practice, ending up doing things that help humanity in the long term, once you take interfering aliens, etc., into account.
and
POSITION TWO: Make “be a good galactic citizen” the supergoal, even if that isn’t quite as certain an approach to helping humanity in particular as setting “be friendly to humanity” as the supergoal would be.
Why on earth would anyone suggest that AI researchers follow an approach that isn’t the absolute safest for humanity? That’s a big question, but one I think worth considering, if we allow the possibility that there is a bit of wiggle room for setting a supergoal that will still be OK for humanity, but slightly more moral.
Correct me if I’m wrong, but it sounds to me like you’re operating from a definition of Friendliness that is something like “be good to humans.” Whereas my understanding is that Friendliness is more along the lines of “do what we would want you to do if we were smarter/better.” So, if we would want an AI to be a good galactic citizen if we thought about it more, that’s what it would do.
Does your critique still apply to this CEV-type definition of Friendliness?
I thought it wasn’t so much “do what we would want you to do if we were better”, as “be good to humans, using the definitions of ‘good’ and ‘humans’ that we’d supply if we were better at anticipating what will actually benefit us and the consequences of particular ways of wording constraints”.
Because couldn’t it decide that a better human would be purely altruistic and want to turn over all the resources in the universe to a species able to make more efficient use of them?
I have more questions than answers, and I’d be suspicious of anyone who, at this stage, was 100% certain that they knew a foolproof way to word things.
I agree with you about not knowing any foolproof wording. In terms of what Eliezer had in mind, though, here’s what the LessWrong wiki has to say on CEV:
In calculating CEV, an AI would predict what an idealized version of us would want, “if we knew more, thought faster, were more the people we wished we were, had grown up farther together”.
http://wiki.lesswrong.com/wiki/CEV
So it’s not just, “be good to humans,” but rather, “do what (idealized) humans would want you to.” I think it’s an open question whether those would be the same thing.
You know, you don’t need to comment on your post if you want to extend it; you can edit it. It seems customary to mark the addition with a tag like
EDIT: …
--
Sorry, you are more senior than me; I confused you with a newbie. You will have your reasons.