Pursuing a provably friendly AGI, even if very unlikely to succeed, could still be the right thing to do if it were certain that we’ll have a hard takeoff very soon after the creation of the first AGIs.
One consideration you’re missing (and that I expect to be true; Eliezer also points it out) is that even if there is a very slow takeoff, creation of slow-thinking poorly understood unFriendly AGIs is not any help in developing a FAI (they can’t be “debugged” when you don’t have accurate understanding of what it is you are aiming for; and they can’t be “asked” to solve a problem which you can’t accurately state). In this hypothetical, in the long run the unFriendly AGIs (or WBEs whose values have drifted away from original human values) will have control. So in this case it’s also necessary (if a little bit less urgent, which isn’t really enough to change the priority of the problem) to work on FAI theory; hard takeoff is not decisively important in this respect.
(Btw, is this point in any of the papers? Do people agree it should be?)
Please clarify: Do you mean that since even a slow-takeoff AGI will eventually explode and become unFriendly by default, we have to work on FAI theory whether the takeoff is fast or slow?
Yes, that seems straightforward, though I don’t know if it has been said explicitly.
But the question is whether we should also work on other approaches as stopgaps, whether during a slow takeoff or before a takeoff begins.
Statements to the effect that it’s necessary to argue that hard takeoff is probable/possible in order to motivate FAI research appear regularly; even your post left this same impression. I don’t think hard takeoff is particularly relevant to that motivation, so having this argument written up somewhere might be useful.
since even a slow-takeoff AGI will eventually explode
Doesn’t need to explode; gradual growth into a global power strong enough to threaten humans is sufficient. With WBE value drift, there doesn’t even need to be any conflict or any AGI: humanity as a whole might lose its original values.
Statements to the effect that it’s necessary to argue that hard takeoff is probable/possible in order to motivate FAI research appear regularly; even your post left this same impression.
No, I didn’t want to give that impression. SI’s research direction is the most important one, regardless of whether we face a fast or slow takeoff. The question raised was whether other approaches are needed too.
The latter is not necessarily a bad thing though.

It is a bad thing, in the sense that “bad” is whatever I (normatively) value less than the other available alternatives, and value-drifted WBEs won’t be optimizing the world in a way that I value. The property of valuing the world in a different way, and correspondingly of optimizing the world in a different direction which I don’t value as much, is the “value drift” I’m talking about. In other words, if it’s not bad, there isn’t much value drift; and if there is enough value drift, it is bad.
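To make that concrete, here is a minimal toy sketch (every outcome, number, and utility assignment below is invented purely for illustration):

```python
# Toy model: "value drift" as divergence between two utility functions,
# and "bad" as the original-values shortfall of the outcome that a
# competent optimizer with the drifted values would actually pick.
# All outcomes and numbers are invented for illustration.

outcomes = ["fun-rich future", "drab maximal-output future", "mixed future"]

u_original = {"fun-rich future": 1.0,
              "drab maximal-output future": 0.0,
              "mixed future": 0.6}

u_drifted = {"fun-rich future": 0.2,
             "drab maximal-output future": 1.0,
             "mixed future": 0.5}

# A drifted optimizer maximizes *its own* utility function...
chosen = max(outcomes, key=lambda o: u_drifted[o])

# ...and the badness of the drift is how much original value that costs.
shortfall = max(u_original.values()) - u_original[chosen]

print(chosen)     # drab maximal-output future
print(shortfall)  # 1.0 -- large drift, maximally bad by the original values
```

If u_drifted were instead close to u_original, the shortfall would be near zero, which is the sense in which “not bad” and “little drift” coincide.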
You’re right in the sense that we’d like to avoid it, but if it occurs gradually, it feels much more like “we just changed our minds” (we definitely don’t value “honor” as much as the ancient Greeks did, etc.), as compared to “we and our values were wiped out”.
The problem is not “losing our values”; it’s the future being optimized toward something other than our values. The details of the process that leads to the incorrectly optimized future are immaterial; it’s the outcome that matters. When I say “our values”, I’m referring to a fixed idea, one that doesn’t depend on what happens in the future; in particular, it doesn’t depend on whether there are people with these or different values in the future.
I think one reason why people (including me, in the past) have difficulty accepting the way you present this argument is that you’re speaking in too abstract terms, while many of the values that we’d actually like to preserve are ones we appreciate most when we consider them in “near” mode. It might work better if you gave concrete examples of ways in which there could be catastrophic value drift, like naming Bostrom’s all-work-and-no-fun scenario where
what will maximize fitness in the future will be nothing but non-stop high-intensity drudgery, work of a drab and repetitive nature, aimed at improving the eighth decimal of some economic output measure

or some similar example.
creation of slow-thinking poorly understood unFriendly AGIs is not any help in developing a FAI (they can’t be “debugged” when you don’t have accurate understanding of what it is you are aiming for; and they can’t be “asked” to solve a problem which you can’t accurately state)
Given that AGI has not been achieved yet, and that an FAI will be an AGI, it seems like any AGI would serve as a useful prototype and give insight into what tends to work for creating general intelligences.
If the prototype AGIs are to be built by people concerned with Friendliness, it seems like they could be even more useful: testing the feasibility of techniques that seem promising for inclusion in an FAI’s source code, for instance, or checking for flaws in some safety proposal, or doing some kind of theorem-proving work.
creation of slow-thinking poorly understood unFriendly AGIs is not any help in developing a FAI
If we use a model where building a uFAI requires only solving the AGI problem, and building FAI requires solving AGI + Friendliness: are you saying that it will be of no help in developing Friendliness, or of no help in developing either AGI or Friendliness?
(The former claim would sound plausible though non-obvious, and the latter way too strong.)
No help in developing FAI theory (decision theory and a way of pointing to human values); probably of little help in developing an FAI implementation, although there might be useful methods in common.
FAI requires solving AGI + Friendliness
I don’t believe it works like that. Making a poorly understood AGI doesn’t necessarily help with implementing a FAI (even if you have the theory figured out), as a FAI is not just parameterized by its values but also defined by the correctness of its interpretation of those values (decision theory), which other AGI designs by default won’t have.
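As a minimal sketch of that distinction (hypothetical agents and invented numbers, not anyone’s actual proposal): two agents can share the very same value parameter and still act differently, because the decision procedure determines what those values end up meaning in practice.

```python
# Two agents parameterized by the *same* values, differing only in how
# they interpret those values. Options, values, and probabilities are
# invented for illustration.

values = {"safe plan": 1.0, "risky plan": 5.0}
p_success = {"safe plan": 1.0, "risky plan": 0.1}

def expected_value_agent(options):
    # Interprets the values through expected value under uncertainty.
    return max(options, key=lambda o: values[o] * p_success[o])

def naive_maximizer(options):
    # Interprets the values literally, ignoring uncertainty entirely.
    return max(options, key=lambda o: values[o])

options = ["safe plan", "risky plan"]
print(expected_value_agent(options))  # safe plan  (1.0 > 5.0 * 0.1)
print(naive_maximizer(options))       # risky plan
```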
Indeed. For example, on the F front, computational models of human ethical reasoning seem like something that could help increase the safety of all kinds of AGI projects and also be useful for Friendliness theory in general, and some of them could conceivably be developed in the context of heuristic AGI. Likewise, on the AGI front, there should be all kinds of machine learning techniques and advances in probability theory that would be equally useful for pretty much any kind of AGI. After all, we already know that an understanding of e.g. Bayes’ theorem and expected utility will be necessary for pretty much any kind of AGI implementation, so why should we assume that all of the insights that will be useful in many kinds of contexts have already been developed?
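As a toy illustration of that last point (all numbers invented), here is the kind of Bayes-plus-expected-utility core that any AGI design would plausibly need in some form:

```python
# Minimal Bayes-plus-expected-utility machinery of the sort any AGI
# design would need somewhere in its core. Numbers are invented.

def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem: P(H|E) = P(E|H)P(H) / P(E)."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / p_e

def expected_utility(action, p_h, utility):
    """E[U(action)] given uncertainty about whether hypothesis H holds."""
    return p_h * utility[(action, True)] + (1.0 - p_h) * utility[(action, False)]

# Update a belief on some evidence, then pick the action that
# maximizes expected utility under the updated belief.
p_h = posterior(prior=0.3, p_e_given_h=0.8, p_e_given_not_h=0.1)

utility = {("act", True): 10.0, ("act", False): -5.0,
           ("wait", True): 0.0, ("wait", False): 0.0}

best = max(["act", "wait"], key=lambda a: expected_utility(a, p_h, utility))
print(round(p_h, 3), best)  # 0.774 act
```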
Making a poorly understood AGI doesn’t necessarily help with implementing a FAI (even if you have the theory figured out)
Right, by the above I meant to say “the right kind of AGI + Friendliness”; I certainly agree that there are many conceivable ways of building AGIs that would be impossible to ever make Friendly.
slow-thinking unFriendly AGIs … not any help in developing a FAI
One suggestion is that slow-thinking unFriendly near-human AIs may indeed help develop an FAI:
(1) As a test bed, as a way of learning from examples.
(2) They can help figure things out. Of course, we don’t want them to be too smart, but dull nascent AGIs, if they don’t explode, might be research partners of a sort.
(To clarify, unFriendly means “without guaranteed Friendliness”, which is close but not identical to “guaranteed to kill us.”)
Ben Goertzel and Joel Pitt (2012) suggest the former for nascent AGIs. Carl Shulman’s recent article also suggests the latter for infrahuman WBEs.
in the long run the unFriendly AGIs (or WBEs whose values have drifted away from original human values) will have control

That’s the question: How long a run do we have?