This post uses the phrase “Bostrom’s original instrumental convergence thesis”. I’m not aware of there being more than one instrumental convergence thesis. In the 2012 paper that is linked here, the formulation of the thesis is identical to the one in the book Superintelligence (2014), except that the paper uses the phrase “many intelligent agents” instead of “a broad spectrum of situated intelligent agents”.
In case it’ll be helpful to anyone, the formulation of the thesis in the book Superintelligence is the following:
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.
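One way to semi-formalize this (my own paraphrase; neither the paper nor the book uses this notation): let $P(g \mid v, s)$ be the probability that an agent in situation $s$ realizes final goal $g$ given that it has attained instrumental value $v$. Then $v$ is convergent when

$$P(g \mid v, s) > P(g \mid \lnot v, s) \quad \text{for a wide range of goals } g \text{ and situations } s.$$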
I’m not sure what you meant here by saying that the instrumental convergence thesis “needs to be applied carefully”, or how the example you gave supports this. Even in environments where the agent is “alone”, we may still expect the agent to have the following potential convergent instrumental values (all of which are mentioned, both in the linked paper and in the book Superintelligence, as categories where “convergent instrumental values may be found”): self-preservation, cognitive enhancement, technological perfection, and resource acquisition.
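To make the convergence pattern concrete, here is a toy numeric sketch (entirely my own construction; the goals and the success function are arbitrary assumptions, not anything from the paper or the book). Whatever the final goal, acquiring more of a generic resource raises the probability of achieving it, which is the sense in which resource acquisition is a convergent instrumental value even for an agent that is “alone”:

```python
# Toy illustration of instrumental convergence (my own construction,
# not an example from Bostrom or Omohundro): for several unrelated
# final goals, more of a generic resource raises the chance of success,
# so "acquire resources" helps regardless of which goal the agent has.

def p_success(resources: float, difficulty: float) -> float:
    """Chance of achieving a goal of the given difficulty with the given resources."""
    return resources / (resources + difficulty)

# Hypothetical goals with arbitrary difficulty parameters.
goals = {"map the asteroid belt": 5.0, "prove a theorem": 20.0, "build a telescope": 1.0}

for name, difficulty in goals.items():
    low = p_success(2.0, difficulty)    # success chance with few resources
    high = p_success(10.0, difficulty)  # success chance with more resources
    print(f"{name}: P(success) rises from {low:.2f} to {high:.2f}")
```

Note that nothing in this sketch depends on other agents being present in the environment; the increase in success probability holds for the agent in isolation.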
Weird coincidence, but I just read Superintelligence for the first time, and I was struck by the lack of mention of Steve Omohundro (though he does show up in endnote 8). My citation for instrumental convergence would be Omohundro 2008.
I think that most of the citations in Superintelligence are in endnotes. In the endnote that follows the first sentence after the formulation of the instrumental convergence thesis, there’s an entire paragraph about Stephen Omohundro’s work on the topic (including citations of Omohundro’s “two pioneering papers on this topic”).
Even in environments where the agent is “alone”, we may still expect the agent to have the following potential convergent instrumental values
Right. But I sometimes bump into reasoning that feels like “instrumental convergence + smart AI + humans exist in the universe → bad things happen to us / the AI finds a way to hurt us”. I think this implication is usually true, but not necessarily true, and this extreme example illustrates how it can fail. (And note that the AGI could still hurt us in a sense, by simulating and torturing humans using its compute; some decision theories do seem to have it do that kind of thing.)
(Edited post to clarify)