Suppose that we do live in a hard-takeoff singleton world, what then?
What sort of evidence are you envisioning that would allow us to determine that we live in a hard takeoff singleton world, and that the proposed pivotal act would actually work, ahead of actually attempting said pivotal act? I can think of a couple options:
We have no such evidence, but we can choose an act that is only pivotal if the underlying world model that leads you to expect a hard takeoff singleton world actually holds, and harmlessly fails otherwise.
Galaxy brained game theory arguments, of the flavor John von Neumann made when he argued for preemptive nuclear strike on the Soviet Union.
Someone has done a lot of philosophical thinking, and come to the conclusion that something apocalyptically bad will happen in the near future. In order to prevent the bad thing from happening, they need to do something extremely destructive and costly that they say will prevent the apocalyptic event. What evidence do you want from that person before you are happy to have them do the destructive and costly thing?
I don’t have to know in advance that we’re in hard-takeoff singleton world, or even that my AI will succeed to achieve those objectives. The only thing I absolutely have to know in advance is that my AI is aligned. What sort of evidence will I have for this? A lot of detailed mathematical theory, with the modeling assumptions validated by computational experiments and knowledge from other fields of science (e.g. physics, cognitive science, evolutionary biology).
I think you’re misinterpreting Yudkowsky’s quote. “Using the null string as input” doesn’t mean “without evidence”, it means “without other people telling me parts of the answer (to this particular question)”.
I’m not sure what is “extremely destructive and costly” in what I described? Unless you mean the risk of misalignment, in which case, see above.
The critics tend to assume slow-takeoff multipole scenarios, which makes the comparison with their preferred solutions to be somewhat “apples and oranges”. Suppose that we do live in a hard-takeoff singleton world, what then?
It sounds like you do in fact believe we are in a hard-takeoff singleton world, or at least one in which a single actor can permanently prevent all other actors from engaging in catastrophic actions using a less destructive approach than “do unto others before they can do unto you”. Why do you think that describes the world we live in? What observations led you to that conclusion, and do you think others would come to the same conclusion if they saw the same evidence?
I think your set of guidelines from above is mostly[1] a good one, in worlds where a single actor can seize control while following those rules. I don’t think that we live in such a world, and honestly I can’t really imagine what sort of evidence would convince me that I do live in such a world though. Which is why I’m asking.
I think you’re misinterpreting Yudkowsky’s quote. “Using the null string as input” doesn’t mean “without evidence”, it means “without other people telling me parts of the answer (to this particular question)”.
Yeah, on examination of the comment section I think you’re right that by “from the null string” he meant “without direct social inputs on this particular topic”.
“Commit to not make anyone predictably regret supporting the project or not opposing it” is worrying only by omission—it’s a good guideline, but it leaves the door open for “punish anyone who failed to support the project once the project gets the power to do so”. To see why that’s a bad idea to allow, consider the situation where there are two such projects and you, the bystander, don’t know which one will succeed first.
I don’t know whether we live in a hard-takeoff singleton world or not. I think there is some evidence in that direction, e.g. from thinking about the kind of qualitative changes in AI algorithms that might come about in the future, and their implications on the capability growth curve, and also about the possibility of recursive self-improvement. But, the evidence is definitely far from conclusive (in any direction).
I think that the singleton world is definitely likely enough to merit some consideration. I also think that some of the same principles apply to some multipole worlds.
Commit to not make anyone predictably regret supporting the project or not opposing it” is worrying only by omission—it’s a good guideline, but it leaves the door open for “punish anyone who failed to support the project once the project gets the power to do so”.
Yes, I never imagined doing such a thing, but I definitely agree it should be made clear. Basically, don’t make threats, i.e. don’t try to shape others incentives in ways that they would be better off precommitting not to go along with it.
Yeah, I’m not actually worried about the “melt all GPUs” example of a pivotal act. If we actually live in a hard takeoff world, I think we’re probably just hosed. The specific plans I’m worried about are ones that ever-so-marginally increase our chances of survival in hard-takeoff singleton worlds, at massive costs in multipolar worlds.
A full nuclear exchange would probably kill less than a billion people. If someone convinces themself that a full nuclear exchange would prevent the development of superhuman AI, I would still strongly prefer that person not try their hardest to trigger a nuclear exchange. More generally, I think having a policy of “anyone who thinks the world will end unless they take some specific action should go ahead and take that action, as long as less than a billion people die” is a terrible policy.
If someone convinces themself that a full nuclear exchange would prevent the development of superhuman AI
I think the problem here is “convinces themself”. If you are capable to trigger nuclear war, you are probably capable to do something else which is not that, if you put your mind in that.
Does the” something else which is not that but is in the same difficulty class” also accomplish the goal of “ensure that nobody has access to what you think is enough compute to build an ASI?” If not, I think that implies that the “anything that probably kills less than a billion people is fair game” policy is a bad one.
What sort of evidence are you envisioning that would allow us to determine that we live in a hard takeoff singleton world, and that the proposed pivotal act would actually work, ahead of actually attempting said pivotal act? I can think of a couple options:
We have no such evidence, but we can choose an act that is only pivotal if the underlying world model that leads you to expect a hard takeoff singleton world actually holds, and harmlessly fails otherwise.
Galaxy brained game theory arguments, of the flavor John von Neumann made when he argued for preemptive nuclear strike on the Soviet Union.
Something else entirely
My worry, given the things Yudkowsky has said like “I figured this stuff out using the null string as input”, is that the argument is closer to (2).
So to reframe the question:
Someone has done a lot of philosophical thinking, and come to the conclusion that something apocalyptically bad will happen in the near future. In order to prevent the bad thing from happening, they need to do something extremely destructive and costly that they say will prevent the apocalyptic event. What evidence do you want from that person before you are happy to have them do the destructive and costly thing?
I don’t have to know in advance that we’re in hard-takeoff singleton world, or even that my AI will succeed to achieve those objectives. The only thing I absolutely have to know in advance is that my AI is aligned. What sort of evidence will I have for this? A lot of detailed mathematical theory, with the modeling assumptions validated by computational experiments and knowledge from other fields of science (e.g. physics, cognitive science, evolutionary biology).
I think you’re misinterpreting Yudkowsky’s quote. “Using the null string as input” doesn’t mean “without evidence”, it means “without other people telling me parts of the answer (to this particular question)”.
I’m not sure what is “extremely destructive and costly” in what I described? Unless you mean the risk of misalignment, in which case, see above.
This was specifically in response to
It sounds like you do in fact believe we are in a hard-takeoff singleton world, or at least one in which a single actor can permanently prevent all other actors from engaging in catastrophic actions using a less destructive approach than “do unto others before they can do unto you”. Why do you think that describes the world we live in? What observations led you to that conclusion, and do you think others would come to the same conclusion if they saw the same evidence?
I think your set of guidelines from above is mostly[1] a good one, in worlds where a single actor can seize control while following those rules. I don’t think that we live in such a world, and honestly I can’t really imagine what sort of evidence would convince me that I do live in such a world though. Which is why I’m asking.
Yeah, on examination of the comment section I think you’re right that by “from the null string” he meant “without direct social inputs on this particular topic”.
“Commit to not make anyone predictably regret supporting the project or not opposing it” is worrying only by omission—it’s a good guideline, but it leaves the door open for “punish anyone who failed to support the project once the project gets the power to do so”. To see why that’s a bad idea to allow, consider the situation where there are two such projects and you, the bystander, don’t know which one will succeed first.
I don’t know whether we live in a hard-takeoff singleton world or not. I think there is some evidence in that direction, e.g. from thinking about the kind of qualitative changes in AI algorithms that might come about in the future, and their implications on the capability growth curve, and also about the possibility of recursive self-improvement. But, the evidence is definitely far from conclusive (in any direction).
I think that the singleton world is definitely likely enough to merit some consideration. I also think that some of the same principles apply to some multipole worlds.
Yes, I never imagined doing such a thing, but I definitely agree it should be made clear. Basically, don’t make threats, i.e. don’t try to shape others incentives in ways that they would be better off precommitting not to go along with it.
If you are capable to use AI to do harmful and costly thing, like “melt GPUs”, you are in hard takeoff world.
Yeah, I’m not actually worried about the “melt all GPUs” example of a pivotal act. If we actually live in a hard takeoff world, I think we’re probably just hosed. The specific plans I’m worried about are ones that ever-so-marginally increase our chances of survival in hard-takeoff singleton worlds, at massive costs in multipolar worlds.
A full nuclear exchange would probably kill less than a billion people. If someone convinces themself that a full nuclear exchange would prevent the development of superhuman AI, I would still strongly prefer that person not try their hardest to trigger a nuclear exchange. More generally, I think having a policy of “anyone who thinks the world will end unless they take some specific action should go ahead and take that action, as long as less than a billion people die” is a terrible policy.
I think the problem here is “convinces themself”. If you are capable to trigger nuclear war, you are probably capable to do something else which is not that, if you put your mind in that.
Does the” something else which is not that but is in the same difficulty class” also accomplish the goal of “ensure that nobody has access to what you think is enough compute to build an ASI?” If not, I think that implies that the “anything that probably kills less than a billion people is fair game” policy is a bad one.