I feel like the focus on getting access to its own datacenter is too strong in this story. It seems like it could also just involve hacking some random remote server, convincing some random person on the internet to buy compute for it (or to execute some other plan for it, like producing a custom chip), convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder into giving it power or compute of some kind. All of these would of course be selected for being least likely to be detected, if the AI expects to eventually gain the power to do some kind of pivotal thing here. I agree that a cybersecurity problem is likely on priors, but this current archetype feels too narrow to me.
Also, I think at the end of the day the most likely catastrophic AI action will be “the AI was already given practically root access to its datacenter and was put directly in control of its own training, because the most reckless team won’t care about safety, and unless the AI actively signals that it is planning to act adversarially towards its developers, people will just keep pouring resources into it with approximately no safeguards, and then it kills everyone”. I do hope we end up in worlds where we try harder than that (and that seems achievable), but it seemed worth stating that I expect us to fail at an even earlier stage than this story seems to imply.
Except for maybe “producing a custom chip”, I agree with these as other possibilities, and I think they’re in line with the point I wanted to make, which is that the catastrophic action involves taking someone else’s resources such that the AI can prevent humans from observing or interfering with it, rather than doing something which is directly a pivotal act.
Does this distinction make sense?
Maybe this would have been clearer if I’d titled it “AI catastrophic actions are mostly not pivotal acts”?
Yeah, OK, I think this distinction makes sense, and I do feel like it’s important.
Having settled this, my primary response is:
Sure, I guess it’s the most prototypical catastrophic action until we have solved it. But even if we solve it, we haven’t solved the problem where the AI actually gets a lot smarter than humans, takes a substantially more “positive-sum” action, and kills approximately everyone with a bioweapon, or launches all the nukes, or develops nanotechnology. We do have to solve this problem first, but the hard part is that it seems hard to stop further AI development without having a system that is also capable of killing all (or approximately all) the humans, so calling this easy problem the “prototypical catastrophic action” feels wrong to me. Solving this problem is necessary but not sufficient for solving AI alignment, and while this stage and earlier stages are where I expect most worlds to end, I expect most worlds that make it past this stage not to survive either.
Given this belief, I think your new title would be more wrong than the current title (well, maybe it’s “mostly”, because we are going to die in a low-dignity way, as Eliezer would say, but it’s not obviously where most of the difficulty lies).
I’m using “catastrophic” in the technical sense of “unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time”, rather than “very bad thing that happens because of AI”; apologies if this was confusing.
My guess is that you will wildly disagree with the frame I’m going to use here, but I’ll just spell it out anyway: I’m interested in “catastrophes” as a remaining problem after you have solved the scalable oversight problem. If your AI is able to do one of these “positive-sum” pivotal acts in a single action, and you haven’t already lost control, then you can use your overseer to oversee the AI as it takes actions, and by assumption you only have to watch it for a small number of actions (maybe I want to say episodes rather than actions) before it has done some crazy powerful stuff and saved the world. So I think I stand by the claim that those pivotal acts aren’t where much of the x-risk from AI catastrophic actions (in the specific sense I’m using) comes from.
Thanks again for your thoughts here, they clarified several things for me.