Disagree on several points. I don’t need future AIs to satisfy some mathematically simple description of corrigibility; I just need them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized, there seem to be no barriers to achieving these properties in the course of ordinary engineering. If there is some argument for why this is unlikely, I haven’t seen a good rigorous version.
As Algon says in a sibling comment, non-agentic systems are by default shutdownable, myopic, etc. In addition, there are powerful shutdownable systems: KataGo can beat me at Go but doesn’t prevent itself from being shut down for instrumental reasons, whereas humans generally will. So there is no linear scale of “powerful optimizer” that determines whether a system is easy to shut down. If there is some property that competent systems have in practice that does prevent shutdownability, what is it? Likewise with the other corrigibility properties. That’s what I’m trying to get at with my comment. “Goal-oriented” is not an answer; it’s not specific enough for us to make engineering progress on corrigibility.
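To make that concrete, here’s a toy sketch (the setup and numbers are invented by me purely for illustration, not a claim about any real system): an agent can pay a small cost to disable an off switch, and otherwise the operator may shut it down at each step. A myopic agent that only scores the next step gains nothing by disabling the switch; a planner optimizing over a longer horizon does.

```python
# Toy illustration (invented for this comment, not a real system): an agent
# can spend a small cost to disable an off switch; otherwise the operator
# may shut it down at each step with some probability. A myopic agent
# (horizon 1) gains nothing by disabling the switch; a long-horizon planner
# does, because disabling it raises expected return over the episode.

def expected_return(disable_switch: bool, horizon: int,
                    reward_per_step: float = 1.0,
                    shutdown_prob: float = 0.5,
                    disable_cost: float = 0.2) -> float:
    """Expected reward with / without disabling the off switch."""
    ret = -disable_cost if disable_switch else 0.0
    p_alive = 1.0
    for _ in range(horizon):
        ret += p_alive * reward_per_step
        if not disable_switch:
            p_alive *= 1.0 - shutdown_prob  # operator may shut the agent down
    return ret

# Myopic: disabling the switch doesn't pay (1.0 vs 0.8).
print(expected_return(False, horizon=1), expected_return(True, horizon=1))
# Long horizon: disabling the switch pays (~2.0 vs 9.8).
print(expected_return(False, horizon=10), expected_return(True, horizon=10))
```

The property doing the work in this toy case is a long planning horizon plus a model of the off switch, not raw capability, and that’s the kind of specific property I’m asking for.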
I think the claim that there is no description of corrigibility to which systems can easily generalize is a really strong one. It’s plausible to me that corrigibility (again, in this practical rather than mathematically elegant sense) is rare or anti-natural in systems competent enough to do novel science efficiently, but your claim seems to be that it’s incoherent. That seems unlikely, because myopia, shutdownability, and the other properties on Eliezer’s laundry list are just ordinary cognitive properties that we can apply selection pressure on, and modern ML is pretty good at generalizing. Nate’s post here argues that we are unlikely to get corrigibility without investing in an underdeveloped “science of AI” that gives us mechanistic understanding; I think that needs some further argument to be convincing, but your claim seems even stronger.
I’m also unsure why you say shutdownability hasn’t been formalized. I feel like we’re confused about how to get shutdownability, not what it is.
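For instance, one framing from the literature (paraphrasing roughly from memory; details vary across write-ups) pins down “what it is” by having the agent’s utility switch to a shutdown utility once a button is pressed:

$$U(\omega) \;=\; \begin{cases} U_N(\omega) & \text{if the shutdown button has not been pressed,}\\ U_S(\omega) & \text{if it has,} \end{cases}$$

where $U_N$ scores normal behaviour and $U_S$ scores having actually shut down, with the further desiderata (stated informally) that the agent neither causes nor prevents the button press and builds any successors to satisfy the same conditions. The hard part, as far as I can tell, is constructing an agent that actually satisfies those desiderata without weird side effects, which is a “how to get it” problem rather than a “what is it” problem.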
In addition, there are powerful shutdownable systems: KataGo can beat me at Go but doesn’t prevent itself from being shut down for instrumental reasons, whereas humans generally will.
KataGo is a system causally downstream of a process that made it good at Go. To attempt to prevent itself from being shut down, KataGo would need some model of what it means to be ‘shut down’.
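To put the same point another way, here is a deliberately simplified stand-in, not KataGo’s real architecture or code (the actual engine is a neural network plus tree search): the engine’s interface is a function from board positions to moves, and nothing about the operator, the process it runs in, or an off switch appears anywhere in its inputs or outputs.

```python
# Deliberately simplified stand-in for a Go engine's interface (not KataGo's
# real code): the policy sees only the board and emits only a move. There is
# no channel through which a concept like "I am about to be shut down" could
# enter, so shutdown-resistance isn't expressible in this interface at all.
from typing import List, Tuple

Board = List[List[int]]   # 0 = empty, 1 = black, 2 = white
Move = Tuple[int, int]

def choose_move(board: Board) -> Move:
    """Pick the highest-scoring legal move under some learned evaluation."""
    legal = [(r, c) for r, row in enumerate(board)
                    for c, stone in enumerate(row) if stone == 0]
    return max(legal, key=lambda mv: evaluate(board, mv))

def evaluate(board: Board, move: Move) -> float:
    """Placeholder for the learned value/policy network."""
    return 0.0  # a real engine would score the resulting position here
```

So “resist being shut down” isn’t something the system is choosing not to do; it isn’t a behaviour this kind of interface can even express.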
Comparing KataGo to humans when it comes to shutdownability is evidence of confusion.