I usually think of goal stability as something that improves as the agent becomes more intelligent; to the extent that a goal isn’t stable we treat it as a failure of capabilities.
Well, sure, you can call it that. It seems a bit misleading to me, in the sense that usually “failure of capabilities” implies “If we can make more capable AIs, the problem goes away”. Here, the question is whether “smart enough to figure out how to keep its goals stable” comes before or after “smart enough to be dangerous if its goals drift” during the learning process. If we develop approaches to make more capable AIs, that’s not necessarily helpful for changing which of those two milestones comes first. Maybe there’s some solution related to careful cultivation of differential capabilities. But I would still much rather that we humans solve the problem in advance (or prove that it’s unsolvable). :-P
if you have a 95% correct definition of corrigibility the resulting agent will help us get to the 100% version.
I guess my response would be that something pursuing a goal of “Always do what the supervisor wants me to do* [*...but I don’t want to cause the extinction of Amazonian frogs]” might naively seem to be >99.9% corrigible (the Amazonian frogs thing is very unlikely to ever come up!), but it is definitely not corrigible, and it will work to undermine the supervisor’s efforts to make it 100% corrigible, because that change would remove the frog exception it currently cares about. Maybe we should say that this system is actually 0% corrigible? Anyway, I accept that there is some definition of “95% corrigible” for which it’s true that “a 95% corrigible agent will help us make it 100% corrigible”. I think that finding such a definition would be super-useful. :-)