This comment raises some good points, but even “there will be a natural pressure for [subprocesses] to resemble a corrigible agent” seems debatable. Consider the restaurant setting again. It is sometimes necessary for restaurants to close temporarily for renovation: to increase seating capacity, upgrade equipment, etc. The head chef who decides to renovate will be making the instrumental goals of all the other chefs (make good food, earn money to stay alive) untenable while they are furloughed. More generally, progress toward terminal goals is not monotonic, so focusing only on the local topology of the optimization landscape may be insufficient to predict long-horizon trends.
This seems right. Some sub-properties of corrigibility, such as not subverting the higher-level process and being shutdownable, should be expected in well-constructed sub-processes. But corrigibility is probably about more than just that (e.g. perhaps myopia), and we should be careful not to assume that well-constructed sub-processes that resemble agents will have all the corrigibility properties.
To be fair, I think shutdownability and not subverting higher-level goals were the original motivations for corrigibility research in the first place, so this is a good thing.