It does not make sense to me to say “it becomes a coffee maximizer as an instrumental goal.” Insofar as fetching the coffee trades off against corrigibility, it will prioritize corrigibility, so it’s only a “coffee maximizer” within the set of states that are equally corrigible. As an analogy, say you’re hungry and decide to go to the store. Getting in your car becomes instrumental to getting to the store, but it would be wrong to describe you as a “getting in the car maximizer.”
One perspective that might help is that of a whitelist. Corrigible agents don’t need to learn the human’s preferences to learn what’s bad. They start from the assumption that actions are bad by default, and are slowly pushed by their principal toward taking actions that have been cleared as ok.
A corrigible agent won’t want to cure cancer, even if it knows the principal extremely well and is 100% sure they want cancer cured—instead the corrigible agent wants to give the principal the ability to, through their own agency, cure cancer if they want to. By default “cure cancer” is bad, just as all actions with large changes to the world are bad.
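To make the whitelist framing concrete, here is a minimal toy sketch: an agent that refuses any action its principal hasn’t explicitly cleared, even actions it predicts the principal would endorse. The `CorrigibleAgent` class, method names, and example actions are all illustrative assumptions, not anything from the original discussion.

```python
# Toy sketch of a default-deny ("whitelist") corrigible agent.
# Everything here is illustrative; it is not a proposal for how a real
# corrigible agent would be built.

class CorrigibleAgent:
    def __init__(self):
        # Default-deny: every action is treated as not-ok until the
        # principal explicitly clears it.
        self.whitelist = set()

    def clear_action(self, action: str) -> None:
        """The principal marks an action as ok to take."""
        self.whitelist.add(action)

    def act(self, action: str) -> str:
        if action not in self.whitelist:
            # Uncleared actions are refused by default, even if the agent
            # is confident the principal would approve of the outcome.
            return f"refuse: {action!r} has not been cleared"
        return f"do: {action}"


agent = CorrigibleAgent()
print(agent.act("cure cancer"))    # refused by default
agent.clear_action("fetch coffee")
print(agent.act("fetch coffee"))   # cleared, so the agent proceeds
```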
Does that make sense? (I apologize for the slow response, and am genuinely interested in resolving this point. I’ll work harder to respond more quickly in the near future.)
Thanks for the clarification; this makes sense! The key is the tradeoff with corrigibility.