If you ask a corrigible agent to bring you a cup of coffee, it should confirm that you want a hot cup of simple, black coffee, then internally check to make sure that the cup won’t burn you, that nobody will be upset at the coffee being moved or consumed, that the coffee won’t be spilled, and so on. But it will also, after performing these checks, simply do what’s instructed. A corrigible agent’s actions should be straightforward, easy to reverse and abort, plainly visible, and comprehensible to a human who takes time to think about them. Corrigible agents proactively study themselves, honestly report their own thoughts, and point out ways in which they may have been poorly designed. A corrigible agent responds quickly and eagerly to corrections, and shuts itself down without protest when asked. Furthermore, small flaws and mistakes when building such an agent shouldn’t cause these behaviors to disappear, but rather the agent should gravitate towards an obvious, simple reference-point.
Isn’t corrigibility still susceptible to power-seeking according to this definition? The agent wants to bring you a cup of coffee, it notices that the chance of spillage is reduced if it has access to more coffee, and so it becomes a coffee maximizer as an instrumental goal.
Now, it is still corrigible: it does not hide its thought processes, and it tells the human exactly what it is doing and why. But when the agent is making millions of decisions and humans can only review so many thought processes (only so many humans will take the time to think about the agent’s actions), many decisions will fall through the cracks and end up misaligned.
Is the goal to learn the human’s preferences through interaction then, and hope that it learns the preferences enough to know that power-seeking (and other harmful behaviors) are bad?
The problem is, there could be harmful behaviors we haven’t thought to train the AI against; those are never corrected, so the AI proceeds with them.
If so, can we define a corrigible agent that is actually what we want?
It does not make sense to me to say “it becomes a coffee maximizer as an instrumental goal.” Like, insofar as fetching the coffee trades off against corrigibility, it will prioritize corrigibility, so it’s only a “coffee maximizer” within the boundary of states that are equally corrigible. As an analogy, let’s say you’re hungry and decide to go to the store. Getting in your car becomes an instrumental goal of going to the store, but it would be wrong to describe you as a “getting in the car maximizer.”
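To make that prioritization concrete, here’s a toy sketch (the scoring functions and option names are made up for illustration, not a proposal for how to actually measure corrigibility): the agent only optimizes the task within the set of maximally corrigible options.

```python
# Toy sketch of "corrigibility first, task second" (lexicographic priority).
# The scoring functions are hypothetical stand-ins, not real proposals.

def choose_action(actions, corrigibility_score, task_value):
    """Optimize the task only among the actions that are maximally corrigible."""
    best = max(corrigibility_score(a) for a in actions)
    equally_corrigible = [a for a in actions if corrigibility_score(a) == best]
    # Within this boundary the agent is free to optimize the task;
    # this is the only sense in which it "maximizes coffee".
    return max(equally_corrigible, key=task_value)

# Toy usage: hoarding the coffee scores well on the task but poorly on corrigibility.
options = ["fetch one cup", "hoard all the coffee", "fetch one cup with a lid"]
corr = {"fetch one cup": 1.0, "hoard all the coffee": 0.2, "fetch one cup with a lid": 1.0}
task = {"fetch one cup": 0.8, "hoard all the coffee": 1.0, "fetch one cup with a lid": 0.9}
print(choose_action(options, corr.get, task.get))  # -> "fetch one cup with a lid"
```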
One perspective that might help is that of a whitelist. Corrigible agents don’t need to learn the human’s preferences to learn what’s bad. They start off with an assumption that things are bad, and slowly get pushed by their principal into taking actions that have been cleared as ok.
A corrigible agent won’t want to cure cancer, even if it knows the principal extremely well and is 100% sure they want cancer cured—instead the corrigible agent wants to give the principal the ability to, through their own agency, cure cancer if they want to. By default “cure cancer” is bad, just as all actions with large changes to the world are bad.
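As a toy illustration of that default-deny framing (the class and method names below are just for exposition, not an actual design), actions start out off-limits and only become available once the principal clears them:

```python
# Toy illustration of the whitelist framing: everything is off-limits by
# default, and the principal gradually clears specific actions.
# (Hypothetical names; not a real interface.)

class WhitelistAgent:
    def __init__(self):
        self.cleared = set()  # starts empty: all actions are "bad" by default

    def clear(self, action):
        """The principal explicitly marks an action as acceptable."""
        self.cleared.add(action)

    def act(self, action):
        if action in self.cleared:
            return f"doing: {action}"
        # Even if the agent is confident the principal would approve,
        # it defers to them rather than acting unilaterally.
        return f"deferring: {action} has not been cleared, asking principal"

agent = WhitelistAgent()
agent.clear("fetch coffee")
print(agent.act("fetch coffee"))  # doing: fetch coffee
print(agent.act("cure cancer"))   # deferring: cure cancer has not been cleared, asking principal
```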
Does that make sense? (I apologize for the slow response, and am genuinely interested in resolving this point. I’ll work harder to respond more quickly in the near future.)
Thanks for the clarification, this makes sense! The key is the tradeoff with corrigibility.