Might not intent alignment (doing what a human wants it to do, being helpful) be a better target than obedience (doing what a human told it to do)?
I should clarify that when I think about obedience, I’m thinking of obedience to the spirit of an instruction, not just the wording of it. Given this, the two seem fairly similar, and I’m open to arguments about whether it’s better to talk in terms of one or the other. I guess I favour “obedience” because it has fewer connotations of agency: if you’re “doing what a human wants you to do”, then you might run off and do things before receiving any instructions. (Also because it’s shorter and pithier; “the goal of doing what humans want” is a bit of a mouthful.)
Ah, ok. When you said “obedience” I imagined too little agency — an agent that wouldn’t stop to ask clarifying questions. But I think we’re on the same page regarding the flavor of the objective.