“Counterfactual oversight” isn’t really explained in the post I linked to, but rather in the first post that post links to, titled “Human-in-the-counterfactual-loop”. (The link text is “counterfactual human oversight”. Yeah, having all these different names is confusing, but all three phrases refer to the same thing.) The key part of that post is this:
Human-in-the-counterfactual-loop. Each time the system wants to act, it consults a human with a very small probability. The system does what it thinks a human would have told it to do if the human had been consulted.
So the “what it thinks a human would have told it to do if the human had been consulted” is the human imitation part, because it’s predicting what a human would do, which is the same as imitating a human. And then the oversight part is that the human imitation is telling the system what to do. (My “oversee” is referring to this “oversight”.)
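To make the quoted loop concrete, here is a minimal sketch in Python. It is not taken from the linked post: the names `HumanImitation`, `act`, `ask_human`, and `consult_prob` are all hypothetical, the "model" is a toy lookup table rather than a learned predictor, and I am assuming the system acts on its prediction even on the rare steps where the human is actually consulted (the quote does not say either way).

```python
import random
from collections import Counter, defaultdict


class HumanImitation:
    """Toy stand-in for the model of the human (hypothetical, not from the post)."""

    def __init__(self, default_action="noop"):
        self.default_action = default_action
        # situation -> counts of answers the real human gave when consulted
        self.seen = defaultdict(Counter)

    def predict(self, situation):
        """Predict what the human would have said if consulted."""
        answers = self.seen.get(situation)
        if not answers:
            return self.default_action
        return answers.most_common(1)[0][0]

    def update(self, situation, human_answer):
        """Learn from one real consultation."""
        self.seen[situation][human_answer] += 1


def act(situation, model, ask_human, consult_prob=0.01, rng=random):
    """One step of the loop.

    The action always comes from the prediction (the "human imitation" part).
    With small probability the real human is consulted, and the answer is
    used only to improve future predictions (an assumption of this sketch).
    """
    action = model.predict(situation)
    if rng.random() < consult_prob:
        model.update(situation, ask_human(situation))
    return action
```

In this sketch `act` is the deployed system and the `update` calls are the training signal, so the "oversight" is whatever `model.predict` returns at each step.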
I hope that clears things up?
Thanks! I think I understand this now.
I will say some things that occurred to me while thinking more about this, and hope that someone will correct me if I get something wrong.
“Human imitation” is sometimes used to refer to the outward behavior of the system (e.g. “imitation learning”, and in posts like “Just Imitate Humans?”), and sometimes to refer to the model of the human inside the system (e.g. here when you say “the human imitation is telling the system what to do”).
A system that is more capable than a human can still be a “human imitation”, because “human imitation” is being used in the sense of “modeling humans inside the system” instead of “has the outward behavior of a human”.
There is a distinction between the counterfactual training procedure and the resulting system. “Counterfactual oracle” (singular) seems to be used to refer to the resulting system, and Paul calls this “the system” in his “Human-in-the-counterfactual-loop” post. “Counterfactual oracles” (plural) is used both as the plural of the resulting system and as a label for the general training procedure. “Human-in-the-counterfactual-loop”, “counterfactual human oversight”, and “counterfactual oversight” all refer to the training procedure (but only when the procedure uses a model of the human).