A scheme for safely handling a mixture of good and bad predictors

Paul has written about an open
problem:
can we use active learning to get a good predictor, if all we have is a
mixture of a good predictor and some possibly bad predictors? I’d like to
propose a refinement to his scheme.
A simplified version of Paul’s problem is the following.
Suppose we have n predictors. These predictors might, for example, predict
how a human would answer a given question. One is known to be good; the rest
may be arbitrarily evil. We don’t know which predictor is good. How do
we use this set of predictors to imitate a human almost as well as the good predictor?
Paul proposes using active learning to identify questions on which these
predictors disagree, then asking them to the human. This allows
the system to eliminate some predictors from consideration when they
make bad predictions. However, if any of the predictors is evil, it
might intentionally disagree with the good predictor on a “weird” question,
causing bad results when that question is asked to the human. This seems like
a serious problem: due to simulation warfare, it is not unlikely that
one of the predictors actually is evil.
I propose the following refinement. The system should consider a stream of questions
coming in and answer each one before seeing the next. It should examine each
question and see if the predictors that have scored well in the past disagree
significantly on their predictions of the answer to this question. If so, then
it should ask the question to the human, output the human’s answer, and give each
predictor a score based on how good its prediction of the human’s answer was. If not, then it can go
ahead and output the consensus prediction. Ideally, we could prove that this
algorithm will not ask the human very many questions (preferably, not much more
than linear in n). This setting is similar to the setting of online
selective sampling,
especially the KWIK
(“knows what it knows”) variant.
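As a concrete illustration, here is a minimal sketch of this loop in Python. The specific disagreement measure, the `threshold`, and the multiplicative-weights scoring rule are my own assumptions, not part of the proposal; a real KWIK-style analysis would pin these choices down.

```python
import numpy as np

def selective_sampling(predictors, questions, ask_human, threshold=0.2, eta=1.0):
    """Answer a stream of questions, querying the human only on disagreement.

    predictors: list of functions, question -> probability vector over answers
    ask_human:  function, question -> answer index (the costly oracle)
    threshold:  how much disagreement among well-scoring predictors
                triggers a human query (an assumed tuning parameter)
    """
    n = len(predictors)
    weights = np.ones(n)              # all predictors start equally trusted
    answers, num_queries = [], 0
    for q in questions:
        preds = np.array([p(q) for p in predictors])
        # Only consult predictors that have scored well in the past.
        active = weights >= weights.max() / 2
        spread = preds[active].max(axis=0) - preds[active].min(axis=0)
        if spread.max() > threshold:
            # Significant disagreement: ask the human and score everyone.
            a = ask_human(q)
            num_queries += 1
            # Multiplicative update by likelihood assigned to the human's answer.
            weights *= np.exp(eta * np.log(np.clip(preds[:, a], 1e-9, 1.0)))
            weights /= weights.max()
        else:
            # Consensus: output the weighted-majority answer.
            a = int((weights[active] @ preds[active]).argmax())
        answers.append(a)
    return answers, num_queries
```

Note how an evil predictor can force at most a few queries before its weight collapses: each time it provokes a disagreement and predicts the human's answer badly, its score drops and it is excluded from the active set.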
A single predictor cannot cause the system to ask the human an arbitrary question.
It can bide its time and wait for a particular question to come up naturally,
but this does not give it much power: a question that would come up
naturally is unlikely to be bad to ask to the human, if the rest of the system
is set up correctly.
One problem is that the predictors might disagree at an inopportune time,
when we would really prefer for the system not to have to “pause” to ask the human a question.
Paul discusses this problem
as a general failure mode for counterfactual oversight. I think there’s
a partial solution: try to predict what questions will come up in the future
and find one that the predictors disagree on, so that the human can be
asked at a convenient time. This search should not be
too powerful: in particular, it should avoid looking at the internals
of the predictors, and it should avoid searching through a huge number
of possibilities. If the search for the controversial question is
not powerful, then the question will have low mutual information
with the predictors’ internals, preventing the predictors from
exposing the human to highly optimized questions. KWIK-style
algorithms generally continue to work in this setting.
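To make the “weak search” idea concrete, here is one hypothetical way it might look in code. The sampler `sample_question`, the candidate budget, and the disagreement measure are all my own assumptions; the point is only that candidates are drawn from a model of the question stream, never from the predictors' internals, so the chosen question carries at most about log2(budget) bits of information about them.

```python
import numpy as np

def precompute_query(sample_question, predictors, weights, budget=16):
    """Search a deliberately small pool of predicted future questions
    for one that the currently trusted predictors disagree on.

    sample_question: draws from a model of the upcoming question
    distribution; it must not look at the predictors' internals, so the
    predictors' only influence on the result is the argmax over
    `budget` independently sampled candidates.
    """
    best_q, best_spread = None, 0.0
    active = weights >= weights.max() / 2   # currently trusted predictors
    for _ in range(budget):                 # deliberately limited search
        q = sample_question()
        preds = np.array([p(q) for p in predictors])[active]
        spread = float((preds.max(axis=0) - preds.min(axis=0)).max())
        if spread > best_spread:
            best_q, best_spread = q, spread
    return best_q   # None if no candidate provoked any disagreement
```

The question this returns could then be asked to the human at a convenient time, before it arises naturally in the stream.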
I’m currently working on an algorithm that can handle this application
better than other selective sampling algorithms can; see
Paul’s post
for more details on what makes this application special.
I previously made the mistake of attacking this problem without
doing enough of a literature search to find the selective sampling
literature. I think I will have a better chance of making progress
on this problem after reading over this literature more.