Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead the humans look at the question and answer it (without looking at the Oracle’s answer) and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex) then the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this”.
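To make the scoring setup concrete, here is a minimal sketch assuming numeric answers. The names `closeness_reward`, `run_episode`, and the `scale` parameter are hypothetical, and the absolute-difference metric is only one deliberately simple choice of "how close". The point is that the reward is computed by a few lines of arithmetic, and the human team's answer is produced without access to the Oracle's output.

```python
from typing import Callable


def closeness_reward(oracle_answer: float, human_answer: float, scale: float = 1.0) -> float:
    """Return a reward in (0, 1]; it equals 1 only when the two answers match exactly.

    Assumption: answers are real numbers and closeness is plain absolute difference.
    """
    return 1.0 / (1.0 + abs(oracle_answer - human_answer) / scale)


def run_episode(
    question: str,
    oracle: Callable[[str], float],
    human_team: Callable[[str], float],
) -> float:
    """Score one question. The human team only ever receives the question."""
    human_answer = human_team(question)   # computed without seeing the Oracle's answer
    oracle_answer = oracle(question)      # passed only to the automated scorer below
    return closeness_reward(oracle_answer, human_answer)
```

For example, `run_episode("q", lambda q: 3.0, lambda q: 4.0)` gives a reward of 0.5 under this metric. Anything richer than numbers (free text, say) needs a more complicated metric, which is where the security argument gets harder.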
Good point. I’m not a huge fan of deferring thinking into similarity metrics (the relative reachability proposal also does this), since this is a complicated thing even in theory, and I suspect a lot turns on how it ends up being defined, but with that caveat aside, this seems reasonable.
Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
It can’t tractably be calculated exactly, but it only goes into calculating the probability of deferring to the humans. Approximating a theoretically well-founded probability of deferring to a human won’t make it unsafe; it will just make it less efficient/capable. For normal neural networks, there isn’t an obvious way to extract the entropy of the belief distribution, but if there were, you could approximate the expected information gain as the expected decrease in entropy. Note that the entropy of the belief distribution is not the entropy of the model’s distribution over outputs: a model could be very certain that the output is Bernoulli(1/2) distributed, and this would entail a belief entropy of ~0, not an entropy of 1 bit. I’m not familiar enough with Bayesian neural networks to know if the entropy would be easy to extract.
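Here is a toy illustration of the belief-entropy versus output-entropy distinction. It is a sketch under strong assumptions: the belief is a discrete distribution over a handful of candidate Bernoulli parameters (the `hypotheses` list and all function names are hypothetical), which is nothing like what you could extract from a real neural network. Expected information gain is computed as the expected decrease in belief entropy after one observation.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def predictive_prob_one(hypotheses, belief):
    """Probability that the next output is 1, averaged over the belief."""
    return sum(w * h for h, w in zip(hypotheses, belief))


def posterior(hypotheses, belief, observation):
    """Bayes update of the belief after seeing one output (0 or 1)."""
    likelihoods = [h if observation == 1 else 1 - h for h in hypotheses]
    unnorm = [w * l for w, l in zip(belief, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def expected_information_gain(hypotheses, belief):
    """Expected decrease in belief entropy from observing one more output."""
    p1 = predictive_prob_one(hypotheses, belief)
    prior_entropy = entropy_bits(belief)
    gain = 0.0
    for obs, p_obs in ((1, p1), (0, 1 - p1)):
        if p_obs > 0:
            post_entropy = entropy_bits(posterior(hypotheses, belief, obs))
            gain += p_obs * (prior_entropy - post_entropy)
    return gain


# Candidate values for the Bernoulli parameter (an assumption for illustration).
hypotheses = [0.1, 0.5, 0.9]

# Certain the output is Bernoulli(1/2): maximal output entropy, zero belief entropy.
certain = [0.0, 1.0, 0.0]

# Torn between 0.1 and 0.9: same predictive distribution, but uncertain beliefs.
uncertain = [0.5, 0.0, 0.5]

for name, belief in (("certain", certain), ("uncertain", uncertain)):
    p1 = predictive_prob_one(hypotheses, belief)
    print(
        name,
        "output entropy:", round(entropy_bits([p1, 1 - p1]), 3),
        "belief entropy:", round(entropy_bits(belief), 3),
        "expected info gain:", round(expected_information_gain(hypotheses, belief), 3),
    )
```

Both beliefs predict a fair coin for the next output, so the output entropy is about 1 bit in each case, but only the second has anything to learn from an observation. That is the quantity that would feed into the probability of deferring to the humans.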
Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
Right. So in this version of an oracle, where it is just outputting a prediction of the output of some future process, I don’t see what it offers that normal sequence prediction doesn’t offer. In our BoMAI discussion, I mentioned a type of oracle I had considered that gave answers which it predicted would cause a (boxed) human to do well on a randomly sampled prediction task, and that kind of oracle could potentially be much more powerful than a counterfactual oracle, but I don’t really see the value of adding something like a counterfactual oracle to a sequence predictor that makes predictions about a sequence that is something like this:
It’s also possible that this scheme runs into grain of truth problems, and the counterfactual oracle gives outputs that are a lot like what I’m imagining this sequence predictor would, in which case, I don’t think sequence prediction would have much to add to the counterfactual oracle proposal.
Sorry, I think you misunderstood my question about combining human imitations with more general oracles/predictors. What I meant is that you could use general oracles/predictors to build models of the world, which the human imitators could then query or use to test out potential actions. This perhaps lets you overcome the problem of human imitators having worse world models than ~AIXI and narrows the capability gap between them.
Sure! The household of people could have another computer inside it that the humans can query, which runs a sequence prediction program trained on other things.