What do you think of the idea of combining oracles with human imitations, which was inspired in part by our conversation here, as a way to approach AIXI-like abilities while still remaining safe? See here for a specific proposal.
Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can’t show the question to a team of humans and have them answer it. In the counterfactual possibility where the question is scored, it isn’t supposed to be viewed by people; otherwise the oracle has an incentive to trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this.
...unless the team of humans is in a box :)
On the topic of counterfactual oracles, if you are trying to predict the answers to questions which can be automatically checked in the future, I am unsure why you would run a counterfactual oracle instead of running sequence prediction on the following sequence, for example:

(“Q1”, Q1), (“Q2”, Q2), (“Q3”, Q3), …, (“Q26”, Q26), (“A1”, A1), (“A2”, A2), (“Q27”, Q27), …, (“A10”,

This should give an estimate of the answer A10 to question Q10, and this can be done before the answer is available. In fact, unlike with the counterfactual oracle, you could do this even if people had to be involved in submitting the answer.
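The interleaved sequence can be sketched in code. This is a minimal illustration of the data layout only; `build_prompt` and its argument names are my own, not part of any proposal:

```python
# Build the interleaved (tag, value) sequence described above, ending with an
# open ("A10", ...) pair for a sequence predictor to complete. All names here
# are illustrative.

def build_prompt(questions, answers, target_index):
    """Interleave tagged questions and already-available answers, then leave
    the target answer's tag open so a sequence predictor can fill it in."""
    seq = [(f"Q{i}", q) for i, q in enumerate(questions, start=1)]
    seq += [(f"A{i}", a) for i, a in enumerate(answers, start=1)]
    seq.append((f"A{target_index}",))  # open pair: the predictor supplies A10
    return seq

prompt = build_prompt(
    questions=[f"question {i}" for i in range(1, 27)],  # Q1..Q26
    answers=["answer 1", "answer 2"],                   # A1, A2 arrived so far
    target_index=10,
)
```

In the actual scheme the entries would be ordered by arrival time (so Q27 can appear after A2, as in the sequence above); the sketch just groups questions and answers for brevity.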
Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can’t show the question to a team of humans and have them answer it.
Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead the humans look at the question and answer it (without looking at the Oracle’s answer) and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex) then the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this”.
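As a toy illustration of why a simple “how close” metric is easy to audit, here is a minimal scorer for numeric answers. The particular metric is my assumption; the argument only requires that whatever metric is used be simple enough to check:

```python
# Toy automated scorer for the scheme above: compares the Oracle's answer to
# the human team's answer with a deliberately simple closeness metric, so the
# reward code is small enough to audit for security holes.

def oracle_reward(oracle_answer: float, human_answer: float) -> float:
    """Reward in (0, 1]; equals 1 exactly when the answers coincide."""
    return 1.0 / (1.0 + (oracle_answer - human_answer) ** 2)
```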
So this is basically just like online supervised learning, except that we randomly determine which episodes we let humans label the data and train the Oracle, and which episodes we use the Oracle to produce answers that we actually use. See Paul’s Counterfactual oversight vs. training data where I got this explanation from. (What he calls counterfactual oversight is just counterfactual oracles applied to human imitation. It seems that he independently (re)invented the core idea.)
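The episode-level logic can be sketched as follows; `ask_humans`, `ask_oracle`, and `train` are placeholders for whatever labeling pipeline and learner one actually has, not a real API:

```python
import random

# Episode-level sketch of the online setup described above. With probability
# p_train the episode is a training episode: the humans answer, the Oracle is
# trained on their label, and nothing the Oracle produced is used. Otherwise
# the Oracle's answer is used in the world but never scored.

def run_episode(question, ask_humans, ask_oracle, train, p_train=0.1, rng=random):
    if rng.random() < p_train:
        label = ask_humans(question)   # humans never see the Oracle's answer
        train(question, label)         # supervised update on the human label
        return None                    # nothing is released this episode
    return ask_oracle(question)        # answer is used but never scored
```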
Let me know if it still doesn’t make sense, and I can try to explain more. (ETA: I actually wrote a top-level post about this.) This is also pretty similar to your HSIFAUH idea, except that you use expected information gain to determine when to let humans label the data instead of selecting randomly. I’m currently unsure what the pros and cons of each are. Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead the humans look at the question and answer it (without looking at the Oracle’s answer) and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex) then the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this”.
Good point. I’m not a huge fan of deferring the hard thinking to similarity metrics (the relative reachability proposal also does this), since similarity is a complicated thing even in theory, and I suspect a lot turns on how the metric ends up being defined, but with that caveat aside, this seems reasonable.
Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
It can’t tractably be calculated exactly, but it only goes into calculating the probability of deferring to the humans. Approximating a theoretically well-founded probability of deferring to a human won’t make it unsafe—that will just make it less efficient/capable. For normal neural networks, there isn’t an obvious way to extract the entropy of the belief distribution, but if there were, you could approximate the expected information gain as the expected decrease in entropy. Note that the entropy of the belief distribution is not the entropy of the model’s distribution over outputs—a model could be very certain that the output is Bernoulli(1/2) distributed, and this would entail an entropy of ~0, not an entropy of 1. I’m not familiar enough with Bayesian neural networks to know if the entropy would be easy to extract.
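One common ML approximation of this distinction (my illustration, not something from the discussion) uses an ensemble as a crude stand-in for a Bayesian posterior: the entropy of the averaged prediction minus the average entropy of the members estimates the belief-distribution uncertainty, and it comes out ~0 in the Bernoulli(1/2) case even though the output entropy is 1 bit:

```python
import math

# Ensemble-based sketch: each member reports P(y = 1). The entropy of the mean
# prediction measures output uncertainty; subtracting the members' mean entropy
# leaves (approximate) belief-distribution uncertainty, i.e. how much the
# "posterior" members disagree.

def entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def output_and_belief_uncertainty(member_probs):
    mean_p = sum(member_probs) / len(member_probs)
    output_h = entropy(mean_p)                         # entropy of the outputs
    mean_member_h = sum(map(entropy, member_probs)) / len(member_probs)
    return output_h, output_h - mean_member_h          # second term ~ belief uncertainty

# Every member is certain the outcome is Bernoulli(1/2): output entropy is
# 1 bit, but belief uncertainty is 0, matching the point above.
certain_coin = output_and_belief_uncertainty([0.5, 0.5, 0.5])

# Members disagree completely: same 1 bit of output entropy, but now it is all
# belief uncertainty, so observing the label would be maximally informative.
disagreement = output_and_belief_uncertainty([0.0, 1.0])
```

The expected information gain could then be approximated by this disagreement term, which is what would feed into the probability of deferring to the humans.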
Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
Right. So in this version of an oracle, where it is just outputting a prediction of the output of some future process, I don’t see what it offers that normal sequence prediction doesn’t offer. On our BoMAI discussion, I mentioned a type of oracle I considered that gave answers which it predicted would cause a (boxed) human to do well on a randomly sampled prediction task, and that kind of oracle could potentially be much more powerful than a counterfactual oracle, but I don’t really see the value of adding something like a counterfactual oracle to a sequence predictor that makes predictions about a sequence that is something like this:

(“Q1”, Q1), (“Q2”, Q2), (“Q3”, Q3), …, (“Q26”, Q26), (“A1”, A1), (“A2”, A2), (“Q27”, Q27), …, (“A10”,
It’s also possible that this scheme runs into grain of truth problems, and the counterfactual oracle gives outputs that are a lot like what I’m imagining this sequence predictor would, in which case, I don’t think sequence prediction would have much to add to the counterfactual oracle proposal.
Sorry, I think you misunderstood my question about combining human imitations with more general oracles/predictors. What I meant is that you could use general oracles/predictors to build models of the world, which the human imitators could then query or use to test out potential actions. This perhaps lets you overcome the problem of human imitators having worse world models than ~AIXI and narrows the capability gap between them.
Sure! The household of people could have another computer inside it that the humans can query, which runs a sequence prediction program trained on other things.