> but the question at stake is how plausible it is that a single AI team with some compute/data advantage relative to incautious AI teams could train ~HSIFAUH to phish well while other teams are still unable to train ~AIXI to take over the world. And the relevant question for evaluating that is whether d << h.
I don’t understand this part. Can you elaborate? Why is this the question at stake? Why is d << h the relevant question for evaluating this?
It seems like you’re imagining using a large number of ~HSIFAUH to take over the world and prevent unaligned AGI from arising. Is that right? How many ~HSIFAUH are you thinking and why do you think that’s enough? For example, what kind of strategies are you thinking of, that would be sufficient to overcome other people’s defenses (before they deploy ~AIXI), using only human-level phishing and other abilities (as opposed to superhuman AIXI-like abilities)?
By ~HSIFAUH I guess you mean a practical implementation/approximation of HSIFAUH. Can you describe how you would do that using ML, so I can more easily compare with other proposals for doing human imitations using ML?
ETA: What do you think of the idea of combining oracles with human imitations, which was inspired in part by our conversation here, as a way to approach AIXI-like abilities while still remaining safe? See here for a specific proposal.
> What do you think of the idea of combining oracles with human imitations, which was inspired in part by our conversation here, as a way to approach AIXI-like abilities while still remaining safe? See here for a specific proposal.
Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can’t show the question to a team of humans and have them answer it. In the counterfactual possibility where the question is scored, it isn’t supposed to be viewed by people; otherwise the oracle has an incentive to trick the scorers into implementing an unsafe AGI which takes over the world and fixes the answer to be whatever message the AGI output to instigate this.
...unless the team of humans is in a box :)
On the topic of counterfactual oracles, if you are trying to predict the answers to questions which can be automatically checked in the future, I am unsure why you would run a counterfactual oracle instead of running sequence prediction on the following sequence, for example:

(“Q1”, Q1), (“Q2”, Q2), (“Q3”, Q3), …, (“Q26”, Q26), (“A1”, A1), (“A2”, A2), (“Q27”, Q27), … (“A10”,

This should give an estimate of the answer A10 to question Q10, and this can be done before the answer is available. In fact, unlike with the counterfactual oracle, you could do this even if people had to be involved in submitting the answer.
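To make the layout concrete, here is one way such a tagged stream could be assembled and queried. This is only a sketch: the tags, questions, and the idea of conditioning on a prefix ending in a bare tag are illustrative, and the sequence predictor itself is left abstract.

```python
# Sketch of the tagged Q/A stream described above. Questions and answers
# are appended as they become available, each paired with a tag; an
# answer can arrive long after its question.
stream = []

def emit(tag, value):
    stream.append((tag, value))

emit("Q1", "What is 2+2?")
emit("Q2", "Capital of France?")
emit("A1", "4")
emit("A2", "Paris")
emit("Q3", "What is 3*3?")

# To estimate A3 before it is available, condition any sequence
# predictor on the prefix ending with the bare tag:
query_prefix = stream + [("A3",)]
print(query_prefix[-1])  # ('A3',)
```

The key property is that the prediction for A3 can be extracted from the predictor before anyone has submitted the real answer, even if humans are involved in producing it later.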
> Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can’t show the question to a team of humans and have them answer it.
Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead the humans look at the question and answer it (without looking at the Oracle’s answer) and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex) then the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this”.
So this is basically just like online supervised learning, except that we randomly determine in which episodes we let humans label the data and train the Oracle, and in which episodes we use the Oracle to produce answers that we actually use. See Paul’s Counterfactual oversight vs. training data, where I got this explanation from. (What he calls counterfactual oversight is just counterfactual oracles applied to human imitation. It seems that he independently (re)invented the core idea.)
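For concreteness, here is a minimal sketch of the randomized scheme described above. `ToyOracle`, `human_team_answer`, and the exact-match `closeness` metric are all hypothetical stand-ins, not anything specified in the thread.

```python
import random

def closeness(a, b):
    # Stand-in scoring metric: 1.0 for exact match, else 0.0. The point
    # above is that this metric should be simple enough to audit for
    # security holes.
    return 1.0 if a == b else 0.0

class ToyOracle:
    """Hypothetical oracle that just memorizes human answers it has seen."""
    def __init__(self):
        self.memory = {}

    def answer(self, question):
        return self.memory.get(question, "unknown")

    def update(self, question, human_answer, reward):
        # A real oracle would do a learning step on the reward; the toy
        # version simply memorizes the human label when it scored poorly.
        if reward < 1.0:
            self.memory[question] = human_answer

def run_episode(oracle, question, human_team_answer, p_train=0.1):
    # With probability p_train this is a training episode: the humans
    # answer WITHOUT seeing the oracle's output, and an automated system
    # scores the oracle. Otherwise the oracle's answer is actually used,
    # and no human sees it during scoring.
    oracle_answer = oracle.answer(question)
    if random.random() < p_train:
        human_answer = human_team_answer(question)  # never shown oracle_answer
        oracle.update(question, human_answer,
                      closeness(oracle_answer, human_answer))
        return human_answer
    return oracle_answer
```

Setting `p_train=1.0` forces a training episode and `p_train=0.0` a deployment episode; the random split between labeling and use is exactly the point of the scheme.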
Let me know if it still doesn’t make sense, and I can try to explain more. (ETA: I actually wrote a top-level post about this.) This is also pretty similar to your HSIFAUH idea, except that you use expected information gain to determine when to let humans label the data instead of selecting randomly. I’m currently unsure what the pros and cons of each are. Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
> Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead the humans look at the question and answer it (without looking at the Oracle’s answer) and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex) then the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fix the answer to be whatever message was output by the AGI to instigate this”.
Good point. I’m not a huge fan of deferring thinking to similarity metrics (the relative reachability proposal also does this), since this is a complicated thing even in theory, and I suspect a lot turns on how it ends up being defined, but with that caveat aside, this seems reasonable.
> Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
It can’t tractably be calculated exactly, but it only goes into calculating the probability of deferring to the humans. Approximating a theoretically-well-founded probability of deferring to a human won’t make it unsafe—that will just make it less efficient/capable. For normal neural networks, there isn’t an obvious way to extract the entropy of the belief distribution, but if there were, you could approximate the expected information gain as the expected decrease in entropy. Note that the entropy of the belief distribution is not the entropy of the model’s distribution over outputs—a model could be very certain that the output is Bernoulli(1/2)-distributed, and this would entail a belief entropy of ~0, not an entropy of 1 bit. I’m not familiar enough with Bayesian neural networks to know whether the entropy would be easy to extract.
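To make the Bernoulli(1/2) point concrete, here is a sketch using an ensemble as a crude stand-in for the belief distribution (my choice of approximation, not something from the thread): the expected information gain from observing the label is approximated as the entropy of the mean prediction minus the mean entropy of the members’ predictions.

```python
import math

def binary_entropy(p):
    # Entropy of a Bernoulli(p) outcome, in bits.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_info_gain(member_probs):
    # H(mean prediction) - mean H(member prediction): large only when
    # the members (the "belief distribution") disagree about the outcome.
    mean_p = sum(member_probs) / len(member_probs)
    avg_member_entropy = sum(binary_entropy(p) for p in member_probs) / len(member_probs)
    return binary_entropy(mean_p) - avg_member_entropy

# Certain the output is Bernoulli(1/2): predictive entropy is 1 bit, but
# the members agree, so observing the label teaches the model nothing.
print(expected_info_gain([0.5, 0.5, 0.5]))  # 0.0

# Genuinely uncertain belief: sharp disagreement between members.
print(expected_info_gain([0.01, 0.99]))     # ~0.92 bits
```

The first case is exactly the comment’s point: output entropy of 1 bit, belief entropy (and hence expected information gain) of ~0.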
> Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
Right. So in this version of an oracle, where it is just outputting a prediction of the output of some future process, I don’t see what it offers that normal sequence prediction doesn’t. In our BoMAI discussion, I mentioned a type of oracle I considered that gave answers which it predicted would cause a (boxed) human to do well on a randomly sampled prediction task, and that kind of oracle could potentially be much more powerful than a counterfactual oracle, but I don’t really see the value of adding something like a counterfactual oracle to a sequence predictor that makes predictions about a sequence like the tagged Q/A sequence I described above.
It’s also possible that this scheme runs into grain of truth problems, and the counterfactual oracle gives outputs that are a lot like what I’m imagining this sequence predictor would, in which case, I don’t think sequence prediction would have much to add to the counterfactual oracle proposal.
Sorry, I think you misunderstood my question about combining human imitations with more general oracles/predictors. What I meant is that you could use general oracles/predictors to build models of the world, which the human imitators could then query or use to test out potential actions. This perhaps lets you overcome the problem of human imitators having worse world models than ~AIXI and narrows the capability gap between them.
Sure! The household of people could have another computer inside it that the humans can query, which runs a sequence prediction program trained on other things.
> It seems like you’re imagining using a large number of ~HSIFAUH to take over the world and prevent unaligned AGI from arising. Is that right? How many ~HSIFAUH are you thinking and why do you think that’s enough? For example, what kind of strategies are you thinking of, that would be sufficient to overcome other people’s defenses (before they deploy ~AIXI), using only human-level phishing and other abilities (as opposed to superhuman AIXI-like abilities)?
Well, that was the question I originally posed here, but the sense I got from commenters was that people thought this was easy to pull off and that the only question was whether it was safe. So I’m not sure for what N it’s the case that N machines running agents doing human-level stuff would be enough to take over the world. I’m pretty sure N = 7 billion is enough. And I think it’s plausible that after a discussion about this, I could become confident that N = 1000 was enough. Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N. So it seemed worth having a discussion, but I am not yet prepared to defend a low enough N to make this obviously viable.
Forgetting about the possibility of exponentially growing N for a moment, and turning to
> Why is d << h the relevant question for evaluating this?
Yeah, I wrote that post too quickly—this is wrong. (I was thinking of the leading team running HSIFAUH needing to go through d+h timesteps to get to good performance, but they just need to run through d, which makes things easier.) Sorry about that.

Let f be the amount of compute that the leading project has divided by the compute that the leading reckless project has, and suppose d > 0 (that’s all we need, actually). Then it takes the leading reckless team at least f times as long to get to AIXI taking over the world as it takes the leading team to get to SolomonoffPredict predicting a human trying to do X; using similar tractable approximation strategies (whatever those turn out to be), we can expect it to take f times as long for the leading reckless team to get to ~AIXI as it takes the leading team to get to ~SolomonoffPredict. ~HSIFAUH is more complicated, with the added resource cost of employing the humans you learn to imitate, but this resource requirement goes down by the time you’re deploying it toward useful things. Naively (and you might be able to do better than this), you could run f copies of ~HSIFAUH and get to human-level performance on some relevant tasks around the same time the reckless team takes over the world. So the question is whether N = f is a big enough N.

In the train-then-deploy framework, it seems today like training takes much more compute than deploying, which makes it easier for the leading team to let N >> f once all the resources dedicated to training get freed up. It should be possible to weaken the online version and get some of this speedup.
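The timing claim above can be sketched with toy arithmetic (illustrative numbers only): if both milestones cost roughly the same compute C and the leading project has f times the compute rate of the reckless one, the leading project finishes in 1/f of the time.

```python
def milestone_times(compute_needed, reckless_rate, f):
    # Toy model: the reckless project reaches ~AIXI at C / r, while the
    # leading project reaches ~SolomonoffPredict at C / (f * r), leaving
    # a window of (C / r) * (1 - 1/f) to deploy copies of ~HSIFAUH.
    t_leading = compute_needed / (f * reckless_rate)
    t_reckless = compute_needed / reckless_rate
    return t_leading, t_reckless

# With f = 4, the leading team finishes in a quarter of the time.
t_lead, t_reck = milestone_times(compute_needed=100.0, reckless_rate=1.0, f=4.0)
print(t_lead, t_reck)  # 25.0 100.0
```

The "equal compute cost" assumption is mine, standing in for "using similar tractable approximation strategies" above; the real comparison could differ in either direction.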
> By ~HSIFAUH I guess you mean a practical implementation/approximation of HSIFAUH. Can you describe how you would do that using ML, so I can more easily compare with other proposals for doing human imitations using ML?
I don’t know how to do this. But it’s the same stuff the reckless team is doing to make standard RL powerful.
In another comment you said “If I’m understanding correctly, the concern is that the imitator learns how humans plan before learning what humans want, so then it plans like a human toward the achievement of some inhuman goal. I don’t think this causes an existential catastrophe.” But if there are 7 billion HSIFAUH which are collectively capable of taking over the world, how is it not a potential existential catastrophe if they have inhuman values?
> Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N.
How? And why would it grow fast enough to get to a large enough N before someone deploys ~AIXI?
> It should be possible to weaken the online version and get some of this speedup.
What do you have in mind here?
> I don’t know how to do this. But it’s the same stuff the reckless team is doing to make standard RL powerful.
You do have to solve some safety problems that the reckless team doesn’t though, don’t you? What do you think the main safety problems are?
> Why? What are those 7 billion HSIFAUH doing?

Well, the number comes from the idea of one-to-one monitoring. Obviously, there’s other stuff to do to establish a stable unipolar world order, but monitoring seems like the most resource-intensive part, so it’s an order-of-magnitude estimate. Also, realistically, one person could monitor ten people, so that was an order-of-magnitude estimate with some leeway.
> But if there are 7 billion HSIFAUH which are collectively capable of taking over the world, how is it not a potential existential catastrophe if they have inhuman values?
I think they can be controlled. Whoever is providing the observations to any instance of HSIFAUH has an arsenal of carrots and sticks (just by having certain observations correlate with actual physical events that occur in the household(s) of the humans that generate the data), and I think merely human-level intelligence can be kept in check by someone in a position of power over it. So I think real humans could stay at the wheel over 7 billion instances of HSIFAUH. (I mean, this is teetering at the edge of existential catastrophe already, given the existence of simulations of people who might have the experience of being imprisoned, but I think with careful design of the training data this could be avoided.) But in terms of extinction threat to real-world humans, this starts to look more like the problem of maintaining a power structure over a vast number of humans and less like typical AI alignment difficulties; historically, the former seems to be a solvable problem.
>Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N.
> How? And why would it grow fast enough to get to a large enough N before someone deploys ~AIXI?
Right, this analysis gets complicated because you have to analyze the growth rate of N. Given your lead time from having more computing power than the reckless team, one has to analyze how many doubling periods you have time for. I hear Robin Hanson is the person to read regarding questions like this. I don’t have any opinions here. But the basic structure regarding “How?” is: spend some fraction of computing resources making money, then buy more computing resources with that money.
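The doubling arithmetic is straightforward to sketch (the numbers below are purely illustrative, not estimates anyone made in this thread):

```python
def copies_after_lead_time(n0, lead_time_days, doubling_period_days):
    # If earnings are reinvested in compute, N grows as n0 * 2**k where
    # k is the number of doubling periods that fit in the lead time.
    doublings = lead_time_days // doubling_period_days
    return n0 * 2 ** doublings

# E.g., N = 10 copies, a 60-day doubling period, and a two-year lead
# allow 12 doublings: 10 * 2**12 = 40,960 copies.
print(copies_after_lead_time(10, 730, 60))  # 40960
```

Whether any particular doubling period is achievable is exactly the Hanson-style growth question the comment defers on.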
>It should be possible to weaken the online version and get some of this speedup.
> What do you have in mind here?
Well, nothing in particular when I wrote that, but thank you for pushing me. Maybe only update the posterior at some timesteps (doing so infinitely many times, but with diminishing frequency). Or more generally, you divide resources between searching for programs that retrodict observed behavior and running copies of the best one found so far, and you shift resource allocation toward the latter over time.
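One concrete schedule with the stated property (infinitely many posterior updates, but with diminishing frequency) is to update only at powers of two; this is just one illustrative choice, not something the comment commits to.

```python
def should_update(t):
    # True exactly when t is a power of two, so updates happen infinitely
    # often but the fraction of timesteps spent updating shrinks toward
    # zero, freeing compute for running copies of the current best program.
    return t > 0 and (t & (t - 1)) == 0

updates = [t for t in range(1, 1025) if should_update(t)]
print(len(updates))  # 11 updates in the first 1024 timesteps (~1%)
```

Any schedule whose update times have density zero but infinite cardinality would do as well.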
> You do have to solve some safety problems that the reckless team doesn’t though, don’t you? What do you think the main safety problems are?
If it turns out you have to do special things to avoid mesa-optimizers, then yes. Otherwise, I don’t think you have to deal with other safety problems if you’re just aiming to imitate human behavior.
> Obviously, there’s other stuff to do to establish a stable unipolar world order
I was asking about this part. I’m not convinced HSIFAUH allows you to do this in a safe way (e.g., without triggering a war that you can’t necessarily win).
> Given your lead time from having more computing power than the reckless team, one has to analyze how many doubling periods you have time for.
Another complication here is that the people trying to build ~AIXI can probably build an economically useful ~AIXI using less compute than you need for ~HSIFAUH (for jobs that don’t need to model humans), and start doing their own doublings.
> But in terms of extinction threat to real-world humans, this starts to look more like the problem of maintaining a power structure over a vast number of humans and less like typical AI alignment difficulties; historically, the former seems to be a solvable problem.
I don’t think we’ve seen a solution that’s very robust though. Plus, having to maintain such a power structure starts to become a human safety problem for the real humans (i.e., potentially causes their values to become corrupted).
> Another complication here is that the people trying to build ~AIXI can probably build an economically useful ~AIXI using less compute than you need for ~HSIFAUH (for jobs that don’t need to model humans), and start doing their own doublings.
Good point.
Regarding the other two points, my intuition was that a few dozen people could work out the details satisfactorily in a year. If you don’t share this intuition, I’ll adjust downward on that. But I don’t feel up to putting in those man-hours myself. It seems like there are lots of people without a technical background who are interested in helping avoid AI-based X-risk. Do you think this is a promising enough line of reasoning to be worth some people’s time?
> Regarding the other two points, my intuition was that a few dozen people could work out the details satisfactorily in a year. If you don’t share this intuition, I’ll adjust downward on that.
I’m pretty skeptical of this, but then I’m pretty skeptical of all current safety/alignment approaches and this doesn’t seem especially bad by comparison, so I think it might be worth including in a portfolio approach. But I’d like to better understand why you think it’s promising. Do you have more specific ideas of how ~HSIFAUH can be used to achieve a Singleton and to keep it safe, or just a general feeling that it should be possible?
My intuitions are mostly that if you can provide significant rewards and punishments basically for free to imitated humans (or more to the point, to memories thereof), and if you can control the flow of information throughout the whole apparatus, and you have total surveillance automatically, this sort of thing is a dictator’s dream. Especially because it usually costs money to make people happy, and in this case it hardly does—just a bit of computation time. In a world with all the technology in place that a dictator could want, but where it’s also pretty cheap to make everyone happy, it strikes me as promising that the system itself could be kept under control.