How does iterated amplification achieve this? My understanding was that it simulates scaling up the number of people (a la HCH), not giving one person more time.
Yeah, sorry, that’s right, I was speaking pretty loosely. You’d still have the same hope—maybe a team of 2^100 copies of the business owner could draft a contract just as well as, or better than, an already expert business owner. I just personally find it easier to think about “benefits of a human thinking for a long time” and then “does HCH get the same benefits as humans thinking for a long time” and then “does iterated amplification get the same benefits as HCH”.
Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from??? Both you and Ajeya apparently have this idea, so presumably it was in the water at some point? Yet I don’t see any reason at all to expect it to do anything remotely similar to that.
The intuition for it is something like this: suppose I’m trying to make a difficult decision, like where to buy a house. There are hundreds of cities I’d be open to, each one has dozens of neighborhoods, and each neighborhood has dozens of important features, like safety, fun things to do, walkability, price per square foot, etc. If I had a long time, I would check out each neighborhood in each city in turn and examine how it does on each dimension, and pick the best neighborhood.
If I instead had an army of clones of myself, I could send many of them to each possible neighborhood, with each clone examining one dimension in one neighborhood. The mes that were all checking out different aspects of neighborhood X can send up an aggregated judgment to a me that is in charge of “holistic judgment of neighborhood X”, and the mes that focus on holistic judgments of neighborhoods can do a big pairwise bracket to filter up a decision to the top me.
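Here is a minimal Python sketch of that clone decomposition, just to make the structure concrete. The neighborhood data, the averaging rule, and the winner-stays comparison standing in for the “pairwise bracket” are all illustrative assumptions, not anything specified in the discussion.

```python
# A toy sketch of the clone decomposition described above.
# The data, the averaging rule, and the winner-stays comparison are
# all made-up assumptions for illustration.

from statistics import mean

# Hypothetical data: city -> neighborhood -> feature scores in [0, 10].
NEIGHBORHOODS = {
    "Springfield": {
        "Downtown":  {"safety": 6, "fun": 9, "walkability": 8, "price": 4},
        "Riverside": {"safety": 8, "fun": 5, "walkability": 6, "price": 7},
    },
    "Shelbyville": {
        "Old Town":  {"safety": 7, "fun": 6, "walkability": 7, "price": 6},
        "Hillcrest": {"safety": 9, "fun": 4, "walkability": 5, "price": 8},
    },
}

def feature_clone(city: str, hood: str, feature: str) -> float:
    """One clone examines a single feature of a single neighborhood."""
    return NEIGHBORHOODS[city][hood][feature]

def holistic_clone(city: str, hood: str) -> float:
    """A clone in charge of 'holistic judgment of neighborhood X'
    aggregates the per-feature judgments sent up by feature clones."""
    features = NEIGHBORHOODS[city][hood]
    return mean(feature_clone(city, hood, f) for f in features)

def top_clone() -> tuple:
    """The top clone compares holistic judgments pairwise, keeping the
    winner each time (a linear stand-in for the 'pairwise bracket')."""
    best = None
    for city, hoods in NEIGHBORHOODS.items():
        for hood in hoods:
            score = holistic_clone(city, hood)
            if best is None or score > best[2]:
                best = (city, hood, score)
    return best[0], best[1]

print(top_clone())  # ('Springfield', 'Downtown') under this toy scoring
```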
I see, so it’s basically assuming that problems factor.

Yeah, in the context of a larger alignment scheme, it’s assuming that in particular the problem of answering the question “How good is the AI’s proposed action?” will factor down into sub-questions of manageable size.
I had formed an impression that the hope was that the big chain of short thinkers would in fact do a good enough job factoring their goals that it would end up comparable to one human thinking for a long time (and that Ought was founded to test that hypothesis).
That’s what I have in mind. If all goes well you can think of it like “a human thinking a long time.” We don’t know if all will go well.
It’s also not really clear what “a human thinking 10,000 years” means; HCH is kind of an operationalization of that, but there’s a presumption of alignment in the human-thinking-for-a-long-time picture that we don’t get for free here. (Of course you also wouldn’t get it for free if you somehow let a human live for 10,000 years...)
Well, Paul’s original post presents HCH as the specification of a human’s enlightened judgment.
For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.
And if we follow the links to Paul’s previous post about this concept, he does describe his ideal implementation of considered judgment (what would become HCH) using the intuition of thinking for a decent amount of time.
To define my considered judgment about a question Q, suppose I am told Q and spend a few days trying to answer it. But in addition to all of the normal tools—reasoning, programming, experimentation, conversation—I also have access to a special oracle. I can give this oracle any question Q’, and the oracle will immediately reply with my considered judgment about Q’. And what is my considered judgment about Q’? Well, it’s whatever I would have output if we had performed exactly the same process, starting with Q’ instead of Q.
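To make the recursive shape of that definition concrete, here is a rough Python sketch. The depth cutoff, the toy_deliberate stand-in, and the specific strings are added scaffolding (the original definition has no explicit base case; it is a fixed point), so treat it as an illustration of the structure rather than an implementation of Paul’s proposal.

```python
# A rough sketch of the recursive structure in the quoted definition.
# The depth cutoff and the toy_deliberate stand-in are added scaffolding:
# the original definition has no explicit base case (it is a fixed point),
# so this only illustrates the shape of the recursion.

from typing import Callable

# A "deliberator" is a human spending a few days on a question, with
# access to an oracle they may consult on any sub-question.
Deliberator = Callable[[str, Callable[[str], str]], str]

def considered_judgment(question: str, deliberate: Deliberator, depth: int) -> str:
    def oracle(sub_question: str) -> str:
        if depth == 0:
            # Added base case: answer without further consultation.
            return deliberate(sub_question, lambda q: "(no oracle available)")
        # By definition, the oracle's answer to Q' is the considered
        # judgment about Q': the same process, started on Q' instead of Q.
        return considered_judgment(sub_question, deliberate, depth - 1)

    return deliberate(question, oracle)

# Toy stand-in for the human: consults the oracle on "hard" questions.
def toy_deliberate(q: str, ask: Callable[[str], str]) -> str:
    if q.startswith("hard:"):
        return "after consulting: " + ask("easy:" + q[len("hard:"):])
    return f"my quick take on {q!r}"

print(considered_judgment("hard: where should I live?", toy_deliberate, depth=2))
```

Roughly speaking, HCH then operationalizes the oracle by handing each sub-question to another copy of the human, who can in turn consult further copies.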
So it looks to me like “HCH captures the judgment of the human after thinking for a long time” is definitely a claim made in the post defining the concept. Whether it actually holds is another (quite interesting) question, to which I don’t know the answer.
A line of thought about this that I explore in Epistemology of HCH is the comparison between HCH and CEV: the former is more operationally concrete (what I call an intermediary alignment scheme), but the latter can directly state the properties it has (like giving the same decision that the human would give after thinking for a long time), whereas for HCH we need to argue for them.
I agree with the other responses from Ajeya / Paul / Raemon, but to add some more info:
Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from???
… I don’t really know. My guess is that I picked it up from reading giant comment threads between Paul and other people.
I don’t see any reason at all to expect it to do anything remotely similar to that.
Tbc it doesn’t need to be literally true. The argument needed for safety is something like “a large team of copies of non-expert agents could together be as capable as an expert”. I see the argument “it’s probably possible for a team of agents to mimic one agent thinking for a long time” as mostly an intuition pump for why that might be true.
“As capable as an expert” makes more sense. Part of what’s confusing about “equivalent to a human thinking for a long time” is that it’s picking out one very particular way of achieving high capability, but really it’s trying to point to a more-general notion of “HCH can solve lots of problems well”. Makes it sound like there’s some structural equivalence to a human thinking for a long time, which there isn’t.

Yes, I explicitly agree with this, which is why the first thing in my previous response was:

My understanding is that HCH is a proposed quasi-algorithm for replicating the effects of a human thinking for a long time.
HCH is more like an infinite bureaucracy. You have some underlings who you can ask to think for a short time, and those underlings have underlings of their own who they can ask to think for a short time, and so on. Nobody in HCH thinks for a long time, though the total thinking time of one person and their recursive-underlings may be long.
(This is exactly why factored cognition is so important for HCH & co: the thinking all has to be broken into bite-size pieces, which can be spread across people.)
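For a toy sense of the arithmetic here: total thinking time grows geometrically with the depth of delegation even though each individual’s budget stays fixed. The 15-minute budget, branching factor, and depth below are arbitrary assumptions.

```python
# Toy arithmetic for the point above: every node thinks for a short,
# fixed time, yet the total thinking time of one person plus all their
# recursive underlings grows geometrically. The 15-minute budget,
# branching factor, and depth are arbitrary made-up parameters.

MINUTES_PER_PERSON = 15   # no individual ever thinks longer than this
BRANCHING = 10            # underlings each person consults
DEPTH = 4                 # levels of delegation below the top person

def total_minutes(branching: int, depth: int) -> int:
    """Total thinking time of one person and their recursive underlings."""
    if depth == 0:
        return MINUTES_PER_PERSON
    return MINUTES_PER_PERSON + branching * total_minutes(branching, depth - 1)

# 15 * (1 + 10 + 100 + 1,000 + 10,000) minutes ~= 2,778 hours of total
# thinking, even though nobody thought for more than 15 minutes.
print(total_minutes(BRANCHING, DEPTH) / 60, "hours in total")
```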
Yes sorry — I’m aware that in the HCH procedure no one human thinks for a long time. I’m generally used to mentally abstracting HCH (or whatever scheme fits that slot) as something that could “effectively replicate the benefits you could get from having a human thinking a long time,” in terms of the role that it plays in an overall scheme for alignment. This isn’t guaranteed to work out, of course. My position is similar to Rohin’s above:
I just personally find it easier to think about “benefits of a human thinking for a long time” and then “does HCH get the same benefits as humans thinking for a long time” and then “does iterated amplification get the same benefits as HCH”.