Ajeya Cotra comments on The case for aligning narrowly superhuman models

Ajeya Cotra 6 Mar 2021 2:06 UTC
LW: 19 AF: 12
AF
Thanks for the comment! Just want to explicitly pull out and endorse this part:

the experts be completely and totally absent from the training process, and in particular no data from the experts should be involved in the training process

I should have emphasized that more in the original post as a major goal. I think you might be right that it will be hard to solve the “sandwich” problem without conceptual progress, but I also think that attempts to solve the sandwich problem could directly spur that progress (not just reveal the need for it, but also take steps toward finding actual algorithms in the course of doing one of the sandwich problems).

I also broadly agree with you that “things looking good to humans without actually being good” is a major problem to watch out for. But I don’t think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback. (E.g., one of the papers I link in the post is a mainstream ML paper amplifying a weak training signal into a better one.)
- johnswentworth 6 Mar 2021 16:20 UTC
  LW: 6 AF: 4
  AF Parent
  But I don’t think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback.
  I partially agree with this; alignment is a bottleneck to value for GPT, and actually aligning it would likely produce some very impressive stuff. My disagreement is that it’s a lot easier to make something which looks impressive than something which solves a Hard problem (like the sandwich problem), and therefore most impressive-looking “solutions” will probably circumvent the key part of the problem. And if the Hard problem is indeed hard enough to not be solved by anyone, the most impressive-looking results will be those which look good without actually solving it.
  - Ajeya Cotra 6 Mar 2021 16:46 UTC
    LW: 6 AF: 4
    AF Parent
    I guess the crux here is “And if the Hard problem is indeed hard enough to not be solved by anyone”: I don’t think that’s the default/expected outcome. There hasn’t been that much effort on this problem in the scheme of things, and I think we don’t know where it falls in the range from “pretty easy” to “very hard” right now.
    - johnswentworth 6 Mar 2021 18:08 UTC
      LW: 22 AF: 11
      AF Parent
      Ah… I think we have an enormous amount of evidence on very-similar problems.
      For instance: consider a lawyer and a business owner putting together a contract. The business owner has a rough intuitive idea of what they want, but lacks expertise on contracts/law. The lawyer has lots of knowledge about contracts/law, but doesn’t know what the business owner wants. The business owner is like our non-expert humans; the lawyer is like GPT.
      In this analogy, the analogue of an expert human would be a business owner who is also an expert in contracts/law. The analogue of the “sandwich problem” would be to get the lawyer + non-expert business-owner to come up with a contract as good as the expert business-owner would. This sort of problem has been around for centuries, and I don’t think we have a good solution in practice; I’d expect the expert business-owner to usually come up with a much better contract.
      This sort of problem comes up all the time in real-world businesses. We could just as easily consider a product designer at a tech startup (who knows what they want but little about coding), an engineer (who knows lots about coding but doesn’t understand what the designer wants), versus a product designer who’s also a fluent coder and familiar with the code base. I’ve experienced this one first-hand; the expert product designer is way better. Or, consider a well-intentioned mortgage salesman, who wants to get their customer the best mortgage for them, and the customer who understands the specifics of their own life but knows nothing about mortgages. Will they end up with as good a mortgage as a customer who has expertise in mortgages themselves? Probably not. (I’ve seen this one first-hand too.)
      - Rohin Shah 6 Mar 2021 22:24 UTC
        LW: 5 AF: 5
        AF Parent
        One approach is to let the human giving feedback think for a long time. Maybe the business owner by default can’t write a good contract, but a business owner who could study the relevant law for a year would do just as well as the already expert business-owner. In the real world this is too expensive to do, but there’s hope in the AI case (e.g. that’s a hope behind iterated amplification).
        johnswentworth 6 Mar 2021 23:06 UTC
        LW: 5 AF: 4
        AF Parent
        How does iterated amplification achieve this? My understanding was that it simulates scaling up the number of people (a la HCH), not giving one person more time.
        Rohin Shah 7 Mar 2021 0:45 UTC
        LW: 5 AF: 5
        AF Parent
        Yeah, sorry, that’s right, I was speaking pretty loosely. You’d still have the same hope: maybe a team of 2^100 copies of the business owner could draft a contract just as well as, or better than, an already expert business-owner. I just personally find it easier to think about “benefits of a human thinking for a long time” and then “does HCH get the same benefits as humans thinking for a long time” and then “does iterated amplification get the same benefits as HCH”.
        johnswentworth 7 Mar 2021 5:44 UTC
        LW: 7 AF: 4
        AF Parent
        Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from??? Both you and Ajeya apparently have this idea, so presumably it was in the water at some point? Yet I don’t see any reason at all to expect it to do anything remotely similar to that.
        Ajeya Cotra 7 Mar 2021 6:15 UTC
        LW: 7 AF: 5
        AF Parent
        The intuition for it is something like this: suppose I’m trying to make a difficult decision, like where to buy a house. There are hundreds of cities I’d be open to, each one has dozens of neighborhoods, and each neighborhood has dozens of important features, like safety, fun things to do, walkability, price per square foot, etc. If I had a long time, I would check out each neighborhood in each city in turn and examine how it does on each dimension, and pick the best neighborhood.
        
        If I instead had an army of clones of myself, I could send many of them to each possible neighborhood, with each clone examining one dimension in one neighborhood. The mes that were all checking out different aspects of neighborhood X can send up an aggregated judgment to a me that is in charge of “holistic judgment of neighborhood X”, and the mes that focus on holistic judgments of neighborhoods can do a big pairwise bracket to filter up a decision to the top me.
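The clone-army procedure above can be sketched in code. This is only an illustration under stated assumptions: the neighborhood names and scores are made up, the "holistic judgment" is assumed to be a plain average of the per-feature reports, and the bracket is a simple single-elimination tournament. None of these choices come from the comment itself.

```python
from statistics import mean

# Hypothetical data: (city, neighborhood) -> {feature: score from 0 to 10}.
NEIGHBORHOODS = {
    ("Springfield", "Old Town"): {"safety": 7, "walkability": 9, "price": 4},
    ("Springfield", "Riverside"): {"safety": 8, "walkability": 6, "price": 7},
    ("Shelbyville", "Downtown"): {"safety": 5, "walkability": 8, "price": 7},
}

def leaf_clone(name, feature):
    """One clone examines one feature of one neighborhood."""
    return NEIGHBORHOODS[name][feature]

def neighborhood_clone(name):
    """Aggregates leaf reports into a holistic score (assumed: plain mean)."""
    return mean(leaf_clone(name, f) for f in NEIGHBORHOODS[name])

def pairwise_clone(a, b):
    """Compares two neighborhoods; keeps the first one on ties."""
    return a if neighborhood_clone(a) >= neighborhood_clone(b) else b

def bracket(candidates):
    """Single-elimination bracket that filters one winner up to the top clone."""
    current = list(candidates)
    while len(current) > 1:
        winners = [pairwise_clone(current[i], current[i + 1])
                   for i in range(0, len(current) - 1, 2)]
        if len(current) % 2 == 1:  # the odd one out advances with a bye
            winners.append(current[-1])
        current = winners
    return current[0]

best = bracket(NEIGHBORHOODS)
```

The sketch only works because each feature can be scored independently and the scores can be combined by a fixed rule; whether real decisions decompose that cleanly is exactly what is in question.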
        johnswentworth 7 Mar 2021 6:50 UTC
        LW: 7 AF: 5
        AF Parent
        I see, so it’s basically assuming that problems factor.
        Ajeya Cotra 7 Mar 2021 7:07 UTC
        LW: 7 AF: 3
        AF Parent
        Yeah, in the context of a larger alignment scheme, it’s assuming that in particular the problem of answering the question “How good is the AI’s proposed action?” will factor down into sub-questions of manageable size.
        Raemon 7 Mar 2021 6:00 UTC
        LW: 6 AF: 4
        AF Parent
        I had formed an impression that the hope was that the big chain of short thinkers would in fact do a good enough job factoring their goals that it would end up comparable to one human thinking for a long time (and that Ought was founded to test that hypothesis)
        paulfchristiano 7 Mar 2021 8:19 UTC
        LW: 9 AF: 7
        AF Parent
        That’s what I have in mind. If all goes well you can think of it like “a human thinking a long time.” We don’t know if all will go well.
        It’s also not really clear what “a human thinking 10,000 years” means; HCH is kind of an operationalization of that, but there’s a presumption of alignment in the human-thinking-a-long-time that we don’t get for free here. (Of course you also wouldn’t get it for free if you somehow let a human live for 10,000 years...)
        adamShimi 7 Mar 2021 21:53 UTC
        LW: 5 AF: 4
        AF Parent
        Well, Paul’s original post presents HCH as the specification of a human’s enlightened judgment:
        For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.
        And if we follow the links to Paul’s previous post about this concept, he does describe his ideal implementation of considered judgment (what will become HCH) using the intuition of thinking for a decent amount of time.
        To define my considered judgment about a question Q, suppose I am told Q and spend a few days trying to answer it. But in addition to all of the normal tools—reasoning, programming, experimentation, conversation—I also have access to a special oracle. I can give this oracle any question Q’, and the oracle will immediately reply with my considered judgment about Q’. And what is my considered judgment about Q’? Well, it’s whatever I would have output if we had performed exactly the same process, starting with Q’ instead of Q.
        So it looks to me like “HCH captures the judgment of the human after thinking for a long time” is definitely a claim made in the post defining the concept. Whether it actually holds is another (quite interesting) question, to which I don’t know the answer.
        A line of thought about this that I explore in Epistemology of HCH is the comparison between HCH and CEV: the former is more operationally concrete (what I call an intermediary alignment scheme), but the latter can directly state the properties it has (like giving the same decision the human would give after thinking for a long time), whereas we need to argue for them in HCH.
        Rohin Shah 7 Mar 2021 16:52 UTC
        LW: 5 AF: 5
        AF Parent
        I agree with the other responses from Ajeya / Paul / Raemon, but to add some more info:
        Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from???
        … I don’t really know. My guess is that I picked it up from reading giant comment threads between Paul and other people.
        I don’t see any reason at all to expect it to do anything remotely similar to that.
        Tbc it doesn’t need to be literally true. The argument needed for safety is something like “a large team of copies of non-expert agents could together be as capable as an expert”. I see the argument “it’s probably possible for a team of agents to mimic one agent thinking for a long time” as mostly an intuition pump for why that might be true.
        johnswentworth 7 Mar 2021 17:33 UTC
        LW: 5 AF: 4
        AF Parent
        “As capable as an expert” makes more sense. Part of what’s confusing about “equivalent to a human thinking for a long time” is that it’s picking out one very particular way of achieving high capability, but really it’s trying to point to a more-general notion of “HCH can solve lots of problems well”. Makes it sound like there’s some structural equivalence to a human thinking for a long time, which there isn’t.
        Rohin Shah 7 Mar 2021 18:46 UTC
        LW: 6 AF: 5
        AF Parent
        Makes it sound like there’s some structural equivalence to a human thinking for a long time, which there isn’t.
        Yes, I explicitly agree with this, which is why the first thing in my previous response was
        sorry, that’s right, I was speaking pretty loosely.
        Ajeya Cotra 7 Mar 2021 0:01 UTC
        LW: 3 AF: 3
        AF Parent
        My understanding is that HCH is a proposed quasi-algorithm for replicating the effects of a human thinking for a long time.
        johnswentworth 7 Mar 2021 0:47 UTC
        LW: 11 AF: 5
        AF Parent
        HCH is more like an infinite bureaucracy. You have some underlings who you can ask to think for a short time, and those underlings have underlings of their own who they can ask to think for a short time, and so on. Nobody in HCH thinks for a long time, though the total thinking time of one person and their recursive-underlings may be long.
        (This is exactly why factored cognition is so important for HCH & co: the thinking all has to be broken into bite-size pieces, which can be spread across people.)
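The bureaucracy picture above can be illustrated with a toy recursion. This is a hedged sketch, not a definition of HCH: the "question" is assumed to be summing a list of numbers, the hypothetical `budget` parameter stands in for "thinking for a short time" (the most items one person may handle directly), and the split-in-half delegation rule is an arbitrary choice.

```python
def hch(question, budget=2):
    """Toy bureaucracy: sum the numbers in `question` when no single person
    may handle more than `budget` items directly.

    Returns (answer, people_consulted). Both the task and the delegation
    rule are illustrative assumptions.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if len(question) <= budget:
        # Small enough for one person to answer in a short time.
        return sum(question), 1
    # Otherwise delegate each half to an underling and combine their answers.
    mid = len(question) // 2
    left, n_left = hch(question[:mid], budget)
    right, n_right = hch(question[mid:], budget)
    return left + right, 1 + n_left + n_right

answer, people = hch(list(range(1, 9)))
```

No individual call ever looks at more than `budget` items, yet the total work across the tree grows with the size of the question. The toy task factors perfectly; the open question in the thread is whether hard problems do.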
        Ajeya Cotra 7 Mar 2021 5:08 UTC
        LW: 1 AF: 1
        AF Parent
        Yes sorry — I’m aware that in the HCH procedure no one human thinks for a long time. I’m generally used to mentally abstracting HCH (or whatever scheme fits that slot) as something that could “effectively replicate the benefits you could get from having a human thinking a long time,” in terms of the role that it plays in an overall scheme for alignment. This isn’t guaranteed to work out, of course. My position is similar to Rohin’s above:
        
        I just personally find it easier to think about “benefits of a human thinking for a long time” and then “does HCH get the same benefits as humans thinking for a long time” and then “does iterated amplification get the same benefits as HCH”.