paulfchristiano comments on Why all the fuss about recursive self-improvement?

paulfchristiano 13 Jun 2022 1:06 UTC
65 points
30
we too predict that it’s easy to get GPT-3 to tell you the answers that humans label “aligned” to simple word problems about what we think of as “ethical”, or whatever. That’s never where we thought the difficulty of the alignment problem was in the first place. Before saying that this shows that alignment is actually easy contra everything MIRI folk said, consider asking some MIRI folk for their predictions about what you’ll see
I partly agree with this. Like you, I’m often frustrated by people thinking this is where the core of the problem will be, or where the alignment community mistakenly thought the core of the problem was. And this is especially bad when people see systems that all of us would expect to work, and thereby getting optimistic about exactly the same systems that we’ve always been scared of.
That said, I think this is an easy vibe to get from Eliezer’s writing, and it also looks to me like he has some prediction misses here. Overall it feels to me like this should call for more epistemic humility regarding the repeated refrain of “that’s doomed.”
Some examples that I feel like don’t seem to match recent results. I expect that Eliezer has a different interpretation of these, and I don’t think someone should interpret a non-response by Eliezer as any kind of acknowledgment that my interpretation is correct.
- Here: “Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like “being smart” and “being a good person” and “still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy”, is a pretty huge ask.”
  
  I think that progress in language modeling makes this view look much worse than it did in 2018. It sure looks like we are going to have inexact imitations of humans that are able to do useful work, to continue to broadly agree with humans about what you “ought to do” in a way that is common-sensically smart (such that to the extent you get useful work from them it’s still “good” in the same way as a human’s behavior). It also looks like those properties are likely to be retained when a bunch of them are collaborating in a “Chinese Room Bureaucracy,” though this is not clear. And it looks quite plausible that all of this is going to happen significantly before we have systems powerful enough that they are only imitating human behavior instrumentally (at which point X-and-only-X becomes a giant concern).
  
  The issue definitely isn’t settled yet. But I think that most people outside of LW would look at the situation and say “huh, Eliezer seems like he was too confidently pessimistic about the ability of ML to usefully+safely imitate humans before it became a scary optimization daemon.”
- Here: “Similar remarks apply to interpreting and answering “What will be its effect on _?” It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I’m being deliberate. It’s also worth noting that “What is the effect on X?” really means “What are the effects I care about on X?” and that there’s a large understanding-the-human’s-utility-function problem here. In particular, you don’t want your language for describing “effects” to partition, as the same state of described affairs, any two states which humans assign widely different utilities. Let’s say there are two plans for getting my grandmother out of a burning house, one of which destroys her music collection, one of which leaves it intact. Does the AI know that music is valuable? If not, will it not describe music-destruction as an “effect” of a plan which offers to free up large amounts of computer storage by, as it turns out, overwriting everyone’s music collection? If you then say that the AI should describe changes to files in general, well, should it also talk about changes to its own internal files? Every action comes with a huge number of consequences—if we hear about all of them (reality described on a level so granular that it automatically captures all utility shifts, as well as a huge number of other unimportant things) then we’ll be there forever.”
  
  This comment only makes sense if you think that an AI will have difficulty predicting “which consequences are important to a human” or explaining those consequences to a human with any AI system weak enough to be safe. Yet it seems like GPT-3 already has a strong enough understanding of what humans care about that it could be used for this purpose.
  
  Now I agree with you that there is a version of this concern that is quite serious and real. But this comment is very strongly framed about the capabilities of an AI weak enough to be safe, not about the difficulty of constructing a loss function that incentivizes that behavior, and some parts of it appear not to make sense when interpreted as being about a loss function (since you could use a weak AI like GPT-3 as a filter for which consequences we care about).
- Same comment: “[it] sounds like you think an AI with an alien, superhuman planning algorithm can tell humans what to do without ever thinking consequentialistically about which different statements will result in human understanding or misunderstanding. Anna says that I need to work harder on not assuming other people are thinking silly things, but even so, when I look at this, it’s hard not to imagine that you’re modeling AIXI as a sort of spirit containing thoughts, whose thoughts could be exposed to the outside with a simple exposure-function. It’s not unthinkable that a non-self-modifying superhuman planning Oracle could be developed with the further constraint that its thoughts are human-interpretable, or can be translated for human use without any algorithms that reason internally about what humans understand, but this would at the least be hard”
  
  This really makes it sound like “non-self-modifying” and “Oracle” are these big additional asks, and that it’s hard to use language without reasoning internally about what humans will understand. That actually still seems pretty plausible, but I feel like you’ve got to admit that we’re currently in a world where everyone is building non-self-modifying Oracles that can explain the consequences of their plans (and where the “superhuman planning algorithms” that we deal with are either simple planning algorithms written by humans, as in AlphaZero, or are mostly distilled from such planning algorithms in a way that would not make it harder to get them to express their views in natural language). And that’s got to be at least evidence against the kind of view expressed in this comment, which is strongly suggesting that by the time we are building transformative AI it definitely won’t look at all like that. The closer we now are to powerful AI, the stronger the evidence against becomes.
What links here?
- On Deference and Yudkowsky’s AI Risk Estimates by bgarfinkel (EA Forum; 19 Jun 2022 14:35 UTC; 285 points)
- So8res 13 Jun 2022 3:57 UTC
  16 points
  2
  Parent
  
  I think that progress in language modeling makes this view look much worse than it did in 2018.
  
  It doesn’t look much worse to me yet. (I’m not sure whether you know things I don’t, or whether we’re reading the situation differently. We could maybe try to bang out specific bets here at some point.)
  
  Yet it seems like GPT-3 already has a strong enough understanding of what humans care about that it could be used for this purpose.
  
  For the record, there’s a breed of reasoning-about-the-consequences-humans-care-about that I think GPT-3 relevantly can’t do (related to how GPT-3 is not in fact scary), and the shallower analog it can do does not seem to me to undermine what-seems-to-me-to-be-the-point in the quoted text.
  
  I acknowledge this might be frustrating to people who think that these come on an obvious continuum that GPT-3 is obviously walking along. This looks to me like one of those “can you ask me in advance first” moments where I’m happy to tell you (in advance of seeing what GPT-N can do) what sorts of predicting-which-consequences-humans-care-about I would deem “shallow and not much evidence” vs “either evidence that this AI is scary or actively in violation of my model”.
  
  I feel like you’ve got to admit that we’re currently in a world where everyone is building non-self-modifying Oracles that can explain the consequences of their plans
  
  I don’t in fact think that the current levels of “explaining the consequences of their plans” are either impressive in the relevant way, or going to generalize in the relevant way. I do predict that things are going to have to change before the end-game. In response to these observations, my own models are saying “sure, this is the sort of thing that can happen before the end (although obviously some stuff is going to have to change, and it’s no coincidence that the current systems aren’t themselves particularly scary)”, because predicting the future is hard and my models don’t concentrate probability mass all that tightly on the details. It’s plausible to me that I’m supposed to be conceding a bunch of Bayes points to people who think this all falls on a continuum that we’re clearly walking along, but I admit I have some sense that people just point to what actually happened in a shallower way and say “see, that’s what my model predicted” rather that actually calling particulars in advance. (I can recall specific case of Dario predicting some particulars in advance, and I concede Bayes points there. I also have the impression that you put more probability mass here than I did, although fewer specific examples spring to mind, and I concede some fewer Bayes points to you.) I consider it to be some evidence, but not enough to shift me much. Reflecting on why, I think it’s on account of how my models haven’t taken hits that are bigger than they expected to take (on account of all the vaugaries), and how I still don’t know how to make sense of the rest of the world through my-understanding-of your (or Dario’s) lens.
  - paulfchristiano 13 Jun 2022 5:45 UTC
    27 points
    15
    Parent
    It doesn’t look much worse to me yet. (I’m not sure whether you know things I don’t, or whether we’re reading the situation differently. We could maybe try to bang out specific bets here at some point.)
    Which of “being smart,” “being a good person,” and “still being a good person in a Chinese bureaucracy” do you think is hard (prior to having AI smart enough to be dangerous)? Does that correspond to some prediction about the kind of imitation task that will prove difficult for AI?
    For the record, there’s a breed of reasoning-about-the-consequences-humans-care-about that I think GPT-3 relevantly can’t do (related to how GPT-3 is not in fact scary), and the shallower analog it can do does not seem to me to undermine what-seems-to-me-to-be-the-point in the quoted text.
    Eliezer gave an example about identifying which of two changes we care about (“destroying her music collection” and “changes to its own files.”) That kind of example does not seem to involve deep reasoning about consequences-humans-care-about. Eliezer may be using this example in a more deeply allegorical way, but it seems like in this case the allegory has thrown out the important part of the example and I’m not even sure how to turn it into an example that he would stand behind.
    I acknowledge this might be frustrating to people who think that these come on an obvious continuum that GPT-3 is obviously walking along. This looks to me like one of those “can you ask me in advance first” moments where I’m happy to tell you (in advance of seeing what GPT-N can do) what sorts of predicting-which-consequences-humans-care-about I would deem “shallow and not much evidence” vs “either evidence that this AI is scary or actively in violation of my model”.
    You and Eliezer often suggest that particular alignment strategies are doomed because they involve AI solving hard tasks that won’t be doable until it’s too late (as in the quoted comment by Eliezer). I think if you want people to engage with those objections seriously, you should probably say more about what kinds of tasks you have in mind.
    My current sense is that nothing is in violation of your model until the end of days. In that case it’s fair enough to say that we shouldn’t update about your model based on evidence. But that also means I’m just not going to find the objection persuasive unless I see more of an argument, or else some way of grounding out the objection in intuitions that do make some different prediction about something we actually observe (either in the interim or historically).
    I don’t in fact think that the current levels of “explaining the consequences of their plans” are either impressive in the relevant way, or going to generalize in the relevant way.
    I think language models can explain the consequences of their plans insofar as they understand those consequences at all. It seems reasonable for you to say “language models aren’t like the kind of AI systems we are worried about,” but I feel like in that case each unit of progress in language modeling needs to be evidence against your view.
    You are predicting that powerful AI will have property X (= can make plans with consequences that they can’t explain). If existing AIs had property X, then that would be evidence for your view. If existing AIs mostly don’t have property X, that must be evidence against your view. The only way it’s a small amount of evidence is if you were quite confident that AIs wouldn’t have property X.
    You might say that AlphaZero can make plans with consequences it can’t explain, and so that’s a great example of an AI system with property X (so that language models are evidence against your position, but AlphaZero is evidence in favor). That would seem to correspond to the relatively concrete prediction that AlphaZero’s inability to explain itself is fundamentally hard to overcome, and so it wouldn’t be easy to train a system like AlphaZero that is able to explain the consequences of its actions.
    Is that the kind of prediction you’d want to stand behind?
    - So8res 13 Jun 2022 15:31 UTC
      14 points
      3
      Parent
      (still travelling; still not going to reply in a ton of depth; sorry. also, this is very off-the-cuff and unreflected-upon.)
      
      Which of “being smart,” “being a good person,” and “still being a good person in a Chinese bureaucracy” do you think is hard (prior to having AI smart enough to be dangerous)?
      
      For all that someone says “my image classifier is very good”, I do not expect it to be able to correctly classify “a screenshot of the code for an FAI” as distinct from everything else. There are some cognitive tasks that look so involved as to require smart-enough-to-be-dangerous capabilities. Some such cognitive tasks can be recast as “being smart”, just as they can be cast as “image classification”. Those ones will be hard without scary capabilities. Solutions to easier cognitive problems (whether cast as “image classification” or “being smart” or whatever) by non-scary systems don’t feel to me like they undermine this model.
      
      “Being good” is one of those things where the fact that a non-scary AI checks a bunch of “it was being good” boxes before some consequent AI gets scary, does not give me much confidence that the consequent AI will also be good, much like how your chimps can check a bunch of “is having kids” boxes without ultimately being an IGF maximizer when they grow up.
      
      My cached guess as to our disageement vis a vis “being good in a Chinese bureaucracy” is whether or not some of the difficult cognitive challenges (such as understanding certain math problems well enough to have insights about them) decompose such that those cognitions can be split across a bunch of non-scary reasoners in a way that succeeds at the difficult cognition without the aggregate itself being scary. I continue to doubt that and don’t feel like we’ve seen much evidence either way yet (but perhaps you know things I do not).
      
      (from the OP:) Yet it seems like GPT-3 already has a strong enough understanding of what humans care about that it could be used for this purpose.
      
      To be clear, I agree that GPT-3 already has strong enough understanding to solve the sorts of problems Eliezer was talking about in the “get my grandma out of the burning house” argument. I read (perhaps ahistorically) the grandma-house argument as being about how specifying precisely what you want is real hard. I agree that AIs will be able to learn a pretty good concept of what we want without a ton of trouble. (Probably not so well that we can just select one of their concepts and have it optimize for that, in the fantasy-world where we can leaf through its concepts and have it optimize for one of them, because of how the empirically-learned concepts are more likely to be like “what we think we want” than “what we would want if we were more who we wished to be” etc. etc.)
      
      Separately, in other contexts where I talk about AI systems understanding the consequences of their actions being a bottleneck, it’s understanding of consequences sufficient for things like fully-automated programming and engineering. Which look to me like they require a lot of understanding-of-consequences that GPT-3 does not yet possess. My “for the record” above was trying to make that clear, but wasn’t making the above point where I think we agree clear; sorry about that.
      
      Does that correspond to some prediction about the kind of imitation task that will prove difficult for AI?
      
      It would take a bunch of banging, but there’s probably some sort of “the human engineer can stare at the engineering puzzle and tell you the solution (by using thinking-about-consequences in the manner that seems to me to be tricky)” that I doubt an AI can replicate before being pretty close to being a good engineer. Or similar with, like, looking at a large amount of buggy code (where fixing the bug requires understanding some subtle behavior of the whole system) and then telling you the fix; I doubt an AI can do that before it’s close to being able to do the “core” cognitive work of computer programming.
      
      It seems reasonable for you to say “language models aren’t like the kind of AI systems we are worried about,” but I feel like in that case each unit of progress in language modeling needs to be evidence against your view.
      
      Maybe somewhat? My models are mostly like “I’m not sure how far language models can get, but I don’t think they can get to full-auto programming or engineering”, and when someone is like “well they got a little farther (although not as far as you say they can’t)!”, it does not feel to me like a big hit. My guess is it feels to you like it should be a bigger hit, because you’re modelling the skills that copilot currently exhibits as being more on-a-continuum with the skills I don’t expect language models can pull off, and so any march along the continuum looks to you like it must be making me sweat?
      
      If things like copilot smoothly increase in “programming capability” to the point that they can do fully-automated programming of complex projects like twitter, then I’d be surprised.
      
      I still lose a few Bayes points each day to your models, which more narrowly predict that we’ll take each next small step, whereas my models are more uncertain and say “for all I know, today is the day that language models hit their wall”. I don’t see the ratios as very large, though.
      
      or else some way of grounding out the objection in intuitions that do make some different prediction about something we actually observe (either in the interim or historically).
      
      A man can dream. We may yet be able to find one, though historically when we’ve tried it looks to me like we are mostly reading the same history in different ways, which makes things tricky.
  - Quintin Pope 13 Jun 2022 5:23 UTC
    17 points
    −4
    Parent
    My specific prediction: “chain of thought” style approaches scale to (at least) human level AGI. The most common way in which these systems will be able to self-modify is by deliberately choosing their own finetuning data. They’ll also be able to train new and bigger models with different architectures, but the primary driver of capabilities increases will be increasing the compute used for such models, not new insights from the AGIs.
    What links here?
    Quintin Pope's comment on [Linkpost] Solving Quantitative Reasoning Problems with Language Models by Yitz (1 Jul 2022 14:21 UTC; 27 points)
  - lc 13 Jun 2022 5:26 UTC
    8 points
    2
    Parent
    I would love for you two to bet, not necessarily because of epistemic hygiene, but because I don’t know who to believe here and I think betting would enumerate some actual predictions about AGI development that might clarify for me how exactly you two disagree in practice.
- Vaniver 15 Jun 2022 4:38 UTC
  6 points
  1
  Parent
  It sure looks like we are going to have inexact imitations of humans that are able to do useful work, to continue to broadly agree with humans about what you “ought to do” in a way that is common-sensically smart (such that to the extent you get useful work from them it’s still “good” in the same way as a human’s behavior). It also looks like those properties are likely to be retained when a bunch of them are collaborating in a “Chinese Room Bureaucracy,” though this is not clear.
  I want to note there’s a pretty big difference between “what you say you ought to do” and “what you do”; I basically expect language models to imitate humans as well as possible, which will include lots of homo hypocritus things like saying it’s wrong to lie and also lying, and to the extent that it tries to capture “all things humans might say” it will be representing all sides of all cultural / moral battles, which seems like it misses on a bunch of consistency and coherency properties that humans say they ought to have.
  I feel like you’ve got to admit that we’re currently in a world where everyone is building non-self-modifying Oracles that can explain the consequences of their plans
  This feels like the scale/regime complaint to me? Yes, people have built a robot that can describe the intended consequences of moving its hand around an enclosure (“I will put the red block on top of the blue cylinder”), or explain the steps of solving simple problems (“Answer this multiple choice reading comprehension question, and explain your answer”), but once we get to the point where you need nontrivial filtering (“Tell us what tax policy we should implement, and explain the costs and benefits of its various features”) then it seems like the sort of thing where most of the thoughts would be opaque or not easily captured in sentences.