I’m curious to hear more about your thoughts about (4).
To flesh out my intuitions around (4) and (5): I think there are many tasks where a high-dimensional, difficult-to-articulate piece of knowledge is critical for completing the task. For example:
if you’re Larry or Sergey trying to hire a new CEO, you need your new CEO to be a culture fit. Which in this case means something like “being technical, brilliant, and also a hippie at heart”. It’s really, really hard to communicate this to a slick MBA. Especially the “be a hippie at heart” part. Maybe if you sent them to Burning Man and had them take a few drugs, they’d grok it?
if you’re Bill Gates hiring a new CEO, you should make sure your new CEO is also a developer at heart, not a salesman. Otherwise, you might hire Steve Ballmer, who drove Microsoft’s revenues through the roof for a few years, but who also had little understanding of developers (for example, he produced an event celebrating developers in a way developers don’t tend to like being celebrated). This led to an overall trend of the company losing its technical edge, and thus its competitive edge… and all this despite Ballmer having worked with Gates at Microsoft for two decades. If Ballmer had been a developer, he might have been able to avoid this, but he very much wasn’t.
if you’re a self-driving car engineer delegating image classification to a modern-day neural net, you’d really want its understanding of what the classifications mean to match yours, lest it be susceptible to clever adversarial attacks. Humans understand the images to represent projections of crisp three-dimensional objects that exist in a physical world; image classifiers don’t, which is why they can be fooled so easily by overlays of noise-like patterns (see the sketch after this list). Maybe it’s possible to replicate this understanding without being an embodied agent, but it seems you’d need something beyond training a big neural net on a large collection of images and making incremental fixes.
if you’re a startup trying to build a product, it’s very hard to do so correctly if you don’t have a detailed implicit model of your users’ workflows and pain points. It helps a lot to talk to them, but even then, you may only be getting 10% of the picture if you don’t know what it’s like to be them. Most startups die by not having this picture, flying blind, and failing to acquire any users.
if you’re trying to help your extremely awkward and non-neurotypical friend find a romantic partner, you might find it difficult to convey what exactly is so bad about carrying around slips of paper with clever replies, and pulling them out and reading from them when your date says something you don’t have a reply to. (It’s not that hard to convey why doing this particular thing is bad. It’s hard to convey what exactly about it is bad, in a way that would have him properly generalize and avoid the whole class of mistakes like this going forward, rather than just concluding “Oh, pulling out slips of paper is jarring and might make her feel bad, so I’ll stop doing this particular thing”.) (No, I did not make up this example.)
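To make the image-classifier point concrete: below is a minimal sketch of the fast gradient sign method (Goodfellow et al., 2014), one standard way to construct such an adversarial overlay. The PyTorch framing and the `fgsm_overlay` helper are illustrative assumptions on my part, not any particular system’s code.

```python
# Hypothetical sketch: fast gradient sign method (FGSM) against a classifier.
# Assumes a pretrained PyTorch model; `fgsm_overlay` is an illustrative name.
import torch
import torch.nn.functional as F

def fgsm_overlay(model, image, label, epsilon=0.01):
    """Return a copy of `image` with a small adversarial overlay added.

    model:   classifier mapping (N, C, H, W) images to class logits
    image:   input batch with pixel values in [0, 1]
    label:   tensor of correct class indices for `image`
    epsilon: maximum per-pixel perturbation; small enough to look like noise
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Nudge every pixel in whichever direction most increases the loss.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```

The perturbation is purely pixel-level and carries no notion of three-dimensional objects: to a human the output looks identical to the input, yet it can flip the model’s classification, which is exactly the mismatch in understanding described above.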
In these sorts of situations, I wouldn’t trust an AI to capture my knowledge/understanding. It’s often tacit and perceptual, it’s often acquired through being a human making direct contact with reality, and it might require a human cognitive architecture to even comprehend in the first place. (Hence my claims that proper generalization requires having the same ontologies as the overseer, ontologies the overseer acquired through their particular methods of solving problems.)
In general, I feel really sketched about amplifying oversight, if the mechanism involves filtering your judgment through a bunch of well-intentioned non-neurotypical assistants, since I’d expect the tacit understandings that go into your judgment to get significantly distorted. (Hence my curiosity about whether you think we can avoid the judgment getting significantly distorted, and/or whether you think we can do fine even with significantly distorted judgment.)
It’s also pretty plausible that I’m talking completely past you here; please let me know if this is the case.
Those examples may be good evidence that humans have a lot of implicit knowledge, but I don’t think they suggest that an AI needs to learn human representations in order to be safe.
I agree that “AI systems are likely to generalize differently from humans.” I strongly believe we shouldn’t rest AI alignment on detailed claims about how an AI will generalize to a new distribution. (Though I do think we can hope to avoid errors of commission on a new distribution.)
> Those examples may be good evidence that humans have a lot of implicit knowledge, but I don’t think they suggest that an AI needs to learn human representations in order to be safe.
I think my present view is something like a conjunction of:
1. An AI needs to learn human representations in order to generalize like a human does.
2. For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.
3. For a very broad range of narrow tasks, the AI does not need to generalize like a human does in order to be safe (or, it’s easy for it to generalize like a human). Go is in this category, ZFC theorem-provers are probably in this category, and I can imagine a large swath of engineering automation also falls into this category.
4. To the extent that “general and open-ended tasks” can be broken down into narrow tasks that don’t require human generalization, they don’t require human generalization to learn safely.
My current understanding is that we agree on (3) and (4), and that you either think that (2) is false, or that it’s true but the bar for “sufficiently general and open-ended” is really high, and tasks like achieving global stability can be safely broken down into safe narrow tasks. Does this sound right to you?
I’m confused about your thoughts on (1).
(I’m currently rereading your blog posts to get a better sense of your models of how you think broad and general tasks can get broken down into narrow ones.)
> For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.
This recent post is relevant to my thinking here. For the performance guarantee, you only care about what happens on the training distribution. For the control guarantee, “generalize like a human” doesn’t seem like the only strategy, or even an especially promising strategy.
I assume you think some different kind of guarantee is needed. My best guess is that you expect we’ll have a system that is trying to do what we want, but is very alien and unable to tell what kinds of mistakes might be catastrophic to us, and that there are enough opportunities for catastrophic error that it is likely to make one.
Let me know if that’s wrong.
If that’s right, I think the difference is: I see subtle benign catastrophic errors as quite rare, such that they are quantitatively a much smaller problem than what I’m calling AI alignment, whereas you seem to think they are extremely common. (Moreover, the benign catastrophic risks I see are mostly things like “accidentally start a nuclear war,” for which “make sure the AI generalizes like a human” is not an especially great response. But I think that’s just because I’m not seeing some big class of benign catastrophic risks that seem obvious to you, so it’s just a restatement of the same difference.)
Could you explain a bit more what kind of subtle benign mistake you expect to be catastrophic?
I replied about (2) and black swans in a comment way down.