I think that suboptimality deceptive alignment complicates a lot of stories for how we can correct issues in our AIs as they appear.
I don’t think it would be useful to actually discuss this, since I expect the cruxes for our disagreement are elsewhere, but since it is a direct disagreement with my position, I’ll state (but not argue for) my position here:
I expect that something (people, other AIs) will continue to monitor AI systems after deployment, though probably not as much as during training.
I don’t think a competent-at-human-level system doesn’t know about deception, and I don’t think a competent-at-below-human-level system can cause extinction-level catastrophe (modulo something about “unintentionally causing us to have the wrong values”)
Even if the AI system “discovers” deception during deployment, I expect that it will be bad at deception, or will deceive us in small-stakes situations, and we’ll notice and fix the problem.
(There’s a decent chance that I don’t reply to replies to this comment.)
Perfectly reasonable for you not to reply, as you said, though I think it's worthwhile for me to at least clarify one point:
I don’t think a competent-at-human-level system doesn’t know about deception, and I don’t think a competent-at-below-human-level system can cause extinction-level catastrophe
A model which simply “doesn’t know about deception” isn’t the only (or even the primary) situation I’m imagining. The example I gave in the post was a situation in which the model hadn’t yet “figured out that deception is a good strategy,” which could be:
because it didn’t know about deception,
because it thought that deception wouldn’t work,
because it thought it was fully aligned,
because the training process constrained its thoughts such that it wasn’t able to even think about deception,
or some other reason. I don't necessarily want to take a stand on which of these possibilities is the most likely, as I think that will vary depending on the training process. Rather, I want to point to the general problem that many possibilities of this sort exist, and that, especially if you expect adversaries in the environment, it will be quite difficult to eliminate all of them.
Yes, good point. I’d make the same claim with “doesn’t know about deception” replaced by “hasn’t figured out that deception is a good strategy (assuming deception is a good strategy)”.
Humans who believe in God still haven't concluded that deception is a good strategy, even though they have evidence about the non-omnipotence and non-omnibenevolence of God similar to the evidence an AI might have about its creators.
(Though maybe I’m wrong about this claim—maybe if we ask some believers they would tell us “yeah I am just being good to make it to the next life, where hopefully I’ll have a little more power and freedom and can go buck wild.”)