Rohin Shah comments on [AN #156]: The scaling hypothesis: a plan for building AGI

Rohin Shah 20 Jul 2021 6:59 UTC
LW: 8 AF: 6
Yes, that’s right, sorry about the confusion.
- Daniel Kokotajlo 21 Jul 2021 12:23 UTC
  LW: 2 AF: 2
  Huh, then it seems I misunderstood you! The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn. But this seems unsupported by the argument you made, and in fact the opposite seems supported by it. The argument you made was:
  There are several pretraining objectives that could have been used to train GPT-N other than next word prediction (e.g. masked language modeling). For each of these, there’s a corresponding model that the resulting GPT-N would “try” to <do the thing in the pretraining objective>. These models make different predictions about what GPT-N would do off distribution. However, by claim 3 it doesn’t matter much which pretraining objective you use, so most of these models would be wrong.
  Seems to me the conclusion of this argument is that “In general it’s not true that the AI is trying to achieve its training objective.” The natural corollary is: We have no idea what the AI is trying to achieve, if it is trying to achieve anything at all. So instead of concluding “It’ll probably just keep filling in missing words as in training” we should conclude “we have no idea what it’ll do; treacherous turn is a real possibility because that’s what’ll happen for most goals it could have, and it may have a goal for all we know.”
  - Rohin Shah 22 Jul 2021 14:35 UTC
    LW: 4 AF: 4
    The fourth bullet point claims that GPT-N will go on filling in missing words rather than doing a treacherous turn.
    ?? I said nothing about a treacherous turn? And where did I say it would go on filling in missing words?
    EDIT: Ah, you mean the fourth bullet point in ESRogs' response. I was thinking of that as one example of how such reasoning could go wrong, as opposed to the only case. So in that case model_1 confidently predicts a treacherous turn, but this is the wrong epistemic state to be in because it is also plausible that it just “fills in words” instead.
    Seems to me the conclusion of this argument is that “In general it’s not true that the AI is trying to achieve its training objective.”
    Isn’t that effectively what I said? (I was trying to be more precise since “achieve its training objective” is ambiguous, but given what I understand you to mean by that phrase, I think it’s what I said?)
    we have no idea what it’ll do; treacherous turn is a real possibility because that’s what’ll happen for most goals it could have, and it may have a goal for all we know.
    This seems reasonable to me (and seems compatible with what I said).
    - Daniel Kokotajlo 23 Jul 2021 6:18 UTC
      LW: 4 AF: 4
      OK cool, sorry for the confusion. Yeah, I think ESRogs' interpretation of you was making a somewhat stronger claim than you actually were.