Mark Xu comments on On how various plans miss the hard bits of the alignment challenge

Mark Xu 12 Jul 2022 16:53 UTC
LW: 21 AF: 6
Flagging that I don’t think your description of what ELK is trying to do is that accurate, e.g. we explicitly don’t think that you can rely on using ELK to ask your AI if it’s being deceptive, because it might just not know. In general, we’re currently quite comfortable with not understanding a lot of what our AI is “thinking”, as long as we can get answers to a particular set of “narrow” questions we think is sufficient to determine how good the consequences of an action are. More in “Narrow” elicitation and why it might be sufficient.

Separately, I think that ELK isn’t intended to address the problem you refer to as a “sharp left turn” as I understand it. Vaguely, ELK is intended to be an ingredient in an outer-alignment solution, while it seems like the problem you describe falls roughly into the “inner alignment” camp. More specifically, but still at a high level of gloss, the way I currently see things is:
- If you want to train a powerful AI, currently the set of tasks you can train your AI on will, by default, result in your AI murdering you.
- Because we currently cannot teach our AIs to be powerful by doing anything except rewarding them for doing things that straightforwardly imply that they should disempower humans, you don’t need a “sharp left turn” in order for humanity to end up disempowered.
- Given this, it seems like there’s still a substantial part of the difficulty of alignment that remains to be solved even if we knew how to cope with the “sharp left turn.” That is, even if capabilities were continuous in SGD steps, training powerful AIs would still result in catastrophe.
- ELK is intended to be an ingredient in tackling this difficulty, which has been traditionally referred to as “outer alignment.”
Even more separately, it currently seems to me that it’s very hard to work on the problem you describe while treating other components [like your loss function] as a black box. My guess is that “outer alignment” solutions need to do non-trivial amounts of “reaching inside the model’s head” to be plausible, and a lot of how to ensure capabilities and alignment generalize together is going to depend on details of how you would have prevented it from murdering you in the [capabilities continuous with SGD] world.

ELK for learned optimizers has some more details.
- paulfchristiano 12 Jul 2022 19:01 UTC
  LW: 9 AF: 7
  I think that the sharp left turn is also relevant to ELK, if it leads to your system not generalizing from “questions humans can answer” to “questions humans can’t answer.” My suspicion is that our key disagreements with Nate are present in the case of solving ELK and are not isolated to handling high-stakes failures.
  (However, it’s frustrating to me that I can never pin down Nate or Eliezer on this kind of thing, e.g. would they still be pessimistic if there were a low-stakes AI deployment in the sense of this post?)