I don’t understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research.
I don’t mean to say this post warrants a new kind of AI alignment research, and I don’t think I said that, but perhaps I’m missing some kind of subtext I’m inadvertently sending?
I would say this post warrants research on multi-agent RL and/or AI social choice and/or fairness and/or transparency, none of which are “new kinds” of research (I promoted them heavily in my preceding post), and none of which I would call “alignment research” (though I’ll respect your decision to call all these topics “alignment” if you consider them that).
I would say, and I did say:
directing more x-risk-oriented AI research attention toward understanding RAAPs and how to make them safe to humanity seems prudent and perhaps necessary to ensure the existential safety of AI technology. Since researchers in multi-agent systems and multi-agent RL already think about RAAPs implicitly, these areas present a promising space for x-risk oriented AI researchers to begin thinking about and learning from.
I do hope that the RAAP concept can serve as a handle for noticing structure in multi-agent systems, but again I don’t consider this a “new kind of research”, only an important/necessary/neglected kind of research for the purposes of existential safety. Apologies if I seemed more revolutionary than intended. Perhaps it’s uncommon to take a strong position of the form “X is necessary/important/neglected for human survival” without also saying “X is a fundamentally new type of thinking that no one has done before”, but that is indeed my stance for X ∈ {a variety of non-alignment AI research areas}.
From your reply to Paul, I understand your argument to be something like the following:
Any solution to single-single alignment will involve a tradeoff between alignment and capability.
If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
If AI systems are designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
Given the technical knowledge to design cooperative AI, the incentives favor cooperative AI, since cooperative AIs can come out ahead by striking mutually beneficial deals even purely in terms of capability. Therefore, producing such technical knowledge will prevent catastrophe.
We might still need regulation to prevent players who irrationally choose to deploy uncooperative AI, but this kind of regulation is relatively easy to promote since it aligns with competitive incentives (an uncooperative AI wouldn’t have much of an edge; it would just threaten to drag everyone into a mutually destructive strategy).
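To make the competitive-slide step concrete, here is a toy model (entirely my own construction, not something from the thread): two firms each pick an alignment level, capability is what remains, and market share goes to whichever firm is more capable.

```python
# Toy model (my own assumptions, not from the thread): each firm picks an
# alignment level a in [0, 1]; its capability is 1 - a. Competition is
# winner-take-most on capability, so profit rewards cutting alignment.

LEVELS = [i / 10 for i in range(11)]  # alignment levels 0.0 .. 1.0

def profit(my_a, their_a):
    """Winner-take-most competition: the more capable firm takes the market."""
    my_cap, their_cap = 1 - my_a, 1 - their_a
    if my_cap > their_cap:
        return 1.0
    if my_cap < their_cap:
        return 0.0
    return 0.5  # tie: split the market

def best_response(their_a):
    """Most profitable alignment level against a fixed opponent."""
    return max(LEVELS, key=lambda a: profit(a, their_a))

# Iterated best response from a high-alignment start slides to a = 0:
# the uncooperative race to the capability end of the tradeoff.
a1 = a2 = 1.0
for _ in range(20):
    a1, a2 = best_response(a2), best_response(a1)
print(a1, a2)  # -> 0.0 0.0

# A binding deal at a = 0.9 gives each firm the same profit (0.5) as the
# a = 0 equilibrium, so neither loses anything by cooperating, provided
# the deal is enforceable.
print(profit(0.9, 0.9))  # -> 0.5
```

The point of the sketch is only that unilateral alignment is punished while a symmetric, enforceable deal is costless, which is the incentive structure the argument relies on.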
I think this argument has merit, but also the following weakness: given single-single alignment, we can delegate the design of cooperative AI to the initial uncooperative AI. Moreover, uncooperative AIs have an incentive to self-modify into cooperative AIs, if they assign even a small probability to their peers doing the same. I think we definitely need more research to understand these questions better, but it seems plausible we can reduce cooperation to “just” solving single-single alignment.
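The self-modification incentive can be made precise with a small expected-value calculation (the payoff parameters below are my own illustrative assumptions, not anything claimed in the thread). Writing p for the probability an uncooperative AI assigns to its peer self-modifying, modifying beats staying whenever p exceeds cost / (gain + cost):

```python
# Illustrative expected-value check (payoff parameters are my own
# assumptions). "Modify" = self-modify into a cooperative AI; p = the
# probability assigned to the peer doing the same.
#
# EV(modify) - EV(stay) = p * gain - (1 - p) * cost, which is positive
# exactly when p > cost / (gain + cost).

def threshold(gain, cost):
    """Smallest p at which self-modifying has higher expected value.

    gain: advantage of mutual cooperation over staying uncooperative
          against a cooperative peer.
    cost: loss from being the lone cooperator against an uncooperative
          peer.
    """
    return cost / (gain + cost)

# If mutual cooperation averts a catastrophe (large gain) while a lone
# cooperator gives up only a sliver of capability (small cost), the
# threshold probability is tiny -- matching the "even a small
# probability" claim.
print(threshold(gain=1000, cost=1))  # -> 0.000999...
```

The qualitative claim then hinges on the assumption that the gain from mutual cooperation dwarfs the cost of unilateral cooperation; if being the lone cooperator were very costly, the required probability would no longer be small.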