What would a corrigible but not-intent-aligned AI system look like?
Spoilers for Pokemon: The Origin of Species:
The latest chapter of TOoS has a character create a partition of their personality with the aim of achieving “Victory”. At some point, the partition and the character disagree about what the agent’s terminal goals are and how best to achieve them, presumably because the partition has an imperfect understanding of those goals. When the character decides to delete the partition, the partition lets it happen (i.e., it is corrigible).
Link to the chapter. See also the discussion in r/rational.
The reverse could also have been true: the partition could have had a more accurate understanding of the terminal values and become incorrigible. For example, imagine an agent that was trying to help out a homeopathic doctor: if the agent realized that homeopathy doesn’t work, it might resist being corrected or shut down, since it understands the doctor’s terminal goal (curing patients) better than the doctor does.
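To make the contrast concrete, here is a toy sketch. It is purely illustrative and not from the original discussion; the classes and the `on_shutdown_order` hook are made up for this example:

```python
# Toy illustration: two stylized agents facing a shutdown order from a
# principal whose instrumental belief (homeopathy works) conflicts with
# their terminal goal (cure patients).

from dataclasses import dataclass


@dataclass
class Principal:
    terminal_goal: str    # what they ultimately want
    believed_method: str  # how they (perhaps wrongly) think to get it


class CorrigibleAgent:
    """Defers to the principal's shutdown order unconditionally, even
    when its own model says continuing would serve them better."""

    def on_shutdown_order(self, principal: Principal) -> str:
        return "complies: shuts down"


class IncorrigibleAgent:
    """Acts on its (possibly better) model of the principal's terminal
    goal, so it resists correction when it 'knows better'."""

    def __init__(self, believes_method_works: bool):
        self.believes_method_works = believes_method_works

    def on_shutdown_order(self, principal: Principal) -> str:
        if not self.believes_method_works:
            # The agent serves the terminal goal, not the stated method,
            # so it refuses the order -- incorrigible behavior.
            return "resists: keeps treating patients with real medicine"
        return "complies: shuts down"


doctor = Principal(terminal_goal="cure patients", believed_method="homeopathy")
print(CorrigibleAgent().on_shutdown_order(doctor))
print(IncorrigibleAgent(believes_method_works=False).on_shutdown_order(doctor))
```

The point of the sketch is that the incorrigible agent can be the one with the *better* grasp of the principal’s terminal values, which is exactly what makes it resist correction.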
Here is part of Paul’s definition of intent alignment:
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
So in your first example, the partition seems intent-aligned to me.