Extensive and Reflexive Personhood Definition

Epistemic status: Speculative idea

It is highly likely that making a friendly AI somehow requires a good definition of what counts as a morally significant person. The optimal solution may be to figure out what consciousness is, and point to that. But just in case that turns out to be impossible and/or ill-defined, we should look at other options too.

In this post I will explore using an extensional definition for this task. An extensional definition defines a concept by pointing at examples, rather than by describing it in terms of other concepts [1].

Some reasons to be optimistic about using extensional definition for personhood and other important moral concepts:

This is more or less just (semi) supervised learning, which means we can take advantage of the progress in this field.
You can teach “I don’t know how to define it, but I know it when I see it” type of information. This means we do not have to know exactly what should be considered a person at the launch of the AI. We can make it up as we go along and include more and more types of objects in the training data over time.
The information is not tied to a specific ontology [2].

The main obstacle with extensional definition is that there is no way of making it complete. Therefore, the AI must keep learning forever, which in turn means we need a never-ending pool of training data.

* * *

Here is my naive suggestion for a person detection system, in the context of how it connects to friendliness [3]:

Hardcoded into the AI:

The concept of “person” as a process which, at some level of the map, can and should be modeled as an agent with beliefs and preferences [4].
The belief that persons are better than random at detecting other persons.
The AI’s goal is to optimise [5] for the aggregation [6] of the preferences [7] of all persons.

Give the AI some sort of initial person detector to get it started, e.g.:

A program that recognizes human faces.
A specific person that can be queried.

The idea is that the AI can acquire more training data on what is a person by asking known persons for their beliefs. Since the AI tries to optimise for all persons’ preferences, it is motivated to learn who else is a person.

To begin with, the AI is supposed to learn that humans are persons. But over time the category can expand to include more things (e.g. aliens, ems, non-human animals [8]). The AI will consider you a person if most previous persons consider you a person. This inclusion mechanism is not perfect. For example, we humans have been embarrassingly slow to conclude that all humans are persons. However, we do seem to get there eventually, and I can’t think of any other inclusion method that has enough flexibility.
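As a toy illustration of the inclusion rule above (not a proposal for how a real system would do it), here is a minimal Python sketch. Everything in it is an assumption made for the example: the function name `expand_persons`, the representation of beliefs as plain sets of names, and the reading of “most” as a strict majority of the currently recognised persons.

```python
from typing import Dict, Set


def expand_persons(
    seed: Set[str],
    beliefs: Dict[str, Set[str]],
    candidates: Set[str],
) -> Set[str]:
    """Grow the set of persons by repeated majority vote (hypothetical rule).

    seed: entities accepted by the initial person detector.
    beliefs: for each entity, the set of entities it considers persons.
    candidates: entities that might be admitted.
    """
    persons = set(seed)
    while True:
        # Evaluate every remaining candidate against the current electorate.
        admitted = set()
        for candidate in candidates - persons:
            votes = sum(1 for p in persons if candidate in beliefs.get(p, set()))
            if 2 * votes > len(persons):  # strict majority (assumed meaning of "most")
                admitted.add(candidate)
        if not admitted:
            return persons
        # Newly admitted persons get a vote in the next round.
        persons |= admitted


seed = {"alice", "bob", "carol"}
beliefs = {
    "alice": {"dave", "em1"},
    "bob": {"dave"},
    "carol": {"em1"},
    "dave": {"em1", "dog"},
}
result = expand_persons(seed, beliefs, candidates={"dave", "em1", "dog"})
print(sorted(result))
```

In this example `dave` and `em1` are each endorsed by two of the three seed persons and are admitted, while `dog` is endorsed only by `dave` and stays out. Note that the category only ever grows under this rule, and that newly admitted persons change who gets admitted later.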

However, this naive construction for defining personhood is not safe, even if all dependencies [3, 5, 6, 7] are solved.

In general, agents already identified as persons will not agree on who else is a person. This might lead the AI to favour a perverse but more stable solution to the “who is a person” problem, e.g. concluding that “persons” refers only to the members of a small cult who strongly believe that they, and only they, are real people.
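To make the worry concrete, here is a small Python sketch under one possible reading of “stable”: how well the members of a proposed group agree with the group’s own boundary. The `agreement` function, the names, and the belief sets are all invented for this example, and the score is just one way the notion could be formalised.

```python
from typing import Dict, Set


def agreement(group: Set[str], beliefs: Dict[str, Set[str]], universe: Set[str]) -> float:
    """Fraction of (member, candidate) judgments that match the group boundary.

    A judgment matches if the member considers the candidate a person
    exactly when the candidate is in the group. Assumes non-empty inputs.
    """
    matches = 0
    total = 0
    for member in group:
        for candidate in universe:
            believes = candidate in beliefs.get(member, set())
            if believes == (candidate in group):
                matches += 1
            total += 1
    return matches / total


universe = {"a", "b", "c", "d", "x", "y"}
beliefs = {
    # Ordinary people who disagree with each other about who counts.
    "a": {"a", "b", "c", "d", "x", "y"},
    "b": {"a", "b"},
    "c": {"c", "d"},
    "d": {"a", "b", "c", "d"},
    # Cult members who recognise only each other.
    "x": {"x", "y"},
    "y": {"x", "y"},
}

broad_score = agreement({"a", "b", "c", "d"}, beliefs, universe)
cult_score = agreement({"x", "y"}, beliefs, universe)
print(broad_score, cult_score)
```

Here the cult scores perfect internal agreement while the broader group does not, so a learner that rewards self-consistency of this kind would prefer the cult.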

It becomes even worse if we consider persons actively trying to hack the system, e.g. by creating lots of simple computer programs that all agree with me and then convincing the AI that they are all persons.
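A minimal Python sketch of this attack, assuming a strict-majority vote among recognised persons and assuming the attacker has already succeeded in getting a batch of programs recognised. The function `majority_accepts` and all names are hypothetical.

```python
from typing import Dict, Set


def majority_accepts(candidate: str, persons: Set[str], beliefs: Dict[str, Set[str]]) -> bool:
    """True if a strict majority of `persons` consider `candidate` a person."""
    votes = sum(1 for p in persons if candidate in beliefs.get(p, set()))
    return 2 * votes > len(persons)


# Three honest persons who recognise only each other.
honest = {"alice", "bob", "carol"}
beliefs: Dict[str, Set[str]] = {p: set(honest) for p in honest}

# The attacker's simple programs all agree with whatever the attacker wants,
# here that each other and one more puppet are persons.
bots = {f"bot{i}" for i in range(4)}
target = "puppet"
for bot in bots:
    beliefs[bot] = bots | {target}

# Before the bots are recognised, nobody vouches for the puppet.
before = majority_accepts(target, honest, beliefs)
# Assumption: the attacker has convinced the AI that the bots are persons.
after = majority_accepts(target, honest | bots, beliefs)
print(before, after)
```

Once the four bots count as persons they hold four of seven votes, so anything they agree on passes, including admitting further bots.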

* * *

[1] Extensions and Intensions

[2] I plan to go deeper into this in another post.

[3] Assuming robust AGI.

[4] Note that not everything that can be modeled as an agent with beliefs and preferences is a person. But everything that is a person is assumed to have this structure at some level.

[5] I am skipping over the problem of how to line up the state of the world with persons’ preferences, rather than lining up persons’ preferences with the state of the world. Part of me believes that this is easily solved by separating preference learning and preference optimisation in the AI. Another part of me believes that this is a really hard problem which we cannot even begin to work on until we have a better understanding of what preferences [7] are.

[6] I have not yet found a method for aggregating that I like, but I think that this problem can be separated out from this discussion.

[7] The word “preference” is hiding so much complexity and confusion that I don’t even know if it is a good concept when applied to humans (see: Call for cognitive science in AI safety). Feel free to interpret “preference” as a placeholder for that thing in a person which is relevant for the AI’s decision.

[8] Note that caring about the wellbeing of X is different from considering them persons, in this formulation. If enough persons care about X, then the AI will pick up this preference, even if X is not a person.