Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

Iason Gabriel’s 2020 article Artificial Intelligence, Values, and Alignment is a philosophical perspective on what the goal of alignment actually is, and how we might accomplish it. In the best spirit of modern philosophy, it offers a helpful framework for organizing what has already been written about the levels at which we might align AI systems, and draws a neat set of connections between concepts in AI alignment and concepts in modern philosophy.

Goals of alignment

Gabriel identifies six levels at which we might define what it means to align AI with something:

  1. Instructions: the agent does what I instruct it to do.

  2. Expressed intentions: the agent does what I intend it to do.

  3. Revealed preferences: the agent does what my behaviour reveals I prefer.

  4. Informed preferences or desires: the agent does what I would want it to do if I were rational and informed.

  5. Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking.

  6. Values: the agent does what it morally ought to do, as defined by the individual or society.

Schemas like this are helpful because they can “pop us out” from our unexamined paradigms. If we are, for example, having a discussion about building AI from inside the “revealed preferences” paradigm, it is good to know that we are having a discussion from inside that paradigm. It is a virtue of modern philosophy to always be asking what unexamined paradigm we are inside, and to push us to at least see that we are inside such-and-such a paradigm, so that we can examine it and decide whether to keep working within it.

In that spirit, I would like to offer a conceptualization of the paradigm that I think all six of these levels are within, in order that we might examine that paradigm and decide whether we wish to keep working within it. It seems to me that we presuppose that when we deploy AI, we will pass agency away from humans and into the hands of the AI, at least for as long as it takes the AI to execute our instructions, intentions, preferences, interests, or values. We imagine our future AIs as assistants, genies, or agents with which we are going to have some initial period of contact, followed by a period during which these powerful agents go off and do our bidding, followed perhaps by later iterations in which these agents come back for further instructions, intentions, preferences, interests, or values. We understandably find it troubling to consider turning so much of our agency over to an external entity, yet most work in AI alignment is about how to safely navigate this hand-off of agency, and there is relatively little discussion of whether we should be doing all of our thinking under the assumption of a hand-off of agency. I will call the paradigm that I think we are inside the Agency Hand-off Paradigm:

Figure: The Agency Hand-off Paradigm

A few brief notes on how this relates to existing work in AI alignment:

  • The AI alignment sub-field of corrigibility is concerned with the design of AI that we can at least switch off if we later regret the instructions, intentions, preferences, interests, or values that we gave it. This is of course a good property for AI to have if we are going to hand off agency to it, but work on corrigibility still places us inside the Agency Hand-off Paradigm almost by default.

  • Stuart Russell’s work on assistance games is about transmitting instructions, intentions, preferences, interests, and values from humans to AIs as an ongoing dialog rather than a one-shot up-front data dump, but this work still assumes that agency is going to be handed over to our AIs; it’s just that the arrow from “Human” to “AI” in the figure above becomes a sequence of arrows.

  • Eliezer Yudkowsky’s writing on coherent extrapolated volition and Paul Christiano’s writing on indirect normativity are both concerned with extracting values from humans in a way that bypasses our limited ability to articulate our own values. Yet both bodies of work presuppose that there is going to be some phase during which we extract values from humans, followed by a phase during which our AIs are going to take actions on the basis of these values. Under this assumption we indeed ought to be very concerned about getting the value-extraction step right since the whole future of the world hangs on it.

  • Significant portions of Nick Bostrom’s book Superintelligence were concerned with the dangers of open-ended optimization over the world. It seems to me that the basic reason to be concerned about powerful optimizers in the first place is that they are precisely the category of system that takes agency away from humans.

But perhaps there is room to question the Agency Hand-off Paradigm. I would very much like to see proposals for AI alignment that escape completely from the assumption that we are going to hand off agency to AI. What would it look like to have powerful intelligent systems that increased rather than decreased the extent to which humans have agency over the future?

Think of a child playing in a sand pit. The child’s parent has constructed the sand pit for the child and will keep the child safe. If the child happens to find, say, a shard of glass, then the parent may take it away. But for the most part the parent will just let the child play and learn and grow. It would be a little strange to think of the parent as taking instructions, intentions, preferences, interests, or values from the child and then assuming agency over the arrangement of sand in the sand pit on that basis. Yes, the parent acts on a sense of what is in the child’s best interests by taking away the shard of glass, but not because the parent understands the child’s intentions for how all the sand should ultimately be arranged and is accelerating things in that direction; rather, because the shard of glass threatens the child’s own agency in a way that the child cannot account for in the short term. In the long term the parent will help the child to grow in such a way that they will be able to safely handle sharp objects on their own, and the parent will eventually fade away from the child’s life completely. The long-run flow of agency is towards the child, not towards the parent. Is it not possible that we could build AI that ensures that agency flows towards us, not away from us, over the long run?