This will also be posted on the EA Forum and included in a sequence with some of my previous posts and others I’ll publish this year.

Introduction

Humans think critically about values and, to a certain extent, they also act according to their values. To the average human, the difference between increasing world happiness and increasing world suffering is huge and evident, while goals such as collecting coins and collecting stamps are roughly on the same level.

It would be nice to make these differences as obvious to AI as they are to us. Even though exactly copying what happens in the human mind is probably not the best strategy to design an AI that understands ethics, having an idea of how value works in humans is a good starting point.

So, how do humans reason about values and act accordingly?

Key points

Let’s take a step back and start from sensation. Through the senses, information goes from the body and the external environment to our mind.

After some brain processing — assuming we’ve had enough experiences of the appropriate kind — we perceive the world as made of objects. A rock is perceived as distinct from its surrounding environment because of its edges, its colour, its weight, the fact that my body can move through air but not through rocks, and so on.

Objects in our mind can be combined with each other to form new objects. After seeing various rocks in different contexts, I can imagine a scene in which all these rocks are in front of me, even though I haven’t actually seen that scene before.

We are also able to apply our general intelligence — think of skills such as categorisation, abstraction, induction — to our mental content.

Other intelligent animals do something similar. They probably understand that, to satisfy thirst, water in a small pond is not that different from water flowing in a river. However, an important difference is that animals’ mental content is more constrained than ours: we are less limited by what we perceive in the present moment, and we are also better at combining mental objects with each other.

For example, to a dog, its owner works as an object in the dog’s mind, while many of its owner’s beliefs do not. Some animals can attribute simple intentions and perception, e.g. they understand what a similar animal can and cannot see, but it seems they have trouble attributing more complex beliefs.

The ability to compose mental content in many different ways is what allows us to form abstract ideas such as mathematics, religion, and ethics, just to name a few.

Key point 1:

In humans, mental content can be abstract.

Now notice that some mental content drives immediate action and planning. If I feel very hungry, I will do something about it, in most cases.

This process from mental content to action doesn’t have to be entirely conscious. I can instinctively reach for the glass of water in front of me as a response to an internal sensation, without moving my attention to the sensation or realising it is thirst.

Key point 2:

Some mental content drives behaviour.

Not all mental content drives action and planning. The perception of an obstacle in front of me might change how I carry out my plans and actions, but it is unlikely to change what I plan and act for. Conversely, being very hungry directly influences what I’m going to do — not just how I do it — and can temporarily override other drives. It is in this latter sense that some mental content drives behaviour.

In humans, the mental content that does drive behaviour can be roughly split into two categories.

The first one groups what we often call evolutionary or innate drives, like hunger and thirst in the examples above, and works similarly in other animals. It is mostly fixed, in the sense that unless I make drastic changes to my body or mind, I will keep perceiving how hungry I am and this will influence my behaviour virtually each day of my life.

The second category is about what we recognise as valuable, worth doing, better than possible alternatives, or simply good. This kind of drive is significantly less fixed than the first category: what we consider valuable may change after we reflect on it in context with our other beliefs, or as a consequence of life experiences.

Some examples will help clarify this. Think of a philosopher who adjusts her beliefs about value as she learns and reflects more about ethics, and then takes action in line with her new views. Or consider someone who has become an atheist and has stopped placing value on religion and praying, because he now sees the concept of god as inconsistent with everything else he knows about the world.

This second category of mental content that drives behaviour is not only about ethical or abstract beliefs. A mundane example might be more illustrative: someone writes down a shopping list after an assessment of what seems worth buying at that moment, then proceeds with the actual shopping. In this case, the influence of deliberation on future action is straightforward.

Key point 3:

In humans, part of the mental content that drives behaviour changes with experience and reflection.

This last point clarifies some of the processes underlying the apparently simple statement that ‘we act according to our values’.

It also helps explain how we get to discriminate between goals such as increasing world happiness and increasing world suffering, mentioned in the introduction. From our frequent experiences of pleasure and pain, we categorise many things as ‘good (or bad) for me’; then, through a mix of empathy, generalisation, and reflection, we get to the concept of ‘good (or bad) for others’, which comes up in our minds so often that the difference between the two goals strikes us as evident and influences our behaviour (towards increasing world happiness rather than world suffering, hopefully).

Differences from animals and AI

Animals

Point 3 is fundamental to human behaviour. Together with point 1, it explains why some of our actions have motives that are quite abstract and not immediately reducible to evolutionary drives. In contrast, the behaviour of other animals is more grounded in perception, and is well explained even without resorting to reflection or an abstract concept of value.

AI

Point 3 is also a critical difference between humans and current AI systems. Even though AIs are getting better and better at learning – thus, in a sense, their behaviour changes with experience – their tasks are still chosen by their designers, programmers, or users, not by each AI through a process of reflection.

This shouldn’t be surprising: in a sense, we want AIs to do what we want, not what they want. At the same time, I think that connecting action to reflection in AI will, with enough research and experiments, allow us to get AI that thinks critically about values and sees the world through lenses similar to ours.

In a future post I’ll briefly go through the (lack of) research related to AI that reflects on what is valuable and worth doing. I’ll also give some ideas about how to write an algorithm for an agent that reflects.

Appendix: quick comparison with shard theory

As far as I understand, shard theory is still a work in progress; in this comparison I’ll focus just on some interesting ideas I’ve read in Reward is not the optimization target.

In a nutshell, Alex Turner sees humans as reinforcement learning (RL) agents, but makes the point that reward does not work like many people in the field of RL think it works. Turner writes that “reward is not, in general, that-which-is-optimized by RL agents”; many RL agents do not act as reward maximisers in the real world. Rather, reward imposes a reinforcement schedule that shapes the agent’s cognition, by e.g. reinforcing thoughts and/or computations in a context, so that in the future they will be more likely to happen in a similar enough context.
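To make the "reinforcement schedule" reading concrete, here is a minimal sketch using a standard REINFORCE-style update on a toy two-context bandit. It is not taken from Turner's post: the context names, the environment, and the function names `softmax` and `train` are all hypothetical. The thing to notice is that reward never appears as something the agent represents or plans over; it only scales how much the action just taken is reinforced in the context where it was taken.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    # prefs[context][action]: how strongly each action is favoured in each context.
    prefs = {"kitchen": [0.0, 0.0], "office": [0.0, 0.0]}
    # Hypothetical environment: action 0 is rewarded in the kitchen, action 1 in the office.
    rewarded = {"kitchen": 0, "office": 1}
    for _ in range(steps):
        context = rng.choice(["kitchen", "office"])
        probs = softmax(prefs[context])
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == rewarded[context] else 0.0
        # REINFORCE-style update: reward only scales the size of the change
        # to the preferences used in this context; other contexts are untouched.
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[context][a] += lr * reward * grad
    return prefs

if __name__ == "__main__":
    learned = train()
    for context, p in learned.items():
        print(context, [round(x, 2) for x in softmax(p)])
```

After training, each context has its own reinforced habit. Whether the resulting agent is best described as "maximising reward" is exactly the question under discussion; the sketch only shows the mechanism, not an answer.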

I agree with Turner that modelling humans as simple reward maximisers is inappropriate, in line with everything I’ve written in this post. At the same time, I don’t think that people who write papers about RL are off-track: I consider AIXI to be a good mathematical abstraction of many different RL algorithms, convergence theorems are valid for these algorithms, and thinking of RL in terms of reward maximisation doesn’t seem particularly misleading to me.

Thus, I would solve this puzzle about human values, reward, and RL not by revisiting the relation between reward and RL algorithms, but by declining to equate humans with RL agents. RL, by itself, doesn’t seem to be a good model of what humans do. If asked why humans do not wirehead, I would reply that what we consider valuable and worth doing competes with other drives in action selection; I would not say that humans are RL agents but that reward works differently from how RL academics think it works.
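As a purely illustrative toy, and not a proposal for how such an agent should actually be built, the idea of values competing with innate drives in action selection could be sketched as follows. Everything here is an assumption made for the example: the class `ToyAgent`, the additive scoring rule, the option names, and the numbers.

```python
from dataclasses import dataclass, field

@dataclass
class ToyAgent:
    # First category: innate drives, with weights that stay fixed.
    drive_weights: dict = field(default_factory=lambda: {"pleasure": 1.0})
    # Second category: what the agent considers valuable; reflection can revise it.
    value_weights: dict = field(default_factory=lambda: {"helping_others": 0.1})

    def score(self, option):
        # Assumed rule: drives and values simply add up when scoring an option.
        drive = sum(w * option.get(k, 0.0) for k, w in self.drive_weights.items())
        value = sum(w * option.get(k, 0.0) for k, w in self.value_weights.items())
        return drive + value

    def choose(self, options):
        # Pick the option name with the highest combined score.
        return max(options, key=lambda name: self.score(options[name]))

    def reflect(self, key, new_weight):
        # Stand-in for reflection: only the value weights can change.
        self.value_weights[key] = new_weight

# Hypothetical options, described by how they bear on each drive or value.
OPTIONS = {
    "wirehead": {"pleasure": 1.0, "helping_others": -1.0},
    "volunteer": {"pleasure": 0.2, "helping_others": 1.0},
}

if __name__ == "__main__":
    agent = ToyAgent()
    print(agent.choose(OPTIONS))
    agent.reflect("helping_others", 1.0)
    print(agent.choose(OPTIONS))
```

In this toy, the drive for pleasure is the same before and after reflection; what changes the choice is the revised weight on what the agent considers valuable.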

Having said that, I still find many ideas in Reward is not the optimization target really interesting and instructive, e.g. that reward acts as a reinforcement schedule. It’s probably among the most thought-provoking posts I’ve read on the Alignment Forum.

This work was supported by CEEALAR and by an anonymous donor.

Thanks to Nicholas Dupuis for many useful comments on a draft.

On value in humans, other animals, and AI