Emotions = Reward Functions

… Or more specifically, a post on how, and why, to encode emotions to find out more about goals, rationality and safe alignment in general.

If one naively tries to fit reinforcement learning’s reward functions back onto the human mind, the closest equivalent one may find is emotions. Delving deeper into the topic, though, one finds a mishmash of other “reward signals” and auxiliary mechanisms in the human brain (such as the face-tracking reflex, which aids us in social development), and ends up hearing about affects, the official term in the study of emotions.

At least, that is approximately what happened to me.

Affects and reward functions seem to share a common functional purpose in agents. Both:

  • Direct the agent’s attention towards what is relevant.

  • Evaluate the ‘goodness’ (valence) of a situation.

  • Are required for the ‘agent’ to perform any actions.

  • Define what the agents learn, what they value, and what goals they can create based on these values.

This means that if we can map all of the human affects and write them as reward functions, we can compare various constellations of affects and see which ones produce which human-like behaviours. This in turn may help us not only to induce human-like biases in AI, but also to investigate our own values and rationality from a new perspective.
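To make the idea of a “constellation” a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three affects (`hunger`, `social_warmth`, `pain`), the observation keys, and the choice to combine affects as a weighted sum are all hypothetical, not a claim about how the brain does it.

```python
from typing import Callable, Dict

# An observation is a plain dict of named signals (an assumption of this sketch).
Observation = Dict[str, float]
Affect = Callable[[Observation], float]


def hunger(obs: Observation) -> float:
    # Hypothetical affect: low energy feels bad, full energy is neutral.
    return -(1.0 - obs["energy"])


def social_warmth(obs: Observation) -> float:
    # Hypothetical affect: being near another agent feels good.
    return obs["near_agent"]


def pain(obs: Observation) -> float:
    # Hypothetical affect: pain feels bad in proportion to its intensity.
    return -obs["pain"]


AFFECTS: Dict[str, Affect] = {
    "hunger": hunger,
    "social_warmth": social_warmth,
    "pain": pain,
}


def reward(obs: Observation, constellation: Dict[str, float]) -> float:
    """Total reward: a weighted sum over the affects present in the constellation."""
    return sum(weight * AFFECTS[name](obs) for name, weight in constellation.items())


# Two constellations evaluated on the same situation.
obs = {"energy": 0.5, "near_agent": 1.0, "pain": 0.0}
loner = {"hunger": 1.0, "pain": 1.0}
social = {"hunger": 1.0, "social_warmth": 2.0, "pain": 1.0}
print(reward(obs, loner), reward(obs, social))
```

The same situation gets a different valence depending on which affects the agent has and how strongly they are weighted, which is the kind of comparison I have in mind.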

The purpose of this post is to introduce a few ways to proceed with the task of encoding affects. First, there will be a rudimentary definition of the components and some motivational points on what this all could mean. After that there will be an introduction to three distinct levels of representation for various use cases, from philosophical to ontological, and finally pseudocode.


This post is meant to act as a conversation starter, so many points might be alarmingly concise.

The formalities are meant to replicate the functionality of human behaviour, and are not claimed to be exact copies of the neurological mechanisms themselves. Tests should be performed to find out what emerges in the end. Some points might be controversial, so discussion is welcome.


Alright, let’s define the two a bit more in depth and see how they compare.

Reward functions are part of reinforcement learning, where a “computer program interacts with a dynamic environment in which it must perform a certain goal [note: the goal here is the programmer’s] (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that’s analogous to rewards, which it tries to maximize.”

- Wikipedia, reinforcement learning

Specific points about reward functions that will conveniently compare well with my argument:

  • Reward functions are handcrafted functions that measure the agent’s input and assign a score according to how well the agent is doing.

  • They are what connect the state of the world to the best actions to take, encoded into the memory of the agent.

  • Rewards are essential in generating values, policies, models and goals.

Affect theory is “the idea that feelings and emotions are the primary motives for human behaviour, with people desiring to maximise their positive feelings and minimise their negative ones. Within the theory, affects are considered to be innate and universal responses that create consciousness and direct cognition.”

- APA, affect theory

Specific points about affects (citations pending):

  • Affects are the preprogrammed behavioural cues we got from the evolutionary bottleneck, our genes.

  • They direct what our cortical regions learn about the world, forming the basis for our values and goals.

  • Without affects, we would have no values; without values, we would have no goals [1, 2].

Disclaimer: The claim here is not that all values and goals necessarily come directly from emotions later in life, when they can be based on other values and existing knowledge. Rather, the claim is that our very first values came from affects during infancy and childhood, and thus the ultimate source of all values is, in the end, affects.

Further elaboration can also be found in appraisal theory and affective neuroscience.

So what is common?

Both frameworks define what the agent can learn, what it values, and what goals it can create based on these values. I will posit here even further that neither humans nor AI would “learn what to do” if there weren’t any criteria towards which to learn; they would perform only reflexive and random actions. We can see this clearly from the definition of RL agents: remove their reward function, and they cannot learn the “relevant” connections from the environment they work in. With humans we could study brain lesion patients and birth defects, but more on that later. What I have found thus far is inconclusive, but the search continues.
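The RL half of this claim is easy to check on a toy problem. Below is a minimal sketch, assuming a five-state corridor and plain tabular Q-learning with purely random exploration; the environment, the names `goal_reward` and `no_reward`, and all hyperparameters are made up for the illustration.

```python
import random

N_STATES = 5        # states 0..4 in a corridor; state 4 is terminal
ACTIONS = (-1, +1)  # action 0 steps left, action 1 steps right


def goal_reward(next_state: int) -> float:
    # Handcrafted reward: reaching the right end of the corridor is "good".
    return 1.0 if next_state == N_STATES - 1 else 0.0


def no_reward(next_state: int) -> float:
    # The same agent with its reward function removed.
    return 0.0


def q_learning(reward_fn, episodes=200, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning; actions are chosen uniformly at random while learning."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.randrange(2)
            # Walls clamp the agent inside the corridor.
            next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
            target = reward_fn(next_state) + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Preferred action per non-terminal state; None means no preference at all."""
    policy = []
    for left, right in q[:-1]:
        if left == right:
            policy.append(None)
        else:
            policy.append("right" if right > left else "left")
    return policy


print(greedy_policy(q_learning(goal_reward)))
print(greedy_policy(q_learning(no_reward)))
```

With `goal_reward` the agent ends up preferring “right” in every state. With `no_reward` every Q-value stays at zero, so the agent has no preference anywhere and can only keep acting randomly.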

But what does it all mean?

Meanwhile, let’s discuss a number of beliefs I have regarding the topic; some are more certain than others. Each of these could have a discussion of its own, but I will simply list them here for now.

  1. Turning affects into reward functions will enable agents to attain “human-like intelligence”. Note, NOT human-level, but an intelligence with possibly the same biases, such as the bias to learn social interactions more readily.

  2. Affects are our prime example of an evaluation system working in a general intelligence. Although they might not be optimal together, we can compare various constellations of affects and see which ones produce which human-like behaviours.

  3. We could align AI better for humans if we knew more about how we ourselves form our values.

  4. We could also formulate an extra layer for rationality if we better understood the birth and emergence of various value sets.

  5. We could communicate better with AI if its vocabulary were similar to ours. Introducing the same need to form social connections, and directing its learning towards speech, could allow an AI to learn language as we do.

  6. If our goals are defined by years of experience on top of affects, we are hard pressed to define such goals for the AI. If we tell it to “not kill humans”, it does not have an understanding of ‘human’, ‘kill’ or even ‘not’ until it has formed those concepts in its knowledge base/cortical regions over years’ worth of experience (e.g. GPT has “centuries’ worth of experience” in its training data). At that point it is likely too late.

  7. If all goals and values stem from emotions, then:

  • There are no profound reasons to have a certain value set, as they all originate from the possibly arbitrary set of affects we have. But certain values can arise as natural by-products of certain instrumental goals, such as survival.

  • Rationality can thus be exercised only with respect to a certain set of values.

Alright, back to business.

Examples on representation

Here are three different levels of representation for encoding affects and their interplay within consciousness, from philosophical ponderings to actual pseudocode.


Philosophical pondering: Surprise

Surprise was already partially developed with TD-Lambda, but has since been further refined by DeepMind with episodic curiosity.

In practice, the agent could constantly predict what it will perceive next. The amount of surprise could then be made directly proportional to the prediction error, so that a wildly wrong prediction produces a large surprise.
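As a rough sketch of that idea in Python: the class below is neither TD-Lambda nor episodic curiosity, just a stand-in predictor (an exponential moving average over a single scalar observation) that I assume here to keep the example short. The names `SurprisePredictor`, `learning_rate` and `scale` are hypothetical.

```python
class SurprisePredictor:
    """Predicts the next scalar observation and reports surprise as scaled prediction error."""

    def __init__(self, learning_rate: float = 0.5, scale: float = 1.0):
        self.learning_rate = learning_rate  # how quickly the prediction adapts
        self.scale = scale                  # proportionality constant for surprise
        self.prediction = 0.0               # the agent's current guess of what comes next

    def step(self, observation: float) -> float:
        # Surprise is directly proportional to the absolute prediction error.
        error = observation - self.prediction
        surprise = self.scale * abs(error)
        # Move the prediction towards what was actually perceived.
        self.prediction += self.learning_rate * error
        return surprise


predictor = SurprisePredictor()
for observation in [1.0, 1.0, 1.0, 1.0, 5.0]:
    print(observation, predictor.step(observation))
```

While the input stays the same, the surprise shrinks at every step as the prediction catches up; the sudden jump to 5.0 produces a large spike.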

Ontology card: Reciprocity violation

An example of a more fleshed out ontology with the theoretical fields addressed.

This is a card we designed with an affective psychologist some time ago. It basically outlines the interface for the affect within the whole system.

Pseudocode: Learning the concept ‘agent’

Affects also have implicit requirements for their functionality, such as the concept “another agent”, to which all of the social affects are tied. Sorry, this one is a bit of a mess, but the idea should be visible.

This pseudocode addresses a developmental window we humans have, which helps us form the concept of “agents” faster and more reliably. The concept is a learned feature within our consciousness, but because many of our social affects are based on it, the genes have to make sure that the brain has learned it (and then link to it via a preprogrammed pointer, meaning the ‘affect’). We can see this mechanism partially breaking in some cases of severe autism, where the child doesn’t look people in the eyes.
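For readers who prefer something runnable, here is one way the developmental window could be sketched in Python. This is my own simplified reading of the idea, not a model of the actual neurology: the class `AgentConceptLearner`, the numeric defaults, and the rule that the concept forms faster inside the window are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentConceptLearner:
    window_end: int = 100         # hypothetical length of the developmental window, in steps
    window_boost: float = 4.0     # hypothetical speed-up of learning inside the window
    face_bonus: float = 1.0       # innate reward for attending to face-like stimuli
    learning_rate: float = 0.05   # baseline rate at which the concept forms
    link_threshold: float = 0.8   # confidence needed before social affects get linked
    age: int = 0
    confidence: float = 0.0       # how well the learned concept "agent" has formed
    linked_concept: Optional[str] = None  # target of the "preprogrammed pointer"

    def step(self, sees_face: bool) -> float:
        """Process one observation and return the innate reward it triggers."""
        in_window = self.age < self.window_end
        self.age += 1
        reward = 0.0
        if sees_face:
            # Inside the window, faces are rewarding and the concept forms faster.
            rate = self.learning_rate * (self.window_boost if in_window else 1.0)
            self.confidence += rate * (1.0 - self.confidence)
            if in_window:
                reward = self.face_bonus
        # Once the concept is reliable enough, tie the social affects to it.
        if self.linked_concept is None and self.confidence >= self.link_threshold:
            self.linked_concept = "agent"
        return reward


early = AgentConceptLearner()              # sees faces during the window
late = AgentConceptLearner(window_end=0)   # same exposure, but the window is already closed
for _ in range(10):
    early.step(sees_face=True)
    late.step(sees_face=True)
print(early.linked_concept, late.linked_concept)
```

Given the same ten exposures to faces, the learner inside the window forms the concept and links it, while the one outside the window has not yet done so. A fuller version would let the face reward steer attention instead of boosting the learning rate directly.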

These pieces of pseudocode could eventually be combined into a framework of actual code and tested in places such as OpenAI’s multi-agent environment.

Alright, that’s it for this one. I’ll just end this with a repetition of the disclaimer:

Tests should be performed to find out what emerges in the end. Some points might be controversial, so discussion is welcome, and this material will eventually be rectified.