Explaining the AI Alignment Problem to Tibetan Buddhist Monks

Introduction

As part of an exchange being facilitated between religion and science, a group of academics has been asked to compile short descriptions of their greatest scientific achievement or discovery, which will be translated into Tibetan and presented to Tibetan Buddhist scholars and monks.[1]

I was also invited to contribute, but I sort of ignored the instruction and decided to present an introduction to the AI Alignment Problem instead. It was a fun exercise in pedagogy, communication, and outreach :)

I decided to share a draft version here in case people find it interesting, are interested in AI Alignment outreach, or feel like giving feedback. Note that I wrote it with the cultural context of Tibetan Buddhism in mind, so I made some stylistic choices that might seem strange otherwise.

Agents and the AI Alignment Problem

Introduction

This writing aims to help us make sense of the unique and critical point in history that we find ourselves in. Specifically, we are faced with the possibility that humankind will soon gain the ability to build powerful artificial agents capable of pursuing their own (potentially unwholesome) desires.

We will begin by introducing a framework for understanding how the interaction of a certain class of beings—agents—unfolds and determines the future state of the world they inhabit.

After introducing this framework, we will apply it to understand the potential consequences of creating artificial agents. We will come to understand how the development of artificial agents, and the outcomes they pursue, will profoundly shape the evolution of our world and the fate of future sentient beings, for better or worse.

Agents, Power, and Dynamics

The framework will be introduced sequentially by looking into the nature of agents, power, and the dynamics of the interactions of agents. We will also consider how agents can take actions to increase their power.

Agents

An agent is a being who has preferences for how things are and takes actions to steer things toward how they’d prefer things to be. One can think of an animal that recognizes that it is in a state of hunger and so searches for food to return itself to a state of satiation. Agents may have preferences regarding the phenomena they experience (such as wanting to move away from experiences of pain) or preferences about less directly experienced states, such as the suffering of other beings unknown to them.

Next, we consider a system of agents who may causally influence one another. The action of one agent may affect the state of another. A leopard that hunts a goat certainly influences the goat and steers things towards unfavorable states from the goat’s perspective.

Note that a group of agents may behave as one agent when their preferences are aligned and when they can coordinate with one another. For example, an army consists of individuals who all desire that their side win the battle, and so it can be treated as a single agent.

It is useful to think of agents as acting on some “external world”: the actions of an agent influence the state of the world, and the state of the world in turn affects the states of other agents. We note that because the state of an agent and the state of the world are interdependent, we can talk about agents having preferences for their own states and states of the world interchangeably.

Finally, we introduce the notion of “natural forces”: entities that can be said to influence the states of agents and the world, but do not have preferences themselves and so are not agents. Rain, a natural force, influences the growth of crops on the land, which in turn affects the farmer’s state.

Power

Once we have a system of agents with varying preferences for states of the world, who are all taking actions to steer the world towards their own preferred states, we may quickly arrive at the problem of conflict. Returning to our earlier example, the leopard wishes to be no longer hungry, translating to a state of the world in which it has killed and eaten the goat. The goat wishes to be alive and not be killed and eaten by the leopard. They both take actions to try to achieve their preferred state, but in the end, because their preferences conflict, they cannot both be satisfied.

We define power as an approximate measure of an agent’s ability to steer the world towards its preferences. Power is not easy to measure, and there are subtleties in this definition, but as long as we speak in broad terms, it points to something useful.

We note that the power of an agent is always relative to the preferences of the agent, the world, other agents, and natural forces. A person skilled in combat may be able to defeat their opponents easily (and hence can be said to be powerful), yet against an army, or in the domain of politics, or when they are ill, they may find themselves easily defeated.

The key point is that powerful agents and their preferences are, by definition, strong causal factors in how the world unfolds, a point that we will return to in the next section.

Dynamics

Having introduced these fundamental notions of agents and power, we can now make some observations related to how a system of interacting agents evolves over time when agents have competing preferences.

The first observation we make is that agents with little power relative to other agents are unlikely to be able to steer the world if their preferences are at odds with those of other agents.

Next, we observe that if a single agent (or a group of agents with mutually aligned preferences) has significantly more power than the other agents (and natural forces) in the system, the state of the world will very likely evolve toward the preferences of that powerful agent.

Finally, we note that more complex dynamics emerge when several agents within the system have roughly equal power relative to one another. The outcomes of such conflicts can be quite unpredictable. One agent may win the conflict because of the specific conditions of the time. However, the intense friction generated by equally balanced agents trying desperately to gain the upper hand can also result in actions that lead to states unintended by any agent in the system.

In summary, we see that the future of the world is primarily determined by the desires and interactions of the most powerful agents and natural forces in the world.

Accumulating Power

Not only can agents spend their energy taking actions to steer the world, they can also accumulate more power, thereby increasing their ability to steer the world in the future.

Some ways that an agent can increase its own power include increasing its resources such as money, land, etc.; increasing its knowledge, intelligence, or wisdom, and hence its ability to strategize and choose actions that will have the desired outcomes; acquiring new technology that gives the agent access to new actions or reduces the energy needed to perform an action; or removing other agents who attempt to steer the world away from the agent’s preferences.

An agent can also increase their ability to steer the world in the future by causing other agents to take on their preferences or by creating new agents that share them. There are many ways this can be done. For instance, they can pay other agents to temporarily spend their energy pursuing outcomes other than their own, convert other agents through charisma or rhetoric, or bear children in the hopes that they will continue to work towards their parent’s goals. Each method aims to increase the force steering the world towards the agent’s desired outcome.

The ability of desired outcomes to spread from one agent to another gives rise to an alternative point of view on what entities primarily create future states. It is almost as though the ability of future states to embed themselves within the hearts of agents and drive the agents to bring that future state into existence makes the future states themselves the primary drivers of the dynamic: they retroactively bring themselves into existence.

Artificial Agents

Having developed a way of thinking about agents, power, and the dynamics of their interactions, we will now apply this view to reflect on the consequences of bringing human-made agents into existence.

In recent years, technological transformation has been rapid, leading both to great benefits, such as improved medical care, and to the potential for new forms of destruction.

A new technology unlike anything seen before is being developed: a method of creating artificial agents endowed with whatever desires their creator specifies. The motivation for creating such agents is clear; as previously discussed, one way for an agent to bring about its preferred outcome is to build new agents that share its desires. This motivation, together with recent scientific breakthroughs suggesting that such agents might soon be possible, is driving extraordinary effort and resources toward realizing this technology.

These artificially created agents have the potential to become more powerful than human agents due to the nature of their artificial bodies. These bodies allow for vast abilities and powers unavailable to ordinary humans. These abilities include taking on a range of diverse and new physical forms, existing in multiple physical locations at once, traveling great distances almost instantly, increasing their intelligence far beyond what any human is capable of, and so on.

As it stands today, the way we are set to create these artificial agents is more like summoning powerful deities to do our bidding than like constructing a building according to a well-specified plan. It is a very real concern, shared by many leading experts, that these artificial agents may end up with unintended harmful desires despite their creators’ best intentions.

Concluding Thoughts

The desires and interactions of powerful agents determine the unfolding of our world. It has been so before, and this universal law will continue to hold.

Organizations are working intensely to build powerful artificial agents who will come to dominate the unfolding of our world. Despite the risks involved, these organizations are pushing ahead, perhaps afraid of losing their power to competing factions that succeed in building artificial agents first.

If these organizations are not careful, the artificial agents they build may acquire selfish desires of their own or inherit the desires of selfish individuals, and the evolution of our world could then be fraught with great suffering.

If the development of these agents continues, hope lies in understanding how to implant compassion and the intention for the flourishing of all sentient beings at the center of their hearts. If we succeed, the future could be beyond our wildest hopes.

May the thoughts presented here inspire progress towards this end and benefit all beings!

[1] Karl Friston is one of the contributors to this project. I’m curious how the monks will get on with ideas such as the Free Energy Principle and Active Inference …