AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

This is an extended transcript of the talk I gave at EAGxAsiaPacific 2020. In the talk, I present a somewhat critical take on how AI alignment has grown as a field, and argue that, from my perspective, it deserves considerably more philosophical and disciplinary diversity than it has enjoyed so far. I’m sharing it here in the hopes of generating discussion about the disciplinary and philosophical paradigms that (as I understand it) the AI alignment community is rooted in, and whether or how we should move beyond them. Some sections cover introductory material that most people here are likely to be familiar with, so feel free to skip them.

The Talk

Hey everyone, my name is Xuan (IPA: ɕɥɛn), and I’m a doctoral student at MIT doing cognitive AI research. Specifically, I work on how we can infer the hidden structure of human motivations by modeling humans using probabilistic programs. Today though I’ll be talking about something more in the background that informs my work, and that’s about AI alignment, philosophical pluralism, and the relevance of non-Western philosophy.

This talk will cover a lot of ground, so I want to give an overview to keep everyone oriented:

  1. First, I’ll give a brief introduction to what AI alignment is, and why it likely matters as an effective cause area.

  2. I’ll then highlight some of the philosophical tendencies of current AI alignment research, and argue that they reflect a relatively narrow set of philosophical views.

  3. Since these philosophical views may miss crucial considerations, I’ll argue that this motivates the need for greater philosophical and disciplinary pluralism.

  4. And then as a kind of proof by example, I’ll aim to demonstrate how non-Western philosophy might provide insight into several open problems in AI alignment research.

A brief introduction to AI alignment

So what is AI alignment? One way to cash it out is as the project of building intelligent systems that robustly act in our collective interests — in other words, building AI that is aligned with our values. As many people in the EA community have argued, this is a highly impactful cause area if you believe the following:

  1. AI will determine the future of our civilization, perhaps by replacing humanity as the most intelligent agents on this planet, or by having some other kind of transformative impact, like enabling authoritarian dystopias.

  2. AI will likely be misaligned with our collective interests by default, perhaps because it’s just very hard to specify what our values are, or because of bad systemic incentives.

  3. Not only is this problem really difficult to solve, we also cannot delay solving it.

To that last point, basically everyone who works in AI alignment thinks it’s a really daunting technical and philosophical challenge. Human values, whatever they are, are incredibly complex and fragile, and so every seemingly simple solution to aligning superhuman AI is subject to potentially catastrophic loopholes.

I’ll illustrate this by way of this short dialogue between a human and a fictional super-intelligent chatbot called GPT-5, who’s kind of like this genie in a bottle. So you start up this chatbot and you ask:

Human: Dear GPT-5, please make everyone on this planet happy.

GPT-5: Okay, I will place them in stasis and inject heroin so they experience eternal bliss.

Human: No no no, please don’t. I mean satisfy their preferences. Not everyone wants heroin.

GPT-5: Alright. But how should I figure out what those preferences are?

Human: Just listen to what they say they want! Or infer it from how they act.

GPT-5: Hmm. This person says they can’t bear to hurt animals, but keeps eating meat.

Human: Well, do what they would want if they could think longer, or had more willpower!

GPT-5: I extrapolate that they will come to support human extinction to save other species.

Human: Actually, just stop.

GPT-5: How do I know if that’s what you really want?
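This failure mode, in which an optimizer satisfies the letter of an objective while violating its spirit, can be made concrete with a toy sketch. Every action name and score below is invented purely for illustration:

```python
# Toy sketch of objective misspecification: an optimizer maximizing a
# proxy score for "happiness" picks the degenerate action the proxy
# rates highest, even though it is the action we would least endorse.
# All actions and numbers here are hypothetical.

actions = {
    "improve_healthcare":          {"proxy_happiness": 0.6, "endorsed": True},
    "reduce_poverty":              {"proxy_happiness": 0.7, "endorsed": True},
    "inject_everyone_with_heroin": {"proxy_happiness": 1.0, "endorsed": False},
}

def optimal_action(actions):
    """Return the action with the highest proxy score."""
    return max(actions, key=lambda a: actions[a]["proxy_happiness"])

best = optimal_action(actions)  # the proxy-optimal action is the unendorsed one
```

The point is not that any real system scores actions this way, but that whatever stands in for `proxy_happiness` will be optimized literally, loopholes included.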

An overview of the field

So that’s a taste of the kind of problem we need to solve. Obviously there’s a lot to unpack here about philosophy, what people really want, what desires are, what preferences are, and whether we should always satisfy those preferences. Before diving more into that, I think it’d be helpful to give a sense of what AI alignment research is like today, so we can get a better sense of what might still be needed to answer these daunting questions.

There have been multiple taxonomies of AI alignment research, one of the earlier ones being Concrete Problems in AI Safety in 2016, which suggested topics like avoiding negative side effects and safe exploration. In 2018, DeepMind offered another categorization, breaking things down into specification, robustness, and assurance. And at EA Global 2020, Rohin Shah laid out another useful way of thinking about the space, breaking specification down into outer and inner alignment, and highlighting the question of scaling to superhuman competence while preserving alignment.

One notable feature of these taxonomies is their decidedly engineering bent. You might be wondering — where is the philosophy in all this? Didn’t we say there were philosophical challenges? And it’s actually there, but you have to look closely. It’s often obscured by the technical jargon. In addition, there’s this tendency to formalize philosophical and ethical questions as questions about rewards and policies and utility functions — which I think is something that can be done a little too quickly.

Another way to get a sense of what might currently be missing in AI alignment is to look at the ecosystem and its key players.

AI alignment is actually a really small but growing field, composed of entities like MIRI, FHI, OpenAI, the Alignment Forum, and so on. Most of these organizations are really young, often less than 5 years old — and I think it’s fair to say that they’ve been a little insular as well. Because if you think about AI alignment as a field, and the problems it’s trying to solve, you’d think it must be this really interdisciplinary field that sits at the intersection of broader disciplines, like human-computer interaction, cognitive science, AI ethics, and philosophy.

The relative lack of overlap between the AI alignment community and related disciplines.

But to my knowledge, there actually isn’t very much overlap between these communities — it’s more off-to-the-side, like in the picture above. There are reasons for this, which I’ll get to, and it’s already starting to change, but I think it partly explains the relatively narrow philosophical horizons of the AI alignment community.

Philosophical tendencies in AI alignment

So what are these horizons? I’m going to lay out 5 philosophical tendencies that I’ve perceived in the work that comes out of the AI alignment community — so this is inevitably going to be subjective — but it’s based on the work that gets highlighted in venues like the Alignment Newsletter, or that gets discussed on the AI Alignment forum.

Five philosophical tendencies of contemporary AI alignment research:
(1) Connectionism, (2) Behaviorism, (3) Humeanism, (4) Decision-Theoretic Rationality, (5) Consequentialism.

1. Connectionist (vs. symbolic)

First there’s a tendency towards connectionism — the position that knowledge is best stored as sub-symbolic weights in neural networks, rather than language-like symbols. You see this in emphasis on deep learning interpretability, scalability, and robustness.

2. Behaviorist (vs. cognitivist)

Second, there’s a tendency towards behaviorism — that to build human-aligned AI, we can model or mimic humans as these reinforcement learning agents, which avoid reasoning or planning by just learning from lifetimes and lifetimes of data. This is in contrast to more cognitive approaches to AI, which emphasize the ability to reason with and manipulate abstract models of the world.

3. Humean (vs. Kantian)

Third, there’s an implicit tendency towards Humean theories of motivation — that we can model humans as motivated by reward signals they receive from the environment, which you might think of as “desires”, or “passions” as David Hume called them. This is in contrast to more Kantian theories of motivation, which leave more room for humans to also be motivated by reasons, e.g., commitments, intentions, or moral principles.

4. Rationality as decision-theoretic (vs. reasonableness /​ sense-making)

Fourth, there’s a tendency to view rationality solely in decision-theoretic terms — that is, rationality is about maximizing expected value, where probabilities are rationally updated in a Bayesian manner. But historically, in philosophy, there’s been a lot more to norms of reasoning and rationality than just that — rationality is also about logic, argumentation, and dialectic. Broadly, it’s about what it makes sense for a person to think or do, including what it makes sense for a person to value in the first place.
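To make the decision-theoretic picture concrete, here is a minimal sketch of that model of rationality: update beliefs by Bayes’ rule, then pick the action with the highest expected utility. The states, likelihoods, and utilities are all invented for illustration.

```python
# Minimal sketch of decision-theoretic rationality: Bayesian belief
# updating followed by expected-utility maximization. All numbers are
# illustrative.

def bayes_update(prior, likelihood):
    """Posterior over states: P(s | e) is proportional to P(s) * P(e | s)."""
    unnorm = {s: prior[s] * likelihood[s] for s in prior}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

def best_action(posterior, utility):
    """Action maximizing expected utility: argmax over a of sum_s P(s) U(a, s)."""
    acts = {a for (a, _) in utility}
    return max(acts, key=lambda a: sum(posterior[s] * utility[(a, s)]
                                       for s in posterior))

prior = {"rain": 0.5, "sun": 0.5}
likelihood = {"rain": 0.9, "sun": 0.2}       # P(dark clouds observed | state)
posterior = bayes_update(prior, likelihood)  # rain now ~0.82

utility = {("umbrella", "rain"): 1.0, ("umbrella", "sun"): 0.6,
           ("no_umbrella", "rain"): 0.0, ("no_umbrella", "sun"): 1.0}
act = best_action(posterior, utility)        # "umbrella"
```

Notice how much this loop leaves out on richer views of rationality: where the utility function comes from, and whether it makes sense to hold it at all.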

5. Consequentialist (vs. non-consequentialist)

Finally, there’s a tendency towards consequentialism — consequentialism in the broad sense that value and ethics are about outcomes or states of affairs. This excludes views that root value/​ethics in evaluative attitudes, deontic norms, or contractualism.

From parochialism to pluralism

By laying out these tendencies, I want to suggest that the predominant views within AI alignment live within a relatively small corner of the full space of contemporary philosophical positions. If this is true, it should give us pause. Why these tendencies? Of course, it’s partly that a lot of very smart people thought very hard about these things, and this is what made sense to them. But very smart people may still be systematically biased by their intellectual environments and trajectories.

Might this be happening with AI alignment researchers? It’s worth noting that the first three of these tendencies are very much influenced by recent successes of deep learning and reinforcement learning in AI. In fact, prior to these successes, a lot of work in AI was more on the other end of the spectrum: first order logic, classical planning, cognitive systems, etc. One worry then, is that the attention of AI alignment researchers might be unduly influenced by the success or popularity of contemporary AI paradigms.

It’s also notable that the last two of these tendencies are largely inherited from disciplines like economics, computer science, and communities like effective altruism. Another worry then, would be that these origins have unduly influenced the paradigms and concepts that we take as foundational.

So at this point, I hope to have shown how the AI alignment research community exists in a bit of a philosophical bubble. And so in that sense, if you’ll forgive the term, the community is rather parochial.

Reasons for parochialism, and steps towards pluralism.

And there are understandable reasons for this. For one, AI alignment is still a young field, and hasn’t reached a more diverse pool of researchers. Until recently, it was also excluded from and not taken very seriously within traditional academia, leading to a lack of intra-disciplinary and inter-disciplinary conversation, and a continued suspicion of academia in some quarters. Obviously, there are also strong founder effects due to the field’s emergence within rationalist and EA communities. And like much of AI and STEM, it inherits barriers to participation from an unjust world.

These can be, and in my opinion should be, addressed. As the field grows, we could make sure it includes more disciplinary and community outsiders. We could foster greater inter-disciplinary collaboration within academia. We could better recognize how founder effects may bias our search through the space of relevant ideas. And we could lower the barriers to participation, while countering unjust selection effects.

Why pluralism? (And not just diversity?)

But why bother? What exactly is the value in breaking out of this philosophical bubble? I haven’t quite explained that yet, so I’ll do that now. And why do I use the word pluralism in particular, as opposed to just diversity? I chose it because I wanted it to evoke something more than just diversity.

By philosophical pluralism, I mean to include philosophical diversity, by which I mean serious engagement with multiple philosophical traditions and disciplinary paradigms. But I also mean openness to the possibility that the problem of aligning AI might have multiple good answers, and that we need to contend with how to accommodate them. Having defined those terms, let’s get into the reasons.

A summary of reasons for philosophical pluralism in AI alignment.

1. Avoiding the streetlight fallacy

The first is avoiding the streetlight fallacy — that if we simply keep exploring the philosophy that’s familiar to Western-educated elites, we are likely to miss out on huge swathes of human thought that may have crucial relevance to AI alignment.

Jay Garfield puts this quite sharply in his book on Engaging Buddhism. Speaking to Western philosophers about Buddhist philosophy, he argues that Buddhist philosophy shares too many concerns with Western philosophy to be ignored:

“Contemporary philosophy cannot continue to be practiced in the West in ignorance of the Buddhist tradition. … Its concerns overlap with those of Western philosophy too broadly to dismiss it as irrelevant. Its perspectives are sufficiently distinct that we cannot see it as simply redundant. Close enough for conversation; distant enough for that conversation to be one from which we might learn. … [T]o continue to ignore Buddhist philosophy (and by extension, Chinese philosophy, non-Buddhist Indian philosophy, African philosophy, Native American philosophy...) is indefensible.”

— Jay Garfield, Engaging Buddhism: Why It Matters to Philosophy (2015)

2. Robustness to moral and normative uncertainty

The second is robustness to moral and normative uncertainty. If you’re unsure about what the right thing to do is, or to align an AI towards, and you think it’s plausible that other philosophical perspectives might have good answers, then it’s reasonable to diversify our resources to incorporate them.

This is similar to the argument that Open Philanthropy makes for worldview diversification (and related to the informational situation of having imprecise credences, discussed briefly by MacAskill, Bykvist and Ord in Moral Uncertainty):

“When deciding between worldviews, there is a case to be made for simply taking our best guess, and sticking with it. If we did this, we would focus exclusively on animal welfare, or on global catastrophic risks, or global health and development, or on another category of giving, with no attention to the others. However, that’s not the approach we’re currently taking. Instead, we’re practicing worldview diversification: putting significant resources behind each worldview that we find highly plausible. We think it’s possible for us to be a transformative funder in each of a number of different causes, and we don’t—as of today—want to pass up that opportunity to focus exclusively on one and get rapidly diminishing returns.”

— Holden Karnofsky, Open Philanthropy CEO, Worldview Diversification (2016)

3. Pluralism as (political) pragmatism

The third is pluralism as a form of political pragmatism. As Iason Gabriel at DeepMind writes: In the absence of moral agreement, is there a fair way to decide what principles AI should align with? Gabriel doesn’t really put it this way, but one way to interpret this is that pluralism is pragmatic because it’s the only way we’re going to get buy-in from disparate political actors.

“[W]e need to be clear about the challenge at hand. For the task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines. Rather, it is to find a way of selecting appropriate principles that is compatible with the fact that we live in a diverse world, where people hold a variety of reasonable and contrasting beliefs about value. … To avoid a situation in which some people simply impose their values on others, we need to ask a different question: In the absence of moral agreement, is there a fair way to decide what principles AI should align with?”

— Iason Gabriel, DeepMind, Artificial Intelligence, Values, and Alignment (2020)

4. Pluralism as respect for the equality and autonomy of persons

Finally, there’s pluralism as an ethical commitment in itself — pluralism as respect for the equality and autonomy of persons to choose what values and ideals matter to them. This is the reason I personally find the most compelling — I think in order to preserve a lot of what we care about in this world, we need aligned AI to respect this plurality of value.

Elizabeth Anderson puts this quite beautifully in her book, Value in Ethics and Economics. Noting that individuals may rationally adopt or uphold a great diversity of worthwhile ideals, she argues that we lack good reason for impersonally ranking all legitimate ways of life on some universal scale. If we accept that there may be conflicting yet legitimate philosophies about what constitutes a good life, then we also have to accept that there may be multiple incommensurable scales of value that matter to people:

“There is a great diversity of worthwhile ideals, not all of which can be combined in a single life. … Individuals with different talents, temperaments, interests, opportunities, and relations to others rationally adopt or uphold different ideals. … In [a] liberal, pluralist, egalitarian society, there is no longer any point in impersonally ranking all legitimate ways of life on some hierarchy of intrinsic value. Plural and conflicting yet legitimate ideals will tell different people to value different [ways of living], and there is no point in insisting that a single ranking is impersonally valid for everyone.”

— Elizabeth Anderson, Value in Ethics and Economics (1995)

The relevance of non-Western philosophy

So that’s why I think pluralism matters to AI alignment. Perhaps you buy that, but perhaps it’s hard to think of concrete examples where non-dominant philosophies may be relevant to alignment research. So now I’d just like to offer a few. I think non-Western philosophy might be especially relevant to the following open problems in AI alignment:

3 areas where non-Western philosophies may be relevant to AI alignment
  1. Representing and learning human norms. What are norms? How do they constrain our actions or shape our values? How do learners infer and internalize them from their social environments? Classical Chinese ethics, especially Confucian ethics, could provide some insights.

  2. Robustness to ontological shifts and crises. We typically value the world in terms of the objects and relations we use to represent it. But what should an agent do when those representations undergo (transformative) shifts? Certain schools of Buddhist metaphysics bear directly on these questions.

  3. The phenomenology of valuing (e.g. desiring) and disvaluing (e.g. suffering). We value different things in different ways, with different subjective experiences. What are these varieties of experience, and how should they inform agents that try to learn what we value? Buddhist, Jain and Vedic philosophy have been very much centered on the nature of these subjective experiences, and could provide answers.

Before I go on, I also wanted to note that this is primarily drawn from only the limited amount of Chinese and Buddhist philosophy I’m familiar with. This is certainly not all of non-Western philosophy, and there’s a lot more out there, outside of the streetlight, that may be relevant.

1. Representing and learning human norms

Why do social norms and practices matter? One answer that’s common from game theory is that norms have instrumental value as coordinating devices or unspoken agreements. To the extent that we need AI to coordinate well with humans then, we may need AI to learn and follow these norms.
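On this game-theoretic picture, a norm is just a convention that solves a coordination problem. A minimal sketch, with payoffs invented for illustration:

```python
# Toy coordination game: either shared convention ("everyone goes left"
# or "everyone goes right") is a Nash equilibrium, so on the purely
# instrumental picture a norm's value lies only in getting players to
# match. Payoffs are illustrative.

ACTIONS = ["left", "right"]
PAYOFF = {  # (row action, column action) -> (row payoff, column payoff)
    ("left", "left"):   (1, 1),
    ("left", "right"):  (0, 0),
    ("right", "left"):  (0, 0),
    ("right", "right"): (1, 1),
}

def is_nash(row, col):
    """True if neither player gains by unilaterally deviating."""
    r, c = PAYOFF[(row, col)]
    row_ok = all(PAYOFF[(a, col)][0] <= r for a in ACTIONS)
    col_ok = all(PAYOFF[(row, a)][1] <= c for a in ACTIONS)
    return row_ok and col_ok

# Both matching conventions are equilibria; mismatched play is not.
```

Nothing in this setup can say why following a convention might matter intrinsically, which is exactly where the Confucian picture diverges.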

If you look to Confucian ethics however, you get a quite different picture. On one possible interpretation of Confucian thought, norms and practices are understood to have intrinsic value as evaluative standards and expressive acts. You can see this for example, in the Analects, which are attributed to Confucius:

Restraining the self and returning to ritual (禮 /​ li) constitutes humaneness (仁 /​ ren).

Analects 12.1

This word, li (禮), is hard to translate, but means something like ritual propriety or etiquette. And it recurs again and again in Confucian thought. This particular line suggests a central role for ritual in what Confucians thought of as a benevolent, humane and virtuous life. How to interpret this? Kwong-loi Shun suggests that this is because, while ritual forms may just be conventions, without these conventions, important evaluative attitudes like respect or reverence cannot be made intelligible or expressed:

“Kwong-loi Shun [...] holds that on the one hand, a particular set of ritual forms are the conventions that a community has evolved, and without such forms attitudes such as respect or reverence cannot be made intelligible or expressed (the truth behind the definitionalist interpretation). In this sense, li constitutes ren within or for a given community. On the other hand, different communities may have different conventions that express respect or reverence, and moreover any given community may revise its conventions in piecemeal though not wholesale fashion (the truth behind the instrumentalist interpretation).”

— David Wong, Chinese Ethics (Stanford Encyclopedia of Philosophy)

I was quite struck by this when I first encountered it — partly because I grew up finding a lot of Confucian thought really pointless and oppressive. And to be clear, some norms are oppressive. But I recently encountered a very similar idea in the work of Elizabeth Anderson (cited earlier) that made me come around more to it. In speaking about how individuals value things, and where we get these values from, Anderson argues that:

“Individuals are not self-sufficient in their capacity to value things in different ways. I am capable of valuing something in a particular way only in a social setting that upholds norms for that mode of valuation. I cannot honor someone outside of a social context in which certain actions … are commonly understood to express honor.”

— Elizabeth Anderson, Value in Ethics and Economics (1995)

I find this really compelling. If you think about what constitutes good art, or literature, or beauty, all of that is undoubtedly tied up in norms about how to value things, and how to express those values.

If this is right, then there’s a sense in which the game theoretic account of norms has got things exactly reversed. In game theory, it’s assumed that norms emerge out of the interaction of individual preferences, and so are secondary. But for Confucians, and Anderson, it’s the opposite: norms are primary, or at least a lot of them are, and what we individually value is shaped by those norms.

This would suggest a pretty deep re-orientation of what AI alignment approaches that learn human values need to do. Rather than learn individual values, then figure out how to balance them across society, we need to consider that many values are social from the outset.

All of this dovetails quite nicely with one of the key insights in the paper Incomplete Contracting and AI Alignment:

“Building AI that can reliably learn, predict, and respond to a human community’s normative structure is a distinct research program to building AI that can learn human preferences. … To the extent that preferences merely capture the valuation an agent places on different courses of action with normative salience to a group, preferences are the outcome of the process of evaluating likely community responses and choosing actions on that basis, not a primitive of choice.”

— Hadfield-Menell & Hadfield, Incomplete Contracting & AI Alignment (2018)

Here again, we see reiterated the idea that social norms constitute (at least some) individual preferences. What all of this suggests is that, if we want to accurately model human preferences, we may need to model the causal and social processes by which individuals learn and internalize norms: observation, instruction, ritual practice, punishment, etc.
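One way such norm learning could be made concrete is as Bayesian inference over candidate norms from observed social reactions like sanctions. The candidate norms, observations, and likelihoods below are entirely made up for illustration.

```python
# Toy sketch of norm inference: an observer infers which norm a
# community holds by watching how others react to behavior. All norms,
# observations, and probabilities here are hypothetical.

PRIOR = {"no_norm": 0.5, "queueing_norm": 0.5}

# P(observation | norm): under a queueing norm, cutting in line tends
# to draw a sanction; with no norm, sanctions are rare.
LIKELIHOOD = {
    ("cut_in_line_sanctioned", "no_norm"):       0.1,
    ("cut_in_line_sanctioned", "queueing_norm"): 0.8,
}

def infer_norm(prior, observations):
    """Posterior over candidate norms after observing social reactions."""
    posterior = dict(prior)
    for obs in observations:
        unnorm = {n: posterior[n] * LIKELIHOOD[(obs, n)] for n in posterior}
        z = sum(unnorm.values())
        posterior = {n: p / z for n, p in unnorm.items()}
    return posterior

posterior = infer_norm(PRIOR, ["cut_in_line_sanctioned"] * 3)
# repeated sanctions make the queueing norm overwhelmingly likely
```

A fuller model would also have to capture instruction, ritual practice, and internalization, not just inference from sanctions.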

Furthermore, when it comes to human values, then at least in some domains (e.g. what is beautiful, racist, admirable, or just), we ought to identify what’s valuable not with the revealed preference or even the reflective judgement of a single individual, but with the outcome of some evaluative social process that takes into account pre-existing standards of valuation, particular features of the entity under evaluation, and potentially competing reasons for applying, not applying, or revising those standards.

As it happens, this anti-individualist approach to valuation isn’t particularly prominent in Western philosophical thought (but again, see Anderson). Perhaps then, by looking towards philosophical traditions like Confucianism, we can develop a better sense of how these normative social processes should be modeled.

2. Robustness to ontological shifts and crises

Let’s turn now to a somewhat old problem, first posed by MIRI in 2011: An agent defines its objective based on how it represents the world — but what should happen when that representation is changed?

“An agent’s goal, or utility function, may also be specified in terms of the states of, or entities within, its ontology. If the agent may upgrade or replace its ontology, it faces a crisis: the agent’s original goal may not be well-defined with respect to its new ontology. This crisis must be resolved before the agent can make plans towards achieving its goals.”

— Peter De Blanc, MIRI, Ontological Crises in AI Value Systems (2011)

As it turns out, Buddhist philosophy might provide some answers. To see how, it’s worth comparing it against more commonplace views about the nature of reality and the objects within it. Most of us grow up as what you might call naive realists, believing:

Naive Realism. Through our senses, we perceive the world and its objects directly.

But then we grow up and study some science, and encounter optical illusions, and maybe become representational realists instead:

Representational Realism. We indirectly construct representations of the external world from sense data, but the world being represented is real.

Now, Madhyamaka Buddhism goes further — it rejects the idea that there is anything ultimately real or true. Instead, all facts are at best conventionally true. And while there may exist some mind-independent external world, there is no uniquely privileged representation of that world that is the “correct” one. However some representations are still better for alleviating suffering than others, and so part of the goal of Buddhist practice is to see through our everyday representations as merely conventional, and to adopt representations better suited for alleviating suffering.

This view is demonstrated in The Vimalakīrti Sutra, which actually uses gender as an example of a concept that should be seen through as conventional. I was quite astounded when I first read it, because the topic feels so current, but the text is actually 1800 years old:

The reverend monk, Śāriputra, asks a Goddess why she does not transform her female body into a male body, since she is supposed to be enlightened. In response, she swaps both their bodies, and explains:

Śāriputra, if you were able to transform
This female body,
Then all women would be able to transform as well.

Just as Śāriputra is not female
But manifests a female body
So are all women likewise:

Although they manifest female bodies
They are not, inherently, female.

Therefore, the Buddha has explained
That all phenomena are neither female nor male.

The Vimalakīrti Sutra (circa 200 CE)

All this actually closely resonates, in my opinion, with a recent movement in Western analytic philosophy called conceptual engineering — the idea that we should re-engineer concepts to suit our purposes. For example, Sally Haslanger at MIT has applied this approach in her writings on gender and race, arguing that feminists and anti-racists need to revise these concepts to better suit feminist and anti-racist ends.

I think this methodology is actually a really promising way to deal with the question of ontological shifts. Rather than framing ontological shifts as quasi-exogenous occurrences that agents have to respond to, it frames them as meta-cognitive choices that we make with particular ends in mind. It almost suggests this iterative algorithm for changing our representations of the world:

  1. Fix some evaluative concepts (e.g., accuracy, well-being) and lower-level primitives.

  2. Refine other concepts to do better with respect to those evaluative concepts.

  3. Adjust those evaluative concepts and lower-level primitives in response.

  4. Repeat as necessary.
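The four steps above can be sketched as a loop. Everything here is a placeholder: the “concepts” are just numbers, and refine() and adjust() stand in for the genuinely open questions of how such revisions should actually work.

```python
# Schematic sketch of the iterative conceptual-engineering loop.
# A "concept" and an "evaluative standard" are each reduced to a single
# number so the mutual-adjustment dynamic is visible; this is a toy,
# not a proposal for how real conceptual revision works.

def refine(concept, standard):
    """Step 2: revise a concept to do better by the current evaluative standard."""
    return concept + 0.5 * (standard - concept)

def adjust(standard, concept):
    """Step 3: let the revised concept feed back on the standard itself."""
    return standard + 0.1 * (concept - standard)

def conceptual_engineering(concept, standard, steps=100):
    """Steps 1-4: alternate refinement and adjustment until they settle."""
    for _ in range(steps):
        concept = refine(concept, standard)
        standard = adjust(standard, concept)
    return concept, standard

concept, standard = conceptual_engineering(0.0, 1.0)
# the concept and the evaluative standard converge toward a mutual
# equilibrium, rather than one being held fixed in advance
```

The interesting research questions are precisely what this toy hides: what refinement and adjustment should look like for real conceptual schemes, and whether such a process reaches reasonable fixed points at all.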

How exactly this would work, and whether it would lead to reasonable outcomes, is, I think, really fruitful and open research terrain. I see MIRI’s recent work on Cartesian Frames as a very promising step in this direction, by formalizing the ways in which we might carve up the world into “self” and “other”. When it comes to epistemic values, steps have also been made towards formalizing approximate causal abstractions. And of course, the importance of representational choice for efficient planning has been known since the 60s. What remains lacking is a theory of when and how to apply these representational shifts according to an initial set of desiderata, and then how to reconceive those desiderata in response.

3. The phenomenology of valuing and dis-valuing

On to the final topic of relevance. In AI and economics, it’s very common to just talk about human values in terms of this barebones concept of preference. Preference is an example of what you might call a thin evaluative attitude, which doesn’t have any deeper meaning beyond imposing a certain ordering over actions or outcomes.

In contrast, I think all of us are familiar with a much wider range of evaluative attitudes and experiences: respect, admiration, love, shock, boredom, and so on. These are thick evaluative attitudes. And work in AI alignment hasn’t really tried to account for them. Instead, there’s a tendency to collapse everything into this monolithic concept of “reward”.

And I think that’s very dangerous — we’re not paying attention to the full range of subjective experience, and that may lead to catastrophic outcomes. Instead, I think we need to be engaging more with psychology, phenomenology, and neuroscience. For example, there’s work in the field of neurophenomenology that I think might be really promising for answering some of these questions:

“The use of first-person and second-person phenomenological methods to obtain original and refined first-person data is central to neurophenomenology. It seems true both that people vary in their abilities as observers and reporters of their own experiences and that these abilities can be enhanced through various methods. First-person methods are disciplined practices that subjects can use to increase their sensitivity to their own experiences at various time-scales. These practices involve the systematic training of attention and self-regulation of emotion. Such practices exist in phenomenology, psychotherapy, and contemplative meditative traditions. Using these methods, subjects may be able to gain access to aspects of their experience, such as transient affective state and quality of attention, that otherwise would remain unnoticed and unavailable for verbal report.”

— Thompson et al, Neurophenomenology: An Introduction for Neurophilosophers (2010)

Unsurprisingly, this work is very much informed by engagement with Buddhist, Jain, and Vedic philosophy and practice, because these are entire traditions of philosophical practice devoted to questions like “What is the nature of desire?”, “What is the nature of suffering?”, and “What are the various mental factors that lead to one or the other?”

Does AI alignment require understanding human subjective experience at the incredibly fine level of detail aimed at by neurophenomenology and contemplative traditions? My intuition is that it won’t, simply because we humans are capable of being helpful and beneficial without fully understanding each other’s minds. But we do understand at least that we all have different subjective experiences, which we may value or take as motivating in different ways.

This level of intuitive psychology, I believe, is likely to be necessary for alignment. And AI as a field is nowhere near it. Research into “emotion recognition”, which is perhaps the closest that AI has gotten to these questions, typically reifies emotion into 6 fixed categories, which is not much better than collapsing everything into “reward”. Given that contemplative Dharmic philosophy has long developed systematic methods for investigating the experiential nature of mind, as well as theories about how higher-order awareness relates to experience, it bears promise for informing how AI could learn theories of emotion and evaluative experience, rather than simply having them hard-coded.
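To illustrate how narrow that framing is, here is a sketch of the standard “emotion recognition” setup: a classifier over a fixed label set (Ekman’s six basic emotions), with no representation of where the categories come from or whether they carve experience at its joints. The score vector is hypothetical:

```python
# Typical "emotion recognition" framing: emotion is reified into a
# fixed, hard-coded label set, and recognition reduces to picking the
# highest-scoring label.
EKMAN_SIX = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def classify(scores):
    """Return the label with the highest score.

    `scores` is a hypothetical per-label score vector, e.g. the output
    of some upstream model; its provenance is outside this sketch.
    """
    best = max(range(len(EKMAN_SIX)), key=lambda i: scores[i])
    return EKMAN_SIX[best]

# Any experience outside the six categories is forced into one of them,
# much as collapsing all evaluative attitudes into a scalar "reward".
print(classify([0.1, 0.0, 0.2, 0.5, 0.1, 0.1]))  # "happiness"
```

The design choice to critique here is the fixed label space itself: whatever the upstream model learns, the system can only ever answer in these six words, whereas the proposal in the text is that the theory of emotion itself should be learnable.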

Just as a final illustration of why the study of evaluative experience is important, I want to highlight a question that often comes up in Buddhist philosophy: How can one act effectively in the world without experiencing desire or suffering? Unless you’re interested in attaining awakening, it may not be so relevant to humans, nor to AI alignment per se. But it becomes very relevant once we consider the possibility that we might build AI that suffers itself. In fact, there’s a recent paper on exactly this topic asking: How can we build functionally effective conscious AI without suffering?

“The possibility of machines suffering at our own hands … only applies if the AI that we create or cause to emerge becomes conscious and thereby capable of suffering. In this paper, we examine the nature of the relevant kind of conscious experience, the potential functional reasons for endowing an AI with the capacity for feeling and therefore for suffering, and some of the possible ways of retaining the functional advantages of consciousness, whatever they are, while avoiding the attendant suffering.”

— Agarwal & Edelman, Functionally Effective Conscious AI Without Suffering (2020)

The worry here is that consciousness may have evolved in animals because it serves some function, and so, AI might only reach human-level usefulness if it is conscious. And if it is conscious, it could suffer. Most of us who care about sentient beings besides humans would want to make sure that AI doesn’t suffer — we don’t want to create a race of artificial slaves. So that’s why it might be really important to figure out whether agents can have functional consciousness without suffering.

To address this question, Agarwal & Edelman draw explicitly upon Buddhist philosophy, suggesting that suffering arises from identification with a phenomenal model of the self, and that by transcending that identification, suffering no longer occurs:

The final approach … targets the phenomenology of identification with the phenomenal self model (PSM) as an antidote to suffering. … Metzinger [2018] describes the unit of identification (UI) as that which the system consciously identifies itself with. Ordinarily, when the PSM is transparent, the system identifies with its PSM, and is thus conscious of itself as a self. But it is at least a logical possibility that the UI may not be limited to the PSM, but be shifted to the most “general phenomenal property” [Metzinger, 2017] of knowing common to all phenomenality including the sense of self. In this special condition, the typical subject-object duality of experience would dissolve; negatively valenced experiences could still occur, but they would not amount to suffering because the system would no longer be experientially subject to them.

— Agarwal & Edelman, Functionally Effective Conscious AI Without Suffering (2020)

No doubt, this is an imprecise — and likely contentious — definition of “suffering”, one which affords a very particular solution due to the way it is defined. But at the very least, the paper makes a valiant attempt towards formalizing, computationally, what suffering even might be. If we want to avoid creating machines that suffer, more research like this needs to be conducted, and we might do well to pay attention to Buddhist and related philosophies in the process.


With that, I’ll end my whirlwind tour of non-Western philosophy, and offer some key takeaways and steps forward.

What I hope to have shown with this talk is that AI alignment research has drawn from a relatively narrow set of philosophical perspectives. Expanding this set, for example, with non-Western philosophy, could provide fresh insights, and reduce the risk of misalignment.

In order to address this, I’d like to suggest that prospective researchers and funders in AI alignment should consider a wider range of disciplines and approaches. In addition, while support for alignment research has grown in CS departments, we may need to increase support in other fields, in order to foster the interdisciplinary expertise needed for this daunting challenge.

If you enjoyed this talk, and would like to learn more about AI alignment, pluralism, or non-Western philosophy, here are some reading recommendations. Thank you for your attention, and I look forward to your questions.