The Preference Fulfillment Hypothesis

Short version

Humans have an innate motivation (“preference fulfillment”, PF) to fulfill the preferences of those they care about. It corresponds to at least some of the senses of the word “love”, as well as related words such as “kindness” and “compassion”.

I hypothesize that it works by simulating the other person and predicting what they would want or how they would like to be treated. PF is when you take your simulation of what other people would want and add an extra component that makes you intrinsically value outcomes that your simulation predicts the other people would prefer.

I also hypothesize that this is the same kind of simulation that forms our ability to work as a social species in the first place. A mental simulation process is active in virtually every situation where we interact with other people, such as in a grocery store. People use masks/​roles/​simulations to determine the right behavior in any social situation, running simulations of how others would react to various behaviors. These simulations involve both the actual people present in the situation as well as various other people whose opinions we’ve internalized and care about. The simulations generally allow people to engage in interactions by acting the way a normal person would in a given situation.

Once you have this kind of a simulation, constantly running in basically any social situation, it’s likely already exhibiting the PF drive to a weak degree. Doing things that we expect to fulfill other people’s preferences often feels intrinsically nice, even if the person in question was a total stranger. So does wordless coordination in general, as evidenced by the popularity of things like dance.

If this is true, capabilities progress may then be closely linked to alignment progress. Getting AIs to be better at following instructions requires them to simulate humans better. Once you have an AI that can simulate human preferences, you already have most of the machinery required for having PF as an intrinsic drive. This is contrary to the position that niceness is unnatural. The preference fulfillment hypothesis is that niceness/​PF is a natural kind that will be relatively easy to get out of any AI smart enough to understand what humans want it to do. This implies that constructing aligned AIs might be reasonably easy, in the sense that most of the work necessary for it will be a natural part of progress in capabilities.

Long version

The preference fulfillment hypothesis

Imagine someone who you genuinely care about. You probably have some kind of a desire to fulfill their preferences in the kind of way that they would like their preferences to be fulfilled.

It might be very simple (“I like chocolate but they like vanilla, so I would prefer for them to get vanilla ice cream even when I prefer chocolate”), but it might get deep into pretty fundamental differences in preferences and values (“I’m deeply monogamous and me ever being anything else would go against my sacred value, but clearly non-monogamy is what works for my friend and makes them happy so I want them to continue living that way”).

It’s not necessarily absolute—some things you might still find really upsetting and you’d still want to override the other person’s preferences in some cases—but you can at least feel the “I want them to satisfy their preferences the way they themselves would like their preferences to be satisfied” thing to some extent.

I think this kind of desire is something like its own distinct motivation in the human mind. It can easily be suppressed by other kinds of motivations kicking in—e.g. if the other person getting what they wanted made you feel jealous or insecure, or if their preferences involved actively hurting you. But if those other motivations aren’t blocking it, it can easily bubble up. Helping other people often just feels intrinsically good, even if you know for sure that you yourself will never get any benefit out of it (e.g. holding a door open for a perfect stranger in a city you’re visiting and will probably never come back to).

The motivation seems to work by something like simulating the other person based on what you know of them (or people in general), and predicting what they would want in various situations. This is similar to how “shoulder advisors” are predictive models that simulate what someone you know would react in a particular situation, and also somewhat similar to how large language models simulate the way a human would continue a piece of writing. The thought of the (simulated/​actual) person getting what they want (or just existing in the first place) then comes to be experienced as intrinsically pleasing.

A friend of mine collects ball-jointed dolls (or at least used to); I don’t particularly care about them, but I like the thought of my friend collecting them and having them on display, because I know it’s important for my friend. If I hear about my friend getting a new doll, then my mental simulation of her predicts that she will enjoy it, and that simulated outcome makes me happy. If I were to see some doll that I thought she might like, I would enjoy letting her know, because my simulation of her would appreciate finding out about that doll.

If I now think of her spending time with her hobby and finding it rewarding, then I feel happy about that. Basically, I’m running a mental simulation of what I think she’s doing, and that simulation makes me happy.

While I don’t know exactly how, this algorithm seems corrigible. If it turned out that my friend had lost her interest in ball-jointed dolls, then I’d like to know that so that I could better fulfill her preferences.

The kinds of normal people who aren’t on Less Wrong inventing needlessly convoluted technical-sounding ways of expressing everyday concepts would probably call this thing “love” or “caring”. And genuine love (towards a romantic partner, close friend, or child/​parent) definitely involves experiencing what I have just described. Terms such as kindness and compassion are also closely related. To avoid bringing in possibly unwanted connotations from those common terms, I’ll call this thing “preference fulfillment” or PF for short.

Preference fulfillment: a motivational drive that simulates the preferences of other people (or animals) and associates a positive reward with the thought of them getting their preferences fulfilled. Also associates a positive reward with thought of them merely existing.

I hypothesize that PF (or the common sense of the word “love”) is merely adding one additional piece (the one that makes you care about the simulations) to an underlying prediction and simulation machinery that is already there and exists for making social interaction and coordination possible in the first place.

Cooperation requires simulation

In this section, I’ll say a few words about why running these kinds of simulations of other people seems to be a prerequisite for any kind of coordination we do daily.

Under the “virtual bargaining” model of cooperation, people coordinate without communication by behaving on the basis of what they would agree to do if they were explicitly to bargain, provided the agreement that would arise from such discussion is commonly known.

A simple example is that of two people carrying a table across the room: who should grab which end of the table? Normally, the natural solution is for each to grab the side that minimizes the joint distance moved (see picture). However, if one of the people happens to be a despot and the other a servant, then the natural solution is for the despot to grab the end that’s closest to them, forcing the servant to walk the longer distance.

This kind of coordination tends can happen automatically and wordlessly as long as we have some model of the other person’s preferences. Mutual simulation is also still required even if the slave and the despot hate each other—in order to not get punished for being a bad servant, the servant still needs to simulate the despot’s desires. And the despot needs to simulate the servant’s preferences in order to know what the servant will do in different situations.

I think this kind of a mental simulation is on some level active in basically every situation where we interact with other people. If you are in a grocery store, you know not to suddenly take off your clothes and start dancing in the middle of the store, because you know that the other people would stare at you and maybe call the police. You also know how you are expected to interact with the clerk, and the steps involved in the verbal dance of “hello how are you, yes that will be all, thank you, have a nice day”.

As a child, you also witnessed how adults acted in a store. You are probably also running some simulation of “how does a normal kind of a person (like my parents) act in a grocery store”, and intuitively trying to match that behavior. In contrast, if you’re suddenly put into a situation where you don’t have a good model of how to act (maybe in a foreign country where the store seems to act differently from how you’re used to) and can’t simulate the reactions of other people in advance, you may find yourself feeling anxious.

ChatGPT may be an alien entity wearing a human-like mask. Meanwhile, humans may be non-alien entities wearing person-like masks. It’s interesting to compare the Shoggoth-ChatGPT meme picture below, with Kevin Simler’s comic of personhood.

Kevin writes:

A person (as such) is a social fiction: an abstraction specifying the contract for an idealized interaction partner. Most of our institutions, even whole civilizations, are built to this interface — but fundamentally we are human beings, i.e., mere creatures. Some of us implement the person interface, but many of us (such as infants or the profoundly psychotic) don’t. Even the most ironclad person among us will find herself the occasional subject of an outburst or breakdown that reveals what a leaky abstraction her personhood really is.

And offers us this comic:

So for example, a customer in a grocery store will wear the “grocery store shopper” mask; the grocery store clerk will wear the “grocery store clerk” mask. That way, both will act the way that’s expected of them, rather than stripping their clothes off and doing a naked dance. And this act of wearing a mask seems to involve running a simulation of what “a typical grocery store person” would do and how other people would react to various behaviors in the store. We’re naturally wired to use these masks/​roles/​simulations to determine the right behavior in any social situation.

Some of the other people being simulated are the actual other people in the store, others are various people whose opinions you’ve internalized and care about. E.g. if you ever had someone shame you for a particular behavior, even when that person isn’t physically present, a part of your mind may be simulating that person as an “inner critic” who will virtually shame you for the thought of any such behavior.

And even though people do constantly misunderstand each other, we don’t usually descend to Outcome Pump levels of misunderstanding (where you ask me to get your mother out of a burning building and I blow up the building so that she gets out but is also killed in the process, because you never specified that you wanted her to get out alive). The much more common scenario are countless of minor interactions of the type where people just go to a grocery store and act the way a normal grocery store shopper would, or where two people glance at a table that needs to be carried and wordlessly know who should grab which end.

Preference fulfillment may be natural

PF then, is when you take your already-existing simulation of what other people would want, and just add a bit of an extra component that makes you intrinsically value those people getting what your simulation says they want. In the grocery store, it’s possible that you’re just trying to fulfill the preferences of others because you think you’d be shamed if you didn’t. But if you genuinely care about someone, then you actually intrinsically care about seeing their preferences fulfilled.

Of course, it’s also possible to genuinely care about other people in a grocery store (as well as to be afraid of a loved one shaming you). In fact, correctly performing a social role can feel enjoyable by itself.

Even when you don’t feel like you love someone in the traditional sense of the word, some of the PF drive seems to be active in most social situations. Wordlessly coordinating on things like how to carry the table or how to move can feel intrinsically satisfying, assuming that there are no negative feelings such as fear blocking the sastisfaction. (At least in my personal experience, and also evidenced by the appeal of activities such as dance.)

The thesis that PF involves simulating others + intrinsically valuing the satisfaction of their preferences stands in contrast with models such as the one in “Niceness is Unnatural”, which holds that

the specific way that the niceness/​kindness/​compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment.

The preference fulfillment hypothesis is that the exact details of when niceness/​kindness/​compassion/​love/​PF is allowed to express itself is indeed very contingent on the specifics of our ancestral environment. That is, our brains have lots of complicated rules for when to experience PF towards other people, and when to feel hate/​envy/​jealousy/​fear/​submission/​dominance/​transactionality/​etc. instead, and the details of those rules are indeed shaped by the exact details of our evolutionary history. It’s also true that the specific social roles that we take are very contingent on the exact details of our evolution and our culture.

But the motivation of PF (“niceness”) itself is simple and natural—if you have an intelligence that is capable of acting as a social animal and doing what other social animals ask from it, then it already has most of the machinery required for it to also implement PF as a fundamental intrinsic drive. If you can simulate others, you only need to add the component that intrinsically values those simulations getting what they want.

This implies that capabilities progress may be closely linked to alignment progress. Getting AIs to be better at following instructions requires them to simulate humans better, so as to understand what exactly would satisfy the preferences of the humans. The way we’re depicting large language models as shoggoths wearing a human mask, suggest that they are already starting to do so. While they may often “misunderstand” your intent, they already seem to be better at it than a pure Outcome Pump would be.

If the ability to simulate others in a way sufficient to coordinate with them forms most of the machinery required for PF, then capabilities progress might deliver most of the progress necessary for alignment.

Some kind of a desire to simulate and fulfill the desires of others seems to show up very early. Infants have it. Animals being trained have their learning accelerated once they figure out they’re being trained and start proactively trying to figure out what the trainer intends. Both point to these being simple and natural competencies.

Humans are often untrustworthy because of all the conflicting motivations and fears they’re running. (“If I feel insecure about my position and the other person seems likely to steal it, suppress love and fear/​envy/​hate them instead.”) However, an AI wouldn’t need to exhibit any of the evolutionary urges for backstabbing and the like. We could take the prediction + love machinery and make that the AI’s sole motivation (maybe supplemented by some other drives such as intrinsic curiosity to boost learning).

On the other hand

Of course, this does not solve all problems with alignment. A huge chunk of how humans simulate each other seems to make use of structural similarities. Or as Niceness is unnatural also notes:

It looks pretty plausible to me that humans model other human beings using the same architecture that they use to model themselves. This seems pretty plausible a-priori as an algorithmic shortcut — a human and its peers are both human, so machinery for self-modeling will also tend to be useful for modeling others — and also seems pretty plausible a-priori as a way for evolution to stumble into self-modeling in the first place (“we’ve already got a brain-modeler sitting around, thanks to all that effort we put into keeping track of tribal politics”).

Under this hypothesis, it’s plausibly pretty easy for imaginations of others’ pain to trigger pain in a human mind, because the other-models and the self-models are already in a very compatible format.

This seems true to me. On the other hand, LLMs are definitely running a very non-humanlike cognitive architecture, and seem to at least sometimes manage a decent simulation. People on the autistic spectrum may also have the experience of understanding other people better than neurotypicals do. The autistics had to compensate for their lack of “hardware-accelerated” intuitive social modeling by coming with explicit models of what drives the behavior of other people, until they got better at it than people who never needed to develop those models. And humans often seem to have significant differences [1, 2] in how their minds work, but still manage to model each other decently, especially if someone tells them about those differences so that they can update their models.

Another difficulty is that humans also seem to incorporate various ethical considerations into their model—e.g. we might feel okay with sometimes overriding the preferences of a young child or a mentally ill person, out of the assumption that their future self would endorse and be grateful for it. Many of these considerations seem strongly culturally contingent, and don’t seem to have objective answers.

And of course, even though humans are often pretty good at modeling each other, it’s also the case that they still frequently fail and mispredict what someone else would want. Just because you care about fulfilling another person’s preferences does not mean that you have omniscient access to them. (It does seem to make you corrigible with regard to fulfilling them, though.)

I sometimes see people suggesting things like “the main question is whether AI will kill everyone or not; compared to that, it’s pretty irrelevant which nation builds the AI first”. On the preference fulfillment model, it might be the other way around. Maybe it’s relatively easy to make an AI that doesn’t want to kill everyone, as long as you set it to fulfill the preferences of a particular existing person who doesn’t want to kill everyone. But maybe it’s also easy to anchor it into the preferences of one particular person or one particular group of people (possibly by some process analogous to how children initially anchor into the desires of their primary caregivers), without caring about the preferences of anyone else. In that case, it might impose the values of that small group on the world, where those values might be arbitrarily malevolent or just indifferent towards others.