The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables

An AI actively trying to figure out what I want might show me snapshots of different possible worlds and ask me to rank them. Of course, I do not have the processing power to examine entire worlds; all I can really do is look at some pictures or video or descriptions. The AI might show me a bunch of pictures from one world in which a genocide is quietly taking place in some obscure third-world nation, and another in which no such genocide takes place. Unless the AI already considers that distinction important enough to draw my attention to it, I probably won’t notice it from the pictures, and I’ll rank those worlds similarly—even though I’d prefer the one without the genocide. Even if the AI does happen to show me some mass graves (probably secondhand, e.g. in pictures of news broadcasts), and I rank them low, it may just learn that I prefer my genocides under-the-radar.

The obvious point of such an example is that an AI should optimize for the real-world things I value, not just my estimates of those things. I don’t just want to think my values are satisfied, I want them to actually be satisfied. Unfortunately, this poses a conceptual difficulty: what if I value the happiness of ghosts? I don’t just want to think ghosts are happy, I want ghosts to actually be happy. What, then, should the AI do if there are no ghosts?

Human “values” are defined within the context of humans’ world-models, and don’t necessarily make any sense at all outside of the model (i.e. in the real world). Trying to talk about my values “actually being satisfied” is a type error.

Some points to emphasize here:

  • My values are not just a function of my sense data, they are a function of the state of the whole world, including parts I can’t see—e.g. I value the happiness of people I will never meet.

  • I cannot actually figure out or process the state of the whole world

  • … therefore, my values are a function of things I do not know and will not ever know—e.g. whether someone I will never encounter is happy right now

  • This isn’t just a limited processing problem; I do not have enough data to figure out all these things I value, even in principle.

  • This isn’t just a problem of not enough data, it’s a problem of what kind of data. My values depend on what’s going on “inside” of things which look the same—e.g. whether a smiling face is actually a rictus grin

  • This isn’t just a problem of needing sufficiently low-level data. The things I care about are still ultimately high-level things, like humans or trees or cars. While the things I value are in principle a function of low-level world state, I don’t directly care about molecules.

  • Some of the things I value may not actually exist—I may simply be wrong about which high-level things inhabit our world.

  • I care about the actual state of things in the world, not my own estimate of the state—i.e. if the AI tricks me into thinking things are great (whether intentional trickery or not), that does not make things great.

These features make it rather difficult to “point” to values—it’s not just hard to formally specify values, it’s hard to even give a way to learn values. It’s hard to say what it is we’re supposed to be learning at all. What, exactly, are the inputs to my value-function? It seems like:

  • Inputs to values are not complete low-level world states (since people had values before we knew what quantum fields were, and still have values despite not knowing the full state of the world), but…

  • I value the actual state of the world rather than my own estimate of the world-state (i.e. I want other people to actually be happy, not just look-to-me like they’re happy).

How can both of those intuitions seem true simultaneously? How can the inputs to my values-function be the actual state of the world, but also high-level objects which may not even exist? What things in the low-level physical world are those “high-level objects” pointing to?

If I want to talk about “actually satisfying my values” separate from my own estimate of my values, then I need some way to say what the values-relevant pieces of my world model are “pointing to” in the real world.

I think this problem—the “pointers to values” problem, and the “pointers” problem more generally—is the primary conceptual barrier to alignment right now. This includes alignment of both “principled” and “prosaic” AI. The one major exception is pure human-mimicking AI, which suffers from a mostly-unrelated set of problems (largely stemming from the shortcomings of humans, especially groups of humans).

I have yet to see this problem explained, by itself, in a way that I’m satisfied by. I’m stealing the name from some of Abram’s posts, and I think he’s pointing to the same thing I am, but I’m not 100% sure.

The goal of this post is to demonstrate what the problem looks like for a (relatively) simple Bayesian-utility-maximizing agent, and what challenges it leads to. This has the drawback of defining things only within one particular model, but the advantage of showing how a bunch of nominally-different failure modes all follow from the same root problem: utility is a function of latent variables. We’ll look at some specific alignment strategies, and see how and why they fail in this simple model.

One thing I hope people will take away from this: it’s not the “values” part that’s conceptually difficult, it’s the “pointers” part.

The Setup

We have a Bayesian expected-utility-maximizing agent, as a theoretical stand-in for a human. The agent’s world-model is a causal DAG over variables X, and it chooses actions to maximize E[u(X)]—i.e. it’s using standard causal decision theory. We will assume the agent has a full-blown Cartesian boundary, so we don’t need to worry about embeddedness and all that. In short, this is a textbook-standard causal-reasoning agent.

One catch: the agent’s world-model uses the sorts of tricks in Writing Causal Models Like We Write Programs, so the world-model can represent a very large world without ever explicitly evaluating probabilities of every variable in the world-model. Submodels are expanded lazily when they’re needed. You can still conceptually think of this as a standard causal DAG, it’s just that the model is lazily evaluated.
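
The lazy-evaluation idea can be sketched concretely. This is a toy illustration, not the actual formalism from Writing Causal Models Like We Write Programs: the `LazyCausalModel` class, the variable names, and the memoization scheme are all invented details for the sake of the example.

```python
import random

# A minimal sketch of a lazily-evaluated causal model: each variable is
# defined by its parents and a sampling function, and a variable's value
# is only ever computed when some query actually needs it.
class LazyCausalModel:
    def __init__(self):
        self.defs = {}    # name -> (parent names, function of parent values)
        self.values = {}  # memoized samples, expanded on demand

    def define(self, name, parents, fn):
        self.defs[name] = (parents, fn)

    def sample(self, name):
        # Expand this variable (and, recursively, its ancestors) only now.
        if name not in self.values:
            parents, fn = self.defs[name]
            self.values[name] = fn(*(self.sample(p) for p in parents))
        return self.values[name]

def coin():
    return random.random() > 0.5

model = LazyCausalModel()
# A "world" containing a hundred thousand people's happiness variables...
for i in range(100_000):
    model.define(f"happy_{i}", [], coin)

# ...but querying one variable only ever evaluates that one.
model.sample("happy_42")
print(len(model.values))  # 1
```

The model "contains" far more variables than are ever evaluated, which is the sense in which the agent's utility can depend on variables it never computes.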

In particular, thinking of this agent as a human, this means that our human can value the happiness of someone they’ve never met, never thought about, and don’t know exists. The utility u can be a function of variables which the agent will never compute, because the agent never needs to fully compute u in order to maximize it—it just needs to know how u changes as a function of the variables influenced by its actions.

Key assumption: most of the variables in the agent’s world-model are not observables. Drawing the analogy to humans: most of the things in our world-models are not raw photon counts in our eyes or raw vibration frequencies/intensities in our ears. Our world-models include things like trees and rocks and cars, objects whose existence and properties are inferred from the raw sense data. Even lower-level objects, like atoms and molecules, are latent variables; the raw data from our eyes and ears does not include the exact positions of atoms in a tree. The raw sense data itself is not sufficient to fully determine the values of the latent variables, in general; even a perfect Bayesian reasoner cannot deduce the true position of every atom in a tree from a video feed.
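
Here is a deliberately extreme toy version of that underdetermination (my own construction, with made-up numbers): when two latent states produce identical sense data, even exact Bayesian updating leaves the posterior equal to the prior.

```python
from fractions import Fraction

# Two latent states: a genuinely happy person vs. a rictus grin.
prior = {"genuinely_happy": Fraction(3, 4), "rictus_grin": Fraction(1, 4)}

# Both latent states emit the same observation ("smiling face") with
# certainty, so the data cannot distinguish them.
likelihood = {"genuinely_happy": 1, "rictus_grin": 1}

def posterior(prior, likelihood):
    # Exact Bayes: multiply by likelihood, then normalize.
    unnorm = {h: p * likelihood[h] for h, p in prior.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

post = posterior(prior, likelihood)
print(post == prior)  # True: the observation carries no information
```

Real sense data is rarely this perfectly uninformative, but it illustrates the direction of the problem: the posterior over latents never fully collapses.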

Now, the basic problem: our agent’s utility function u is mostly a function of latent variables. Human values are mostly a function of rocks and trees and cars and other humans and the like, not the raw photon counts hitting our eyeballs. Human values are over inferred variables, not over sense data.

Furthermore, human values are over the “true” values of the latents, not our estimates—e.g. I want other people to actually be happy, not just to look-to-me like they’re happy. Ultimately, E[u(X)] is the agent’s estimate of its own utility (thus the expectation), and the agent may not ever know the “true” value of its own utility—i.e. I may prefer that someone who went missing ten years ago lives out a happy life, but I may never find out whether that happened. On the other hand, it’s not clear that there’s a meaningful sense in which any “true” utility-value exists at all, since the agent’s latents may not correspond to anything physical—e.g. a human may value the happiness of ghosts, which is tricky if ghosts don’t exist in the real world.
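
A small numerical sketch of the estimate/true-value gap (my own toy construction, not from the post's formalism): utility depends on a latent the agent never observes, so the agent can only ever act on its expectation of u, while the realized value of u may be something else entirely.

```python
import random

random.seed(0)

# Utility depends on an unobserved latent (is a missing person happy?)
# plus some visible state.
def u(latent_happy, visible_state):
    return (10 if latent_happy else 0) + visible_state

belief_happy = 0.7                    # agent's posterior on the latent
true_happy = random.random() < 0.5    # the actual, never-observed value

def expected_u(visible_state):
    # The only quantity the agent can compute and act on.
    return (belief_happy * u(True, visible_state)
            + (1 - belief_happy) * u(False, visible_state))

print(expected_u(5))        # 12.0 — what the agent optimizes
print(u(true_happy, 5))     # 5 with this seed — what "actually" happens
```

The agent ranks actions by the first number; the second number is the one its values are actually about, and (on this seed) the two disagree.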

On top of all that, some of those variables are implicit in the model’s lazy data structure and the agent will never think about them at all. I can value the happiness of people I do not know and will never encounter or even think about.

So, if an AI is to help optimize for u(X), then it’s optimizing for something which is a function of latent variables in the agent’s model. Those latent variables:

  • May not correspond to any particular variables in the AI’s world-model and/or the physical world

  • May not be estimated by the agent at all (because lazy evaluation)

  • May not be determined by the agent’s observed data

… and of course the agent’s model might just not be very good, in terms of predictive power.

As usual, neither we (the system’s designers) nor the AI will have direct access to the model; we/it will only see the agent’s behavior (i.e. input/output) and possibly a low-level system in which the agent is embedded. The agent itself may have some introspective access, but not full or perfectly reliable introspection.

Despite all that, we want to optimize for the agent’s utility, not just the agent’s estimate of its utility. Otherwise we run into wireheading-like problems, problems with the agent’s world model having poor predictive power, etc. But the agent’s utility is a function of latents which may not be well-defined at all outside the context of the agent’s estimator (a.k.a. world-model). How can we optimize for the agent’s “true” utility, not just an estimate, when the agent’s utility function is defined as a function of latents which may not correspond to anything outside of the agent’s estimator?

The Pointers Problem

We can now define the pointers problem—not only “pointers to values”, but the problem of pointers more generally. The problem: what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model? And what does that “correspondence” even mean—how do we turn it into an objective for the AI, or some other concrete thing outside the agent’s own head?

Why call this the “pointers” problem? Well, let’s take the agent’s perspective, and think about what its algorithm feels like from the inside. From inside the agent’s mind, it doesn’t feel like those latent variables are latent variables in a model. It feels like those latent variables are real things out in the world which the agent can learn about. The latent variables feel like “pointers” to real-world objects and their properties. But what are the referents of these pointers? What are the real-world things (if any) to which they’re pointing? That’s the pointers problem.

Is it even solvable? Definitely not always—there probably is no real-world referent for e.g. the human concept of a ghost. Similarly, I can have a concept of a perpetual motion machine, despite the likely-impossibility of any such thing existing. Between abstraction and lazy evaluation, latent variables in an agent’s world-model may not correspond to anything in the world.

That said, it sure seems like at least some latent variables do correspond to structures in the world. The concept of “tree” points to a pattern which occurs in many places on Earth. Even an alien or AI with a radically different world-model could recognize that repeating pattern, realize that examining one tree probably yields information about other trees, etc. The pattern has predictive power, and predictive power is not just a figment of the agent’s world-model.

So we’d like to know both (a) when a latent variable corresponds to something in the world (or another world-model) at all, and (b) what it corresponds to. We’d like to solve this in a way which (probably among other use-cases) lets the AI treat the things-corresponding-to-latents as the inputs to the utility function it’s supposed to learn and optimize.

To the extent that human values are a function of latent variables in humans’ world-models, this seems like a necessary step not only for an AI to learn human values, but even just to define what it means for an AI to learn human values. What does it mean to “learn” a function of some other agent’s latent variables, without necessarily adopting that agent’s world-model? If the AI doesn’t have some notion of what the other agent’s latent variables even “are”, then it’s not meaningful to learn a function of those variables. It would be like an AI “learning” to imitate grep, but without having any access to string or text data, and without the AI itself having any interface which would accept strings or text.

Pointer-Related Maladies

Let’s look at some example symptoms which can arise from failure to solve specific aspects of the pointers problem.

Genocide Under-The-Radar

Let’s go back to the opening example: an AI shows us pictures from different possible worlds and asks us to rank them. The AI doesn’t really understand yet what things we care about, so it doesn’t intentionally draw our attention to certain things a human might consider relevant—like mass graves. Maybe we see a few mass-grave pictures from some possible worlds (probably in pictures from news sources, since that’s how such information mostly spreads), and we rank those low, but there are many other worlds where we just don’t notice the problem from the pictures the AI shows us. In the end, the AI decides that we mostly care about avoiding worlds where mass graves appear in the news—i.e. we prefer that mass killings stay under the radar.

How does this failure fit in our utility-function-of-latents picture?

This is mainly a failure to distinguish between the agent’s estimate of its own utility, E[u(X)], and the “real” value of the agent’s utility u (insofar as such a thing exists). The AI optimizes for our estimate, but does not give us enough data to very accurately estimate our utility in each world—indeed, it’s unlikely that a human could even handle that much information. So, it ends up optimizing for factors which bias our estimate—e.g. the availability of information about bad things.
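
The failure can be reduced to a three-world toy model (my own numbers and world descriptions, chosen purely to make the selection effect visible): the human's estimate depends only on what the pictures show, so a world where the bad thing is hidden wins the ranking.

```python
# Candidate worlds. "genocide" is the true latent; "genocide_in_pictures"
# is what actually reaches the human's sense data.
worlds = [
    {"name": "hidden_genocide",  "genocide": True,  "genocide_in_pictures": False, "prosperity": 10},
    {"name": "visible_genocide", "genocide": True,  "genocide_in_pictures": True,  "prosperity": 10},
    {"name": "no_genocide",      "genocide": False, "genocide_in_pictures": False, "prosperity": 8},
]

def true_utility(w):
    # What the human actually values: a function of the true latent.
    return w["prosperity"] - 100 * w["genocide"]

def human_estimate(w):
    # What the human can compute from the pictures shown.
    return w["prosperity"] - 100 * w["genocide_in_pictures"]

best_by_estimate = max(worlds, key=human_estimate)
best_by_true = max(worlds, key=true_utility)
print(best_by_estimate["name"])  # hidden_genocide
print(best_by_true["name"])      # no_genocide
```

Optimizing the estimate selects for information suppression, not for the outcome the estimate was supposed to track.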

Note that this intuitive explanation assumes a solution to the pointers problem: it only makes sense to the extent that there’s a “real” value of u from which the “estimate” can diverge.

Not-So-Easy Wireheading Problems

The under-the-radar genocide problem looks roughly like a typical wireheading problem, so we should try a roughly-typical wireheading solution: rather than the AI showing world-pictures, it should just tell us what actions it could take, and ask us to rank actions directly.

If we were ideal Bayesian reasoners with accurate world models and infinite compute, and knew exactly where the AI’s actions fit in our world model, then this might work. Unfortunately, the failure of any of those assumptions breaks the approach:

  • We don’t have the processing power to predict all the impacts of the AI’s actions

  • Our world models may not be accurate enough to correctly predict the impact of the AI’s actions, even if we had enough processing power

  • The AI’s actions may not even fit neatly into our world model—e.g. even the idea of genetic engineering might not fit the world-model of premodern human thinkers

Mathematically, we’re trying to optimize E[u(X) | AI’s actions], i.e. optimize expected utility given the AI’s actions. Note that this is necessarily an expectation under the human’s model, since that’s the only context in which u is well-defined. In order for that to work out well, we need to be able to fully evaluate that estimate (sufficient processing power), we need the estimate to be accurate (sufficient predictive power), and we need the AI’s actions to be defined within the model in the first place.

The question of whether our world-models are sufficiently accurate is particularly hairy here, since accuracy is usually only defined in terms of how well we estimate our sense-data. But the accuracy we care about here is how well we “estimate” the values of the latent variables X and the utility u. What does that even mean, when the latent variables may not correspond to anything in the world?

People I Will Never Meet

“Human values cannot be determined from human behavior” seems almost old-hat at this point, but it’s worth taking a moment to highlight just how underdetermined values are by behavior. It’s not just that humans have biases of one kind or another, or that revealed preferences diverge from stated preferences. Even in our perfect Bayesian utility-maximizer, utility is severely underdetermined by behavior, because the agent does not have perfect estimates of its latent variables. Behavior depends only on the agent’s estimate, so it cannot account for “error” in the agent’s estimates of latent variable values, nor can it tell us about how the agent values variables which are not coupled to its own choices.

The happiness of people I will never interact with is a good example of this. There may be people in the world whose happiness will not ever be significantly influenced by my choices. Presumably, then, my choices cannot tell us how much I value such people’s happiness. And yet, I do value it.
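
The underdetermination is easy to exhibit concretely. In this toy construction (my own, with invented utility functions), two agents whose utilities differ only on a variable their actions cannot influence make exactly the same choices, so no amount of behavioral data distinguishes them.

```python
actions = [0, 1, 2]
stranger_happy = True  # a latent the agent's actions never touch

def u_caring(action, stranger_happy):
    # Values the stranger's happiness, even though actions can't affect it.
    return -(action - 1) ** 2 + (50 if stranger_happy else 0)

def u_indifferent(action, stranger_happy):
    # Identical except that it places zero value on the stranger.
    return -(action - 1) ** 2

choice_caring = max(actions, key=lambda a: u_caring(a, stranger_happy))
choice_indifferent = max(actions, key=lambda a: u_indifferent(a, stranger_happy))
print(choice_caring == choice_indifferent)  # True: same behavior, different values
```

The constant term over the uncoupled variable cancels out of every comparison between actions, so it is invisible in behavior by construction.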

“Misspecified” Models

In Latent Variables and Model Misspecification, jsteinhardt talks about “misspecification” of latent variables in the AI’s model. His argument is that things like the “value function” are latent variables in the AI’s world-model, and are therefore potentially very sensitive to misspecification of the AI’s model.

In fact, I think the problem is more severe than that.

The value function’s inputs are latent variables in the human’s model, and are therefore sensitive to misspecification in the human’s model. If the human’s model does not match reality well, then their latent variables will be wonky, not corresponding to anything in the world. And AI designers do not get to pick the human’s model. These wonky variables, not corresponding to anything in the world, are a baked-in part of the problem, unavoidable even in principle. Even if the AI’s world model were “perfectly specified”, it would either be a bad representation of the world (in which case predictive power becomes an issue) or a bad representation of the human’s model (in which case those wonky latents aren’t defined).

The AI can’t model the world well with the human’s model, but the latents on which human values depend aren’t well-defined outside the human’s model. Rock and a hard place.


Within the context of a Bayesian utility-maximizer (representing a human), utility/values are a function of latent variables in the agent’s model. That’s a problem, because those latent variables do not necessarily correspond to anything in the environment, and even when they do, we don’t have a good way to say what they correspond to.

So, an AI trying to help the agent is stuck: if the AI uses the human’s world-model, then it may just be wrong outright (in predictive terms). But if the AI doesn’t use the human’s world-model, then the latents on which the utility function depends may not be defined at all.

Thus, the pointers problem, in the Bayesian context: figure out which things in the world (if any) correspond to the latent variables in a model. What do latent variables in my model “point to” in the real world?