Let Values Drift

I occasionally run across lines of reasoning that depend on or favor the position that value drift should be avoided.

I find the idea of value drift odd, let alone the idea that value drift is bad. My intuition is that value drift is good, if anything, since it represents an update of one’s values based on new evidence and greater time to compute reflective equilibrium. But rather than arguing intuition, let’s explore value drift a bit before we come to any stronger conclusions.

(Fair warning, this is going to get into some deep philosophical territory, be pretty unapologetic about it, and assume you are reading carefully enough to notice what I say and not what you think I said. I’m still working some of these ideas out myself, so I don’t have the fluency to provide a more accessible explanation right now. I also take some pretty big inferential jumps at times that you may not be on board with yet, so the later parts might feel like unjustified reasoning. I don’t think that’s the case, but you’ll have to poke at me to help me figure out how to fill in those gaps.

In spite of all those apologies, there are some key insights here, and I’m unlikely to get clearer unless I am first more opaque, so please bear with me, especially if you are interested in value as it relates to AI alignment.)

Whence drifting values?

The metaphor of drifting values is that your values start in one place and then gradually relocate to another, like flotsam. The waves of fortune, chance, and intention combine to determine where they end up on the seas of change. In this metaphor, values are discrete, identifiable things. Linguistically, they are nouns.

When we talk of values as nouns, we are talking about the values that people have, express, find, embrace, and so on. For example, a person might say that altruism is one of their values. But what would it mean to “have” altruism as a value, or for it to be one of one’s values? What is the thing being possessed in this case? Can you grab altruism and hold onto it, or find it in the mind cleanly separated from other thoughts? As best I can tell, no, unless, contrary to evidence and parsimony, something like Platonic idealism proves consistent with reality. So it seems a type error to say you possess altruism or any other value, since values are not things but habituations or patterns of action (more on this in the next section). It’s only because we use the metaphor of possession to mean something like habitual valuing that it can seem as if these patterns over our actions are things in their own right.

So what, you may think: it’s just a linguistic convention and doesn’t change what’s really going on. That’s both wrong and right. Yes, it’s a linguistic convention, and yes, you get on with valuing all the same no matter how you talk about it, but linguistic conventions shape our thoughts and limit our ability to express ourselves with the frames they provide. In the worst case, as I suspect often happens when people reason about value drift, we can focus so much on the convention that we forget what’s really going on and reason only about the abstraction, viz. mistake the map for the territory. And since we’ve just seen that the value-as-thing abstraction is leaky, because it implies the ability to possess that which cannot be possessed, it can lead us astray by letting us operate from a false assumption about how the world works, expecting it to function one way when it actually operates another.

To my ear, most talk about value drift is at least partially if not wholly confused by this mistaking of values for things, and mistaking them specifically for essences. But let’s suppose you don’t make this mistake; is value drift still sensible?

I think we can rehabilitate it, but to do that we’ll need a clearer understanding of “habitual valuing” and “patterns of action”.

Valuing valuing

If we tear away the idea that we might possess values, we are left with the act of valuing, and to value something is ultimately to judge it or assess its worth. While I can’t hope to fit all my philosophy into this paragraph, I consider valuing, judging, or assessing to be one of the fundamental operations of “conscious” things, it being the key input that powers the feedback loops that differentiate the “living” from the “dead”. For historical reasons we might call this feeling or sensation, and if you like control theory, “sensing” seems appropriate, since in a control system it is the sensor that measures the system and sends the signal to the controller. Promising modern theories suggest control theory is useful for modeling the human mind as a hierarchy of control systems that minimize prediction error while also maintaining homeostasis, and this matches one of the most detailed and longest-used theories of human psychology, so I feel justified in saying that the key, primitive action happening when we value something is that we sense or judge it to be good, neutral, or bad (or, if you prefer, more, same, or less).

We could get hung up on good, neutral, and bad, but let’s just understand them for now as relative terms in the sense of the brain as a control system, where “good” signals better prediction or otherwise moving towards a set point and “bad” signals worse prediction or moving away from a set point. Then in this model we could say that to value something is to sense it and send a signal out to the rest of the brain that it is good. Thus to “have a value” is to observe a pattern of action and sense that pattern to be good. To return to the example of valuing altruism, when a person who values altruism acts in a way that pattern-matches to altruism (maybe “benefits others” or something similar), the brain senses this pattern to be good and feeds that signal back into itself, further habituating actions that match the altruism pattern. It is this habituation that we are pointing to when we say we “have” a value.
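To make that feedback loop concrete, here is a minimal toy sketch in Python. It is only an illustration of the model sketched above, not a claim about how brains implement it: the two actions, the “benefits others” pattern, the set point, and the update sizes are all assumptions made up for the example.

```python
import random

# Toy model: "having a value" as a habituation maintained by a sensing feedback loop.
# The actions, the "benefits_others" pattern, and all numbers are illustrative assumptions.

ACTIONS = {
    "share_resources": {"benefits_others": True},
    "hoard_resources": {"benefits_others": False},
}

def sense(features, set_point=True):
    """Signal +1 (good), 0 (neutral), or -1 (bad) relative to a set point."""
    match = features.get("benefits_others")
    if match is None:
        return 0
    return 1 if match == set_point else -1

# Habit strengths start equal: no pattern dominates, so no "value" is visible yet.
habits = {name: 1.0 for name in ACTIONS}

def choose_action():
    """Pick an action with probability proportional to its habit strength."""
    names = list(habits)
    return random.choices(names, weights=[habits[n] for n in names])[0]

for _ in range(200):
    action = choose_action()
    signal = sense(ACTIONS[action])                           # sensing: good, neutral, or bad?
    habits[action] = max(0.1, habits[action] + 0.1 * signal)  # feedback habituates the pattern

print(habits)
```

After a couple hundred iterations the weight on “share_resources” dominates, and an observer summarizing the agent’s behavior might say it “has” altruism as a value, even though nothing in the loop stores any such object.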

Aside: How any individual comes to sense any particular pattern, like altruism, to be good, neutral, or bad is an interesting topic in and of itself, but we don’t need that particular gear to continue discussing value drift, so this is where the model bottoms out for this post.

We can now understand value drift to mean changes in habituations or patterns of action over time. I realize some of my readers will throw their hands up at this point and say “why did we have to go through all that just to get back to where we started?!”, but the point was to unpack value drift so we can understand it as it is, not as we think it is. And as will become clear in the following analysis, that unpacking is key to understanding why value drift seems, to me, an odd thing to worry about.

Values adrift

My explanation of valuing implies that values-as-things are after-the-fact reifications drawn from the observation of accumulated effects of individual actions, and as such values cannot themselves directly drift, because they are downstream of where change happens. The changes that will befall these reifications we call “values” happen moment to moment, action to action, where each particular action taken will only later be aggregated to form a pattern that can be expressed as a value, and even then that value exists only by virtue of ontology, because it is an inference from observation. Thus when values “drift” it’s about as meaningful as saying the drawings of continents “drift” over geological time: it’s sort of true, but only meaningful so long as understanding remains firmly grounded in the phenomena being pointed to, and unlike maps of geography, maps of mind are more easily confused for mind itself.
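One way to see the “after-the-fact reification” point is a toy sketch like the one below (again Python, with a made-up action log and an arbitrary labelling rule, purely as an assumption for illustration): the “value” never appears in the actions themselves; it is only a summary label inferred later from the log, and it “drifts” only in the sense that the summary changes when the log does.

```python
from collections import Counter

# Toy illustration: a "value" as a post-hoc label inferred from a log of actions.
# The log entries and the 30% threshold are made up for illustration only.

action_log = [
    "donated", "volunteered", "donated", "worked_overtime",
    "volunteered", "donated", "watched_tv",
]

def infer_values(log, threshold=0.3):
    """Label as a "value" any kind of action making up a large share of the log."""
    counts = Counter(log)
    total = len(log)
    return {action for action, n in counts.items() if n / total >= threshold}

print(infer_values(action_log))
# {'donated'} -- the "value" lives only in this summary of past actions;
# appending different actions to the log later is all that "value drift" can mean here.
```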

What instead drifts or changes are actions, although saying they drift or change is fraught, because it supposes some stable viewpoint from which to observe the change. Yet actions, via the preferences that cause us to choose any particular action over all others, are continuously dependent on the conditions in which they arise, because what we sense (value, judge, assess) is conditional on the entire context in which we do the sensing. So it is only outside the moment, whether before or after, that we judge change, and so change is also ontologically bound, such that we can find no change if we look without ontology. In this sense change and drift in actions and patterns of action exist but are not real: they are in the map, but not the base territory.

Does that matter? I think it does, because we can be confused about ontology, confusion can only arise via ontology, and sensing/valuing is very near the root of ontology generation, so our understanding of what it means to value is mostly contaminated by valuing itself! Certainly by the time we put words to our thoughts we have already sensed and passed judgement on many phenomena, which means that when we talk about value drift we’re talking from a motivated stance in which valuation heavily shaped our perspectives. So I find it not at all odd that valuing would find a way to make itself and its products stable points within concept space, such that it would feel natural to worry that they might drift, and that drifting and change in values would evaporate without sensing feedback loops to prop them up!

This is not to anthropomorphize valuing, but to point out the way it is prior to everything else and self-incentivized to magnify its existence; it’s like a subagent carrying out its own goals regardless of yours, and it’s so good at it that it’s shaped your goals before you even knew you had them. And when we strip away everything posterior to valuing, we find no mechanism by which value can change, because we can’t even conceptualize change at that point, so we are left with valuing as a pure, momentary act that cannot drift or change because it has no frame to drift or change within. So when I say value drift is odd to me, this is what I mean: it exists as a function of valuing, not of valuing itself, and we can find no place where value change occurs that is not tainted by the evaluations of sensing.

(Careful readers will note this is analogous to the epistemological problem that necessitates a leap of faith when knowledge is understood ontologically.)

Yikes! So what do we do?

Steady on

The questions that motivate this investigation are ones like “how do we protect effective altruists (EAs) from value drift so that they remain altruistic later in life and don’t revert to the mean?” and “how do we align superintelligent AI with human values such that it stays aligned with human values even as it thinks longer and more deeply than any human could?”. Even if I lost you in the previous section (and I’m a little bit lost in my own reasoning, if I’m totally honest), how can we cash out all this philosophy into information relevant to these questions?

In the case of drifting EAs, I say let them drift. They value EA because conditions in their lives caused them to value it, and if those conditions change, so be it. Most people lack the agency to stay firm in the face of changing conditions; I think this is mostly a safety mechanism to protect them from overcommitting when they aren’t epistemically mature enough to know what they’re doing, and for every EA lost to this there will likely be another EA gained, so we don’t have to worry about it much other than to deal with churn effects on the least committed members of the movement. To do otherwise is to be inconsistent about respecting meta-preferences, assuming you think we should respect people’s meta-preferences, in this case specifically the meta-preference for autonomy of beliefs and actions. Just as you would probably find it troubling to find racists or fascists or some other outgroup working on incentives to keep people racist or fascist in the face of evidence that they should change, you should find it troubling that we would seek to manipulate incentives such that people are more likely to continue to hold EA beliefs in the face of contrary evidence.

Most of this argument is beside my main point, that value drift is a subtly motivated framing to keep values stable, propagated by the very feedback processes that use sense signals as input with no prior manifestation to fall back on, but you might be able to see the deep veins of it running through. More directly relevant to this question are probably things like “Yes Requires the Possibility of No”, “Fundamental Value Differences are not that Fundamental”, “Archipelago”, and much about meta-consistency in ethics that’s not salient to me at this time.

On the question of AI alignment, this suggests concerns about value drift are at least partially about confusion over values and partially fear born of a desire for value self-preservation. That is, a preference to avoid value drift in superintelligent AIs may not be a principled stance, or may be principled but grounded in fear of change and nothing more. This is not to say we humans would be happy with any sense experiences, only that we are biased and anchored on our current sensing (valuing) when we think about how we might sense things differently than we do now under other conditions. I realize this makes the alignment problem harder if you were hoping to train against current human values and then stick near them, and maybe that’s still a good plan: although it’s conservative and risks astronomical waste by denying us access to full optimization of valuing, that’s probably better than attempting and failing at a more direct approach that is less wasteful but maybe also ends up tiling the universe with smiley faces. My concern is that if we take the more conservative approach, we might fail anyway, because the value abstraction is leaky and we end up building agents that optimize for the wrong things, leaving gaps through which x-risks develop.

(In case it wasn’t clear, AI alignment is hard.)

If any of that left you more confused than when you started reading this, then good, mission accomplished. I continue to be confused about values myself, and this is part of a program of trying to see through them and become deconfused about them, similar to the way I had to deconfuse myself on morality many years ago. Unfortunately not many people are deconfused about values (relatively more are deconfused about morals), so not much is written to guide me along. Look for the next post whenever I get deconfused enough to have more to say.

Cross-posted to Map and Territory on Medium