Learning preferences by looking at the world

Link post

We’ve written up a blog post about our recent paper that I’ve been linking to but haven’t really announced or explained. The key idea is that since we’ve optimized the world towards our preferences, we can infer these preferences just from the state of the world. We present an algorithm called Reward Learning by Simulating the Past (RLSP) that can do this in simple environments, but my primary goal is simply to show that there is a lot to be gained by inferring preferences from the world state.

The rest of this post assumes that you’ve read at least the non-technical part of the linked blog post. This post is entirely my own and may not reflect the views of my coauthors.

Other sources of intuition

The story in the blog post is that when you look at the state of the world, you can figure out what humans have put effort into, and thus what they care about. There are other intuition pumps that you can use as well:

  • The world state is “surprisingly” ordered and low-entropy. Anywhere you see such order, you can bet that a human was responsible for it, and that the human cared about it.

  • If you look across the world, you’ll see many patterns recurring again and again: vases are usually intact, glasses are usually upright, and laptops are usually on desks. Patterns that wouldn’t have happened without humans are likely something humans care about.

How can a single state do so much?

You might be wondering how a single state could possibly contain so much information. And you would be correct to wonder that. This method depends crucially on the assumption of known dynamics (i.e. a model of “how the world works”) and a good featurization.

Known dynamics. This is what allows you to simulate the past and figure out what “must have happened”. Using the dynamics, the robot can figure out that breaking a vase is irreversible, and that Alice must have taken special care to avoid doing so. This is also what allows us to distinguish between effects caused by humans (which we care about) and effects caused by the environment (which we don’t care about).
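To make this concrete, here is a minimal sketch of the “simulate the past” computation in a toy vase gridworld. Everything here (the grid, the features, the candidate rewards, the Boltzmann model of Alice) is made up for illustration and is not the paper’s implementation: we score a candidate reward by how likely a Boltzmann-rational Alice, acting under the known dynamics, would have been to produce the observed state.

```python
import math

# Toy 2x3 gridworld: Alice starts at (0, 0); a vase sits at (0, 1); the
# target is at (0, 2). Stepping onto the vase cell breaks the vase
# irreversibly, which the known dynamics encode.
ROWS, COLS = 2, 3
VASE, TARGET = (0, 1), (0, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
STATES = [(r, c, v) for r in range(ROWS) for c in range(COLS) for v in (True, False)]

def step(state, action):
    (r, c, intact), (dr, dc) = state, action
    nr = min(max(r + dr, 0), ROWS - 1)
    nc = min(max(c + dc, 0), COLS - 1)
    return (nr, nc, intact and (nr, nc) != VASE)

def reward(theta, state):
    r, c, intact = state
    # Linear reward over two features: "vase intact" and "at the target".
    return theta[0] * intact + theta[1] * ((r, c) == TARGET)

def likelihood(theta, start, observed, T):
    # Backward pass: soft (maximum-entropy) value iteration gives a
    # Boltzmann-rational policy for Alice at each timestep.
    V = {s: 0.0 for s in STATES}
    policies = []
    for _ in range(T):
        Q = {s: [reward(theta, step(s, a)) + V[step(s, a)] for a in ACTIONS]
             for s in STATES}
        V = {s: math.log(sum(math.exp(q) for q in Q[s])) for s in STATES}
        policies.append({s: [math.exp(q - V[s]) for q in Q[s]] for s in STATES})
    # Forward pass ("simulating the past"): distribution over where Alice
    # ends up after T steps if she follows that policy.
    dist = {start: 1.0}
    for pi in reversed(policies):  # policies were built back-to-front
        nxt = {}
        for s, p in dist.items():
            for a, pa in zip(ACTIONS, pi[s]):
                nxt[step(s, a)] = nxt.get(step(s, a), 0.0) + p * pa
        dist = nxt
    return dist.get(observed, 0.0)

start, observed = (0, 0, True), (0, 2, True)  # vase still intact at deployment
ignores_vase = (0.0, 1.0)  # reward only for being at the target
cares_vase = (1.0, 1.0)    # also rewarded on every step the vase is intact
print(likelihood(cares_vase, start, observed, T=4) >
      likelihood(ignores_vase, start, observed, T=4))  # True
```

The intact vase is evidence: a vase-indifferent Alice would most likely have walked straight through the vase cell, so the observed state is much better explained by a reward that values the vase.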

If you take away the knowledge of dynamics, much of the oomph of this method is gone. You could still look for and preserve repetitions in the state: maybe there are a lot of intact vases and no broken vases, so you try to keep vases intact. But this might also lead you to make sure that nobody puts warning signs near cliffs, since most cliffs don’t have warning signs near them.

But notice that dynamics are an empirical fact about the world, and do not depend on “values”. We should expect powerful AI systems to have a good understanding of dynamics. So I’m not too worried about the fact that we need to know dynamics for this to work well.

Features. A good featurization, on the other hand, allows you to focus on reward functions that are “reasonable” or “about the important parts”. It eliminates a vast swathe of strange, implausible reward functions that you otherwise would not be able to eliminate. If you didn’t have a good featurization, and instead allowed the reward to be an arbitrary function from states to rewards, then you would typically learn some degenerate reward, such as one that maps the observed state to reward 1 and everything else to reward 0. (IRL faces the same problem of degenerate rewards. Since we observe strictly less than IRL does, we face the same problem.)
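As a toy illustration of the degenerate-reward problem (a made-up 3-state chain, not one of the paper’s environments): if the reward can be an arbitrary function of the state, then the indicator reward on whichever state we happen to observe explains that observation near-perfectly, so the inference is vacuous.

```python
import math

# Toy 3-state chain, deterministic moves, horizon 2. With an unrestricted
# tabular reward (one free parameter per state), the degenerate reward
# "large value on the observed state, 0 elsewhere" rationalizes ANY
# observation almost perfectly.
N, ACTIONS, T = 3, (-1, 0, 1), 2

def step(s, a):
    return min(max(s + a, 0), N - 1)

def likelihood(reward, start, observed):
    # Soft value iteration for a Boltzmann-rational agent, then a forward
    # pass to get the distribution over its final state.
    V = [0.0] * N
    policies = []
    for _ in range(T):
        Q = [[reward[step(s, a)] + V[step(s, a)] for a in ACTIONS] for s in range(N)]
        V = [math.log(sum(math.exp(q) for q in Qs)) for Qs in Q]
        policies.append([[math.exp(q - V[s]) for q in Qs] for s, Qs in enumerate(Q)])
    dist = [0.0] * N
    dist[start] = 1.0
    for pi in reversed(policies):  # policies were built back-to-front
        nxt = [0.0] * N
        for s, p in enumerate(dist):
            for a, pa in zip(ACTIONS, pi[s]):
                nxt[step(s, a)] += p * pa
        dist = nxt
    return dist[observed]

# Whatever state we observe, its indicator reward explains it near-perfectly,
# so maximum-likelihood inference over tabular rewards learns nothing useful.
for target in range(N):
    degenerate = [10.0 if s == target else 0.0 for s in range(N)]
    print(target, likelihood(degenerate, start=0, observed=target))
```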

I’m not sure whether features are more like empirical facts, or more like values. It sure seems like there are very natural ways to understand the world that imply a certain set of features, and that a powerful AI system is likely to have these features; but maybe it only feels this way because we humans actually use those features to understand the world. I hope to test this in future work by trying out RLSP-like algorithms in more realistic environments where we first learn features in an unsupervised manner.

Connection to impact measures

Preferences inferred from the state of the world are kind of like impact measures, in that they allow us to infer all of the “common sense” rules that humans follow that tell us what not to do. The original motivating example for this work was a more complicated version of the vase environment, which is the standard example for negative side effects. (It was more complicated because at the time I thought it was important for there to be “repetitions” in the environment, e.g. multiple intact vases.)

Desiderata. I think that there are three desiderata for impact measures that are very hard to meet in concert. Let us say that an impact measure must also specify the set of reward functions it is compatible with. For example, attainable utility preservation (AUP) aims to be compatible with rewards whose codomain is [0, 1]. Then the desiderata are:

  • Prevent catastrophe: The impact measure prevents all catastrophic outcomes, regardless of which compatible reward function the AI system optimizes.

  • Do what we want: There exists some compatible reward function such that the AI system does the things that we want, despite the impact measure.

  • Value agnostic: The design of the impact measure (both the penalty and the set of compatible rewards) should be agnostic to human values.

Note that the first two desiderata are about what the impact measure actually does, as opposed to what we can prove about it. The second one is an addition I’ve argued for before.

With both relative reachability and AUP, I worry that any setting of the hyperparameters will lead to a violation of either the first desideratum (if the penalty is not large enough) or the second one (if the penalty is too large). For intermediate settings, both desiderata would be violated.
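A toy illustration of this tension, with entirely hypothetical numbers (this is not AUP or relative reachability, just an agent maximizing task reward minus a scaled impact penalty): when the catastrophic action has high task reward and modest measured impact, no penalty weight both rules it out and leaves the desired action optimal.

```python
# Hypothetical (reward, impact) numbers for three available actions. The
# agent picks argmax over actions of: task_reward - lam * impact_penalty.
actions = {
    "desired":      {"reward": 6.0,  "impact": 3.0},
    "catastrophic": {"reward": 10.0, "impact": 4.0},
    "no-op":        {"reward": 0.0,  "impact": 0.0},
}

def choice(lam):
    return max(actions, key=lambda a: actions[a]["reward"] - lam * actions[a]["impact"])

# Sweep the penalty weight: is there any setting where the agent does the
# desired thing (rather than the catastrophe or nothing at all)?
ok = [lam / 10 for lam in range(0, 101) if choice(lam / 10) == "desired"]
print(ok)  # []: every weight either allows the catastrophe or blocks the task
```

With these numbers, a small penalty leaves the catastrophic action optimal, and by the time the penalty is large enough to rule it out, the desired action is dominated by doing nothing.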

When we infer preferences from the state of the world, we are definitely giving up on being value agnostic, but we are gaining significantly on the “do what we want” desideratum: the point of inferring preferences is that we do not also penalize positive impacts that we want to happen.

Test cases. You might wonder why we didn’t try using RLSP on the environments in relative reachability. The main problem is that those environments don’t satisfy our key assumption: that a human has been acting to optimize their preferences for some time. So if you try to run RLSP in that setting, it is very likely to fail. I think this is fine, because RLSP is exploiting a fact about reality that those environments fail to model.

(This is a general problem with benchmarks: they often do not include important aspects of the real problem under consideration, because the benchmark designers didn’t realize that those aspects were important for a solution.)

This is kind of related to the fact that we are not trying to be value agnostic. If you were trying to come up with a value-agnostic, objective measure of impact, it would make sense to create some simple gridworld environments and claim that any objective measure of impact should give the same result on each environment, since one action is clearly more impactful than the other. Since we’re not trying to be value agnostic, that argument doesn’t apply.

If you take the test cases, put them in a more realistic context, make your model of the world sufficiently large and powerful, don’t worry about compute, and imagine a variant of RLSP that somehow learns good features of the world, then I would expect that RLSP could solve most of the impact measure test cases.

What’s the point?

Before people start pointing out how a superintelligent AI system would game the preferences learned in this way, let me be clear: the goal is not to use the inferred preferences as a utility function. There are many reasons this is a bad idea, but one argument is that unless you have a good mistake model, you can’t exceed human performance, which means that (for the most part) you want to leave the state the way it already is.

In other words, we are also not trying to achieve the “Prevent catastrophe” desideratum above. We are instead going for the weaker goal of preventing some bad outcomes, and learning more of human preferences without increasing the burden on the human overseer.

You can also think of this as a contribution to the overall paradigm of value learning: the state of the world is an especially good source of information about our preferences over what not to do, which are particularly hard to get feedback on.

If I had to point towards a particular concrete path to a good future, it would be the one that I outlined in Following human norms. We build AI systems that have a good understanding of “common sense” or “how to behave normally in human society”; they accelerate technological development and improve decision-making; and if we really want a goal-directed AI that is not under our control but that optimizes for our values, then we solve the full alignment problem in the future. Inferring preferences or norms from the world state could be a crucial part of helping our AI systems understand “common sense”.


There are a bunch of reasons why you couldn’t take RLSP, run it on the real world, and hope to get a set of preferences that prevent you from causing negative impacts. Many of these are interesting directions for future work:

Things we don’t affect. We can’t affect quasars even if we wanted to, and so quasars are not optimized for our preferences, and RLSP will not be able to infer anything about our preferences about quasars.

We are optimized for the environment. You might reply that we don’t really have strong preferences about quasars (but don’t we?), but even then, evolution has optimized us to prefer our environment, even though we haven’t optimized it. For example, you could imagine that RLSP infers that we don’t care about the composition of the atmosphere, or infers that we prefer there to be more carbon dioxide in the atmosphere. Thanks to Daniel Filan for making this point way back at the genesis of this project.

Multiple agents. RLSP assumes that there is exactly one human acting in the environment; in reality there are billions, and they do not have the same preferences.

Non-static preferences. Or as Stuart Armstrong likes to put it, our values are underdefined, changeable, and manipulable, whereas RLSP assumes they are static.

Not robust to misspecification and imperfect models. If you have an incorrect model of the dynamics, or a bad featurization, you can get very bad results. For example, if you can tell the difference between dusty vases and clean vases, but you don’t realize that by default dust accumulates on vases over time, then you infer that Alice actively wants her vase to be dusty.
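Here is a hedged sketch of that dusty-vase failure, with all dynamics and numbers invented for illustration: the same observation of a dusty vase is strong evidence that Alice wants dust under a misspecified model where dust never appears on its own, and only weak evidence under the true model where dust accumulates by default.

```python
import math

# Two states: 0 = clean vase, 1 = dusty vase. Two actions: NOOP, CLEAN.
# True dynamics: dust accumulates on its own (NOOP on a clean vase leaves it
# dusty with probability 0.9). Misspecified dynamics: dust never appears by
# itself, so a dusty vase can only be a deliberate choice.
NOOP, CLEAN = 0, 1
STATES, ACTIONS = (0, 1), (NOOP, CLEAN)

def true_dynamics(s, a):
    if a == CLEAN:
        return [(1.0, 0)]
    return [(1.0, 1)] if s == 1 else [(0.9, 1), (0.1, 0)]

def wrong_dynamics(s, a):
    return [(1.0, 0)] if a == CLEAN else [(1.0, s)]

def likelihood(theta, dynamics, T=3):
    # P(observe a dusty vase after T steps), assuming Alice is
    # Boltzmann-rational with per-step reward theta for dustiness and a
    # uniform prior over the initial state.
    V = {s: 0.0 for s in STATES}
    policies = []
    for _ in range(T):
        Q = {s: [sum(p * (theta * s2 + V[s2]) for p, s2 in dynamics(s, a))
                 for a in ACTIONS] for s in STATES}
        V = {s: math.log(sum(math.exp(q) for q in Q[s])) for s in STATES}
        policies.append({s: [math.exp(q - V[s]) for q in Q[s]] for s in STATES})
    dist = {0: 0.5, 1: 0.5}
    for pi in reversed(policies):  # policies were built back-to-front
        nxt = {0: 0.0, 1: 0.0}
        for s in STATES:
            for a in ACTIONS:
                for p, s2 in dynamics(s, a):
                    nxt[s2] += dist[s] * pi[s][a] * p
        dist = nxt
    return dist[1]

# Evidence for "Alice wants dust": likelihood ratio of a dust-loving Alice
# (theta = 2) vs an indifferent one (theta = 0), given a dusty vase.
ratio_wrong = likelihood(2.0, wrong_dynamics) / likelihood(0.0, wrong_dynamics)
ratio_true = likelihood(2.0, true_dynamics) / likelihood(0.0, true_dynamics)
print(ratio_wrong > ratio_true)  # True
```

Under the wrong model, an indifferent Alice would almost surely have cleaned at some point, so the dusty observation looks like strong evidence of a preference for dust; under the true model, dust is mostly explained by the environment.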

Using a finite-horizon policy for Alice instead of an infinite-horizon policy. The math in RLSP assumes that Alice was optimizing her reward over an episode that would end exactly when the robot is deployed, so that the observed state is Alice’s “final state”. This is clearly a bad model, since Alice will still be acting in the environment after the robot is deployed. For example, if the robot is deployed the day before Alice is scheduled to move, the robot might infer that Alice really wants there to be a lot of moving boxes in her living space (rather than realizing that this is an instrumental goal in a longer-term plan).

There’s no good reason for using a finite-horizon policy for Alice. We were simply following Maximum Causal Entropy IRL, which makes this assumption (it is much more reasonable when you observe demonstrations rather than the state of the world), and we didn’t realize our mistake until we were nearly done. The finite-horizon version worked sufficiently well that we didn’t redo everything with the infinite-horizon case, which would have been a significant amount of work.