# IRL in General Environments

Here is a proposal for Inverse Reinforcement Learning in General Environments. (2 1/2 pages; very little math.)

Copying the introduction here:

The eventual aim of IRL is to understand human goals. However, typical algorithms for IRL assume the environment is finite-state Markov, and it is often left unspecified how raw observational data would be converted into a record of human actions, alongside the space of actions available. For IRL to learn human goals, the AI has to consider general environments, and it has to have a way of identifying human actions. Lest these extensions appear trivial, I consider one of the simplest proposals, and discuss some difficulties that might arise.

• My main point is that IRL, as it is typically described, feels nearly complete: just throw in a more advanced RL algorithm as a subroutine and some narrow-AI-type add-on for identifying human actions from a video feed, and voila, we have a superhuman human helper.
[...]
But maybe we could be spending more effort trying to follow through to fully specified proposals which we can properly put through the gauntlet.

Regardless of whether it is intended or not, this sounds like a dig at CHAI’s work. I do not think that IRL is “nearly complete”. I expect that researchers who have been at CHAI for at least a year do not think that IRL is “nearly complete”. I wrote a sequence partly for the purpose of telling everyone “No, really, we don’t think that we just need to run IRL to get the one true utility function; we aren’t even investigating that plan”.

(Sorry, this shouldn’t be directed just at you in particular. I’m annoyed at how often I have to argue against this perception, and this paper happened to prompt me to actually write something.)

Also, I don’t agree that “see if an AIXI-like agent would be aligned” is the correct “gauntlet” to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.

• Regardless of whether it is intended or not, this sounds like a dig at CHAI’s work. I do not think that IRL is “nearly complete”. I expect that researchers who have been at CHAI for at least a year do not think that IRL is “nearly complete”. I wrote a sequence partly for the purpose of telling everyone “No, really, we don’t think that we just need to run IRL to get the one true utility function; we aren’t even investigating that plan”.

I think Stuart Russell still gives this impression in his (many) articles and interviews. I remember getting this impression listening to a recent interview, but will quote this Nov 2018 article instead since many of his interviews don’t have transcripts:

Machines are beneficial to the extent that their actions can be expected to achieve our objectives [...]

It turns out, however, that it is possible to define a mathematical framework leading to machines that are provably beneficial in this sense. That is, we define a formal problem for machines to solve, and, if they solve it, they are guaranteed to be beneficial to us. In its simplest form, it goes like this:

• The world contains a human and a machine.

• The human has preferences about the future and acts (roughly) in accordance with them.

• The machine’s objective is to optimise for those preferences.

• The machine is explicitly uncertain as to what they are. [...]

There are two primary sources of difficulty that we are working on right now: satisfying the preferences of many humans and understanding the preferences of real humans. [...]

Machines will need to “invert” actual human behaviour to learn the underlying preferences that drive it.

Does this not sound like a plan of running (C)IRL to get the one true utility function?

• Does this not sound like a plan of running (C)IRL to get the one true utility function?

I do not think that is actually his plan, but I agree it sounds like it. One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.
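For concreteness, the “uncertainty over preferences/rewards” story is usually formalized as Bayesian reward inference with a Boltzmann-rational model of the human. Here is a minimal toy sketch; the actions, candidate rewards, and rationality parameter are all made up for illustration, not anyone’s actual algorithm:

```python
import math

# Toy Bayesian reward inference (hypothetical names and numbers throughout).
ACTIONS = ["cook", "clean", "nap"]
CANDIDATE_REWARDS = {
    "likes_cooking": {"cook": 1.0, "clean": 0.2, "nap": 0.0},
    "likes_cleaning": {"cook": 0.2, "clean": 1.0, "nap": 0.0},
}
BETA = 2.0  # Boltzmann rationality: higher = human more reliably optimal

def likelihood(action, reward):
    """P(action | reward) under a Boltzmann-rational human model."""
    z = sum(math.exp(BETA * reward[a]) for a in ACTIONS)
    return math.exp(BETA * reward[action]) / z

def update(posterior, action):
    """One Bayesian update of the posterior over candidate rewards."""
    new = {name: p * likelihood(action, CANDIDATE_REWARDS[name])
           for name, p in posterior.items()}
    total = sum(new.values())
    return {name: p / total for name, p in new.items()}

posterior = {"likes_cooking": 0.5, "likes_cleaning": 0.5}
for observed in ["cook", "cook", "nap", "cook"]:
    posterior = update(posterior, observed)

# The posterior shifts toward "likes_cooking" but never collapses to
# certainty; the machine acts under explicit reward uncertainty.
```

The point of the framing is the last comment: the machine’s state is a distribution over rewards, not a single recovered utility function.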

But really my answer is: the inferential distance between Stuart and the typical reader of this forum is very large. (The inferential distance between Stuart and me is very large.) I suspect he has very different empirical beliefs, such that you could reasonably say that he’s working on a “different problem”, in the same way that MIRI and I work on radically different stuff mostly due to different empirical beliefs.

• But really my answer is: the inferential distance between Stuart and the typical reader of this forum is very large. (The inferential distance between Stuart and me is very large.)

I would be interested to better understand Stuart Russell’s perspective. What would you recommend that I read or watch in order to do that?

I suspect he has very different empirical beliefs, such that you could reasonably say that he’s working on a “different problem”, in the same way that MIRI and I work on radically different stuff mostly due to different empirical beliefs.

How many “different problems” would you say that people at CHAI are working on? (Are there more besides yours and Russell’s?) How many people are working on each “different problem”?

• I would be interested to better understand Stuart Russell’s perspective. What would you recommend that I read or watch in order to do that?

Sadly I don’t have any recommendations.

How many “different problems” would you say that people at CHAI are working on? (Are there more besides yours and Russell’s?) How many people are working on each “different problem”?

That’s… hard to answer. I feel like most graduate students at CHAI have a somewhat different opinion of what causes AI risk / what needs to be done to solve it, such that everyone is working on a “different problem”. So really I should be trying to quantify how different they are… but that seems hard to do.

To be clear, I think we all basically agree on high-level aspects, for example that it would be worrying if we had a very intelligent agent that we couldn’t understand, or that a true expected utility maximizer with some simple utility function would likely have convergent instrumental subgoals.

• Sadly I don’t have any recommendations.

This seems like a strange state of affairs. If he thinks there’s an important problem to be solved, and he has a unique perspective on what solving that problem involves, why hasn’t he produced a paper or blog post or talk to explain what that perspective is? Is he expecting to solve the problem all by himself? Can you share your model of what’s going on?

That’s… hard to answer. I feel like most graduate students at CHAI have a somewhat different opinion of what causes AI risk / what needs to be done to solve it, such that everyone is working on a “different problem”.

Same question here. Aside from yourself, the other CHAI grad students don’t seem to have written up their perspectives on what needs to be done about AI risk. Are they content to just each work on their own version of the problem? Are they trying to work out among themselves which “different problem” is the real one?

Maybe one reason to not write up one’s own “different problem” is that one doesn’t expect to be able to convince anyone else to work on it or to receive useful feedback. If that’s the main reason, I argue that it’s still important to write it up in order to provide information to funders, strategists and policy makers about how much disagreement there is among AI safety researchers, and how many resources are needed to “cover all the bases” in technical AI safety research. If this seems like a reasonable argument, maybe you could help convey it to your professors and fellow students?

• This seems like a strange state of affairs. If he thinks there’s an important problem to be solved, and he has a unique perspective on what solving that problem involves, why hasn’t he produced a paper or blog post or talk to explain what that perspective is? Is he expecting to solve the problem all by himself? Can you share your model of what’s going on?

I mean, he has; see Research Priorities for Robust and Beneficial Artificial Intelligence, and the articles you quote. What he hasn’t done is a) read the counterarguments from LessWrongers and b) responded to those counterarguments in particular. When I say I don’t have any recommendations, I mean I don’t have any recommendations of writing that gives responses to typical LessWrong counterarguments.

My model is very simple: he’s very busy and LessWrongers are at best a small fraction of the people he’s trying to coordinate with, so writing up a response is not worth his time.

For a perhaps easier-to-relate-to example, this is approximately my model for why Eliezer doesn’t respond to critiques of his arguments (1, 2).

Another example: the actual view I wanted to get across with the Value Learning sequence is Chapter 3. Chapters 1 and 2, and parts of Chapter 3, were primarily written in anticipation of counterarguments from LessWrongers, and made the Value Learning sequence require significantly more effort on my part.

Same question here. Aside from yourself, the other CHAI grad students don’t seem to have written up their perspectives on what needs to be done about AI risk. Are they content to just each work on their own version of the problem? Are they trying to work out among themselves which “different problem” is the real one?

There is Mechanistic Transparency. But overall I agree that there aren’t many such writeups. I think there’s a combination of factors:

• Expecting a failure to communicate. For example, after I wrote the Value Learning sequence, one of the grad students told me that they learned something from it, because it pinpointed the reason why the argument “the AGI must have a utility function” didn’t work: they already knew that the argument was sketchy, but they couldn’t point at a particular flaw before. If they had tried to write about the reasons for their choice of research, depending on how it was written I’d expect the response from LW would be “but none of this matters; superintelligent AI will be an expected utility maximizer”, and the discussion would stall.

• Many intuitions about what research is useful to do are not easy to express explicitly. It’s very possible to think that a particular area is worth investigating, without being able to explain exactly why you think it is worth investigating.

• Some are probably still trying to figure out what they do / don’t believe about AI safety, and so are working on things that other people think are important.

• Ryan’s point below that writing blog posts on LW is not great for career capital.

• I’ve also previously sent you an email about why people at CHAI don’t use the Alignment Forum as much; many of those reasons will apply. (Not copying them here because I didn’t ask them for permission to post publicly.)

• I mean, he has; see Research Priorities for Robust and Beneficial Artificial Intelligence,

Thanks for this reference, but it’s co-authored with Daniel Dewey and Max Tegmark and seems to serve as an overview of AI safety research agendas that existed in 2015 rather than Stuart Russell’s personal research priorities. (It actually seems to cite MIRI and Bostrom more than anyone else.)

and the articles you quote.

The ones I looked at all seemed to be written at a very high level for a general (not even ML/AI researchers) audience (and as you noted seem to be overly simplified compared to his actual views). What is the best reference for explaining his personal view of AI risk/safety? I’m happy to read something that’s written for a non-LW research audience.

(EDIT: Removed part about grad students, as it seems more understandable at this point for them to not have written up their views yet.)

• seems to serve as an overview of AI safety research agendas that existed in 2015 rather than Stuart Russell’s personal research priorities.

Fair point (I just skimmed it again; I last read it over a year ago). In that case I don’t think there is such a reference, which I agree is confusing. He is working on a book about AI safety that is supposed to be published soon, but I don’t know any details about it.

• Aside from yourself, the other CHAI grad students don’t seem to have written up their perspectives on what needs to be done about AI risk. Are they content to just each work on their own version of the problem?

I think this is actually pretty strategically reasonable.

CHAI students would have high returns to their probability of attaining a top professorship by writing papers, which is quite beneficial for later recruiting top talent to work on AI safety, and quite structurally beneficial for the establishment of AI safety as a field of research. The time they might spend writing up their research strategy does not help with this, nor with recruiting help with their line of work (because other nearby researchers face similar pressures, and because academia is not structured to have PhD students lead large teams).

Moreover, if they are pursuing academic success, they face strong incentives to work on particular problems, and so their research strategies may be somewhat distorted by these incentives, decreasing the quality of a research agenda written in that context.

When I look at CHAI research students, I see some pursuing IRL, some pursuing game theory, some pursuing the research areas of their supervisors (all of which could lead to professorships), and some pursuing projects of other research leaders like MIRI or Paul. This seems healthy to me.

• I’m sorry it sounded like a dig at CHAI’s work, and you’re right that “typically described” is at best a generalization over too many people, and at worst wrong. It would be more accurate to say that when people describe IRL, I get the feeling that it’s nearly complete; I don’t think I’ve seen anyone presenting an idea about IRL flag the concern that the issue of recognizing the demonstrator’s actions might jeopardize the whole thing.

I did intend to cast some doubt on whether the IRL research agenda is promising, and whether inferring a utility function from a human’s actions instead of from a reward signal gets us any closer to safety, but I’m sorry to have misrepresented views. (And maybe it’s worth mentioning that I’m fiddling with something that bears a strong resemblance to Inverse Reward Design, so I’m definitely not that bearish on the whole idea.)

• (Sorry, this shouldn’t be directed just at you in particular. I’m annoyed at how often I have to argue against this perception, and this paper happened to prompt me to actually write something.)

Seems like, aside from Stuart Russell, Max Tegmark (or whoever gave him the following information) is another main person you should blame for this. I just ran across this quote from his Life 3.0 book (while looking for something else):

A currently popular approach to the second challenge is known in geek-speak as inverse reinforcement learning, which is the main focus of a new Berkeley research center that Stuart Russell has launched. [...] However, a key idea underlying inverse reinforcement learning is that we make decisions all the time, and that every decision we make reveals something about our goals. The hope is therefore that by observing lots of people in lots of situations (either for real or in movies and books), the AI can eventually build an accurate model of all our preferences.

• Also, I don’t agree that “see if an AIXI-like agent would be aligned” is the correct “gauntlet” to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.

I’m going to do my best to describe my intuitions around this.

Proposition 1: An agent will be competent at achieving goals in our environment to the extent that its world-model converges to the truth. It doesn’t have to converge all the way, but the KL-divergence from the true world-model to its world-model should reach the order of magnitude of the KL-divergence from the true world-model to a typical human world-model.

Proposition 2: The world-model resulting from Bayesian reasoning with a sufficiently large model class does converge to the truth, so from Proposition 1, any competent agent’s world-model will converge as close to the Bayesian world-model as it does to the truth.

Proposition 3: If the version of an “idea” that uses Bayesian reasoning (on a model class including the truth) is unsafe, then the kind of agent we actually build that is “based on that idea” will either a) not be competent, or b) roughly approximate the Bayesian version and, by default, be unsafe as well (in the absence of some interesting reason why a small confusion about future events will lead to a large deprioritization of dangerous plans).

Letting F be a failure mode that arises when an idea is implemented in the framework of a Bayesian agent with a model class including the truth, I expect, in the absence of arguments otherwise, that the same failure mode will appear in any competent agent which also implements the idea in some way. However, it can be much harder to spot, so I think one of the best ways to look for possible failure modes in the sort of AI we actually build is to analyze the idealized version it approximates: a Bayesian agent with a model class including the truth. And then on the flip side, if the idea still seems to have real value when formalized in a Bayesian agent with a large model class, tractable approximations thereof seem (relatively) likely to work similarly well.
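As a toy illustration of the convergence claim in Proposition 2 (the model class, true bias, and sample count are arbitrary choices for the sketch): a Bayesian reasoner whose model class contains the truth ends up with its posterior concentrated on the true model as observations accumulate.

```python
import random

random.seed(0)

# Toy Proposition 2: Bayesian reasoning with a model class containing the
# truth. Models are candidate biases of a coin; the environment's true
# bias is 0.7. (The grid and the bias are illustrative, not canonical.)
MODELS = [0.1, 0.3, 0.5, 0.7, 0.9]
TRUE_BIAS = 0.7

posterior = {m: 1.0 / len(MODELS) for m in MODELS}
for _ in range(2000):
    obs = 1 if random.random() < TRUE_BIAS else 0
    for m in MODELS:
        posterior[m] *= m if obs else (1.0 - m)
    total = sum(posterior.values())
    posterior = {m: p / total for m, p in posterior.items()}

# The posterior concentrates on the true model: the Bayesian world-model
# converges to the truth (here exactly; in general, to within the
# limits of the model class).
best_model = max(posterior, key=posterior.get)
```

In the richer setting the propositions describe, the model class is over whole environments rather than coin biases, but the mechanism is the same update.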

Maybe you can point me toward the steps that seem the most opaque/fishy.

• Sorry in advance for how unhelpful this is going to be. I think decomposing an agent into “goals”, “world-model”, and “planning” is the wrong way to be decomposing agents. I hope to write a post about this soon.

• No, that’s helpful. If it were the right way, do you think this reasoning would apply?

Edit: Alternatively, if a proposal does decompose an agent into world-model / goals / planning (as IRL does), does the argument stand that we should try to analyze the behavior of a Bayesian agent with a large model class which implements the idea?

• … Plausibly? Idk, it’s very hard for me to talk about the validity of intuitions in an informal, intuitive model that I don’t share. I don’t see anything obviously wrong with it.

There’s the usual issue that Bayesian reasoning doesn’t properly account for embeddedness, but I don’t think that would make much of a difference here.

• IRL to get the one true utility function

I think I’m understanding you to be conceptualizing a dichotomy between “uncertainty over a utility function” vs. “looking for the one true utility function”. (I’m also getting this from your comment below:

One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.

).

I can’t figure out on my own a sense in which this dichotomy exists. To be uncertain about a utility function is to believe there is one correct one, while engaging in the process of updating probabilities about its identity.

Also, for what it’s worth, in the case where there is an unidentifiability problem, as there is here, even in the limit a Bayesian agent won’t converge to certainty about a utility function.
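A minimal illustration of that unidentifiability point (the rewards and numbers are made up): two candidate rewards that differ by a constant induce exactly the same Boltzmann policy, so no amount of behavioral data moves the posterior off the prior.

```python
import math

# Two hypothetical rewards that differ by a constant shift, and hence
# induce the identical Boltzmann policy: behavior cannot distinguish them.
ACTIONS = ["a", "b", "c"]
REWARD_1 = {"a": 0.0, "b": 1.0, "c": 2.0}
REWARD_2 = {"a": 5.0, "b": 6.0, "c": 7.0}  # REWARD_1 + 5 everywhere

def policy(reward, beta=1.0):
    """Boltzmann policy; invariant to adding a constant to the reward."""
    z = sum(math.exp(beta * reward[a]) for a in ACTIONS)
    return {a: math.exp(beta * reward[a]) / z for a in ACTIONS}

posterior = {"R1": 0.5, "R2": 0.5}
for observed in ["c", "c", "b", "a", "c"]:  # any data whatsoever
    like = {"R1": policy(REWARD_1)[observed],
            "R2": policy(REWARD_2)[observed]}
    total = sum(posterior[r] * like[r] for r in posterior)
    posterior = {r: posterior[r] * like[r] / total for r in posterior}

# The posterior never moves: even in the limit, the Bayesian agent does
# not converge to certainty about which utility function is correct.
```

The two rewards here happen to agree on what to do, but other unidentifiable pairs (e.g. under different human-rationality models) need not.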

• I think I’m understanding you to be conceptualizing a dichotomy between “uncertainty over a utility function” vs. “looking for the one true utility function”.

Well, I don’t personally endorse this. I was speculating on what might be relevant to Stuart’s understanding of the problem.

I was trying to point towards the dichotomy between “acting while having uncertainty over a utility function” vs. “acting with a known, certain utility function” (see e.g. The Off-Switch Game). I do know about the problem of fully updated deference and I don’t know what Stuart thinks about it.
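(The Off-Switch Game intuition, with payoffs I have made up: a robot uncertain about the utility u of acting can act now, switch off, or defer to a human who is assumed to permit the action exactly when u > 0, a strong simplification of the actual game.)

```python
# Hypothetical two-outcome Off-Switch-Game-style calculation.
SCENARIOS = [(-2.0, 0.5), (1.0, 0.5)]  # (utility of acting, probability)

value_act = sum(p * u for u, p in SCENARIOS)              # act regardless
value_off = 0.0                                           # switch off now
value_defer = sum(p * max(u, 0.0) for u, p in SCENARIOS)  # human filters

# E[max(u, 0)] >= max(E[u], 0), strictly when the sign of u is uncertain,
# so the uncertain robot prefers to defer (keep the off switch enabled).
```

This is the sense in which acting under reward uncertainty differs from acting with a known utility function: a robot certain of u never gains by deferring.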

Also, for what it’s worth, in the case where there is an unidentifiability problem, as there is here, even in the limit a Bayesian agent won’t converge to certainty about a utility function.

Agreed, but I’m not sure why that’s relevant. Why do you need certainty about the utility function, if you have certainty about the policy?

• Okay, maybe we don’t disagree on anything. I was trying to make a different point with the unidentifiability problem, but it was tangential to begin with, so never mind.

• A good starting point. I’m reminded of an old Kaj Sotala post (which then later provided inspiration for me writing a sort-of-similar post) about trying to ensure that the AI has human-like concepts. If the AI’s concepts are inhuman, then it will generalize in an inhuman way, so that something like teaching a policy through demonstrations might not work.

But of course having human-like concepts is tricky and beyond the scope of vanilla IRL.