Learning Values in Practice

(This talk was given at a public online event on Sunday July 12th. Stuart Armstrong is responsible for the talk, and Jacob Lagerros and Justis Mills edited the transcript.

If you're a curated author and interested in giving a 5-min talk, which will then be transcribed and edited, sign up here.)

Stuart Armstrong: I'm going to talk about Occam's razor, and what it actually means in practice. First, let's recall the No Free Lunch theorem: the surprising idea is that, given only an agent's behavior, you can't infer what is good or bad for that agent.

Now the even more surprising result is that, unlike most No Free Lunch theorems, simplicity does not solve the problem. In fact, if you look at the simplest explanations for human behavior, they tend to be things like "humans are fully rational." So, according to that theory, at every single moment of our lives, every single decision we make is the absolute optimal decision we could have made. Other simple explanations might be that humans are fully anti-rational, or that humans have flat (or zero) preferences.
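To make the degeneracy concrete, here is a minimal sketch (the planners, rewards, and action names are all invented for illustration, not from the talk): the same observed choice is reproduced exactly by a rational agent, by an anti-rational agent with the opposite reward, and by an agent with flat preferences plus a tie-break.

```python
# Hypothetical illustration: three (planner, reward) pairs that all
# reproduce the same observed behavior, so behavior alone cannot
# distinguish between them.

def greedy(reward):
    """A 'fully rational' planner: pick the action maximizing reward."""
    return lambda actions: max(actions, key=reward)

def anti_greedy(reward):
    """A 'fully anti-rational' planner: pick the action minimizing reward."""
    return lambda actions: min(actions, key=reward)

actions = ["tea", "coffee"]
observed_choice = "tea"  # what we actually see the agent do

r_rational = {"tea": 1, "coffee": 0}.get  # "prefers tea, acts rationally"
r_anti = {"tea": 0, "coffee": 1}.get      # "prefers coffee, acts anti-rationally"
r_flat = lambda a: 0                      # "no preferences at all"

print(greedy(r_rational)(actions))   # tea
print(anti_greedy(r_anti)(actions))  # tea
print(greedy(r_flat)(actions))       # tea (any tie-break also fits the data)
```

All three hypotheses fit the data perfectly, which is exactly why simplicity alone cannot pick between them.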

To explore one part of the problem, I'm going to start with a situation where you know everything about the agent. You know the agent has an input device that takes observations, which feed into knowledge.

Now, human preferences can be changed by information, so a bit of information influences preferences. And then knowledge and preferences combine to make the agent choose an action. This action goes out into the world, goes into the environment, and you get another observation. And so on. So this agent is absolutely known.

But notice that there's a lot here of what Eliezer Yudkowsky would call "Suggestively Named Lisp Tokens."

If I said, "This is the agent design," and gave you the agent but erased the labels, could you reconstruct it? Well, not necessarily.

This is something that is perfectly compatible with that decomposition. I've just interchanged knowledge and preferences, because they have the same form, in terms of nodes. But again, this is also massively oversimplifying, because each of these nodes has an internal structure; they're pieces of algorithm. Things happen inside them.
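One way to see the relabeling problem is a toy construction (entirely invented here, not the talk's formalism): two agents whose "knowledge" and "preferences" slots have been interchanged are observationally identical, so the labels cannot be recovered from behavior alone.

```python
# Hypothetical sketch: swapping the contents of the "knowledge" and
# "preferences" nodes leaves the agent's input-output behavior unchanged.

class Agent:
    def __init__(self, knowledge, preferences):
        self.knowledge = knowledge
        self.preferences = preferences

    def act(self, options):
        # Knowledge and preferences combine to score options; from the
        # outside, only the combined scores are visible.
        scores = {**self.knowledge, **self.preferences}
        return max(options, key=lambda o: scores.get(o, 0))

original = Agent(knowledge={"tea": 1}, preferences={"coffee": 2})
relabeled = Agent(knowledge={"coffee": 2}, preferences={"tea": 1})  # slots swapped

options = ["tea", "coffee"]
print(original.act(options) == relabeled.act(options))  # True
```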

I might say that this blue box-ish thing is knowledge, the input device is a tiny piece there, and the preferences are that little sliver there. There are internal structures to those various nodes where this is a plausible explanation of the algorithm. The point is that even if you do know exactly what the algorithm is, there's still a problem: this all assumes a Cartesian boundary between the world and the agent. As we know from people working at MIRI, that assumption doesn't hold, and the boundary is not obvious.

Is your body part of this blue box? Are your nerves part of it? If you're driving a car, people often say that they feel the wheels of the car, so where is this boundary?

Again, if I start talking about knowledge and preferences, and what the agent is, things might all go wrong without a better definition. And of course in the real world, things are vastly more complicated than this simple picture I've drawn.

This illustrates part of the problem. That's to show you how the thing can be problematic in practice, when translated to various other domains.

So, to conclude:

Regularization is insufficient for inverse reinforcement learning. Regularization is basically simplification in this context. You can't deduce things with just a good regularizer.
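As a hedged illustration of why a regularizer doesn't rescue inverse reinforcement learning (the setup below is invented for this purpose): two planner-reward hypotheses fit the observed choice equally well and incur identical L2 penalties, so the regularizer cannot break the tie between them.

```python
# Hypothetical example: an L2 regularizer on the reward cannot separate
# the "rational" and "anti-rational" explanations of the same data.

def rational(reward, options):
    return max(options, key=reward.get)

def anti_rational(reward, options):
    return min(options, key=reward.get)

observed = "left"
options = ["left", "right"]

r1 = {"left": 1.0, "right": -1.0}  # hypothesis: prefers left, rational
r2 = {"left": -1.0, "right": 1.0}  # hypothesis: prefers right, anti-rational

l2 = lambda reward: sum(v * v for v in reward.values())

print(rational(r1, options) == observed)       # True: fits the data
print(anti_rational(r2, options) == observed)  # True: also fits the data
print(l2(r1) == l2(r2))                        # True: same regularizer penalty
```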

Unsupervised learning can never get the preferences of an agent. You need at least semi-supervised learning. There has to be some supervision in there.

Human theory of mind can't be deduced merely by observing humans. If you could, you would solve the preference problem, which you can't.

Suppose a programmer wants to filter out clickbait, but their fix removes something that shouldn't have been removed. Well, in that case the programmer might say, "No, that's obviously not clickbait. Let's put it back. We'll patch this bug." Often what they're doing in this sort of situation is injecting their own values.

The error isn't an objective bug; it's their own values that are being injected. It kind of works because humans tend to have very similar values to each other, but it's not objective.

Finally, just to reiterate: the values learned are determined by the implicit and explicit assumptions that you make in the system, so it's important to get those right.


Daniel Filan: It seems like a crucial aspect of this impossibility result (or in general the way you're thinking about it) is that decisions are made by some decision-making algorithm that takes in the state of the world as an input. And preferences in the model just means anything that could produce decisions that are somehow related to the decision-making algorithm.

But in reality, I think we have a richer notion of preferences than that. For example, if you prefer something, you might be more inclined to choose it over other things.

To what extent do you think that we should move to a notion of preferences that's more robust than "anything inside your head that affects your decisions in any possible way"?

Stuart Armstrong: Yes, we need to move to a richer version of preferences. The point of the slide is that instead of seeing the agent as a black box, I'm using some very loose assumptions. I'm assuming that this is the network for how preferences and knowledge and actions feed into each other. I was just showing that adding that assumption doesn't seem to be nearly enough to get you anywhere. It gets you somewhere. We've sliced away huge amounts of possibilities. But it doesn't get us nearly far enough.

Basically, the more that we know about human preferences, and the more that we know about the structures, the more we can slice and get down to the bit that we want. It's my hope that we don't have to put too many handcrafted assumptions into our model, because we're not nearly as good at that sort of design as we are at search, according to a talk I heard recently.

Ben Pace: Where do you think we can make progress in this domain, whether it's new negative results or a more positive theory of which things should count as preferences?

Stuart Armstrong: I'm working with someone at DeepMind, and we're going to be doing some experiments in this area, and then I'll be able to give you a better idea.

However, what to me would be the ultimate thing would be to unlock all of the psychology research. Basically, there's a vast amount of psychology research out there. This is a huge amount of information. Now, we can't use it for the moment because it's basically just text, and we need a certain critical mass of assumptions before we can make use of things that say, "This is a bias. This isn't."

My hope is that the amount of stuff we need in order to unlock all that research is not that huge. So basically, once we can get started, there's a huge amount available. Maybe the first 2% will be the hardest, and the remaining 98% will be easy after that.

Scott Garrabrant: Connecting to Alex's talk, I'm curious if you have anything to say about the difference between extracting preferences from a thing that's doing search, as opposed to extracting preferences from a thing that's doing design?

Stuart Armstrong: Well, one of the big problems is that all the toy examples are too simple to be of much use. If you have an algorithm that just does unrestricted search, it's a relatively short algorithm. And if I tell you this is an algorithm that does unrestricted search, there aren't many different ways that I can interpret it. So I need very little information to home in on what it actually is.

The thing with design is that it's intrinsically more complicated, so a toy example might be more plausible. But I'm not sure: if you compared a bounded, realistic search with a bounded, realistic design, and you knew what each one was, I don't know which would be easier to fit preferences to.

I don't think it's an artifact of search versus design; I think it's an artifact of the fact that toy search, or practical search, is simpler, and it's easier to fit things to simpler algorithms.
