Hierarchical system preferences and subagent preferences

During NeurIPS, I had a discussion with Dylan Hadfield-Menell about how to formalise my methods for extracting human values.

He brought up the issue of how to figure out the goals of a hierarchical system—for example, a simplified corporation. It’s well known that not everyone in a corporation acts in the best interests of the corporation, so a corporation will not act at all like a rational shareholder value maximiser. Despite this, is there a way of somehow deducing, from its behaviour and its internal structure, that its true goal is shareholder value maximisation?

I immediately started taking the example too literally, and objected to the premise, mentioning the goals of subagents within the corporation. We were to a large extent talking past each other; I was strongly aware of the fact that even if a system had its origin in one set of preferences, that didn’t mean it couldn’t have disparate preferences of its own. Dylan, on the other hand, was interested in formalising my ideas in a simple setting. This is especially useful, as the brain itself is a hierarchical system with subagents.

So here is my attempt to formalise the problem, and present my objection, all at the same time.

The setting and actions

I’m choosing a very simplified version of the hierarchical planning problem presented in this paper.

A robot, in green, is tasked with moving the object A to the empty bottom storage area. The top level of the hierarchical planner formulates the plan of gripping object A.

This plan is then passed down to a lower level of the hierarchy, which realises that it must first get B out of the way. This is passed to a lower level that is tasked with gripping B. It does so, but, for some reason, it first turns on the music.

Then it moves B out of the way (this will involve moving up and down the planning hierarchy a few times, by first gripping B, then moving it, then releasing the grip, etc.).

Then control moves up the hierarchy again, and the higher level re-issues the “grip A” command. The robot does that, and eventually moves A into storage.
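(To make the structure concrete, here is a toy Python sketch of such a hierarchy; the level names and the task decomposition are my own illustration of the setup above, not the algorithm from the paper, and the music step is left out for now.)

```python
# A toy two-stage hierarchy for "move A to storage": the top level produces
# abstract tasks, and a lower level refines each one into primitive actions.
# All names here are illustrative.

def top_level():
    # The high-level plan: grip A, then move it to storage.
    return ["grip A", "move A to storage"]

def refine(task):
    # Gripping A is blocked by B, so the lower level first clears B away
    # (which itself involves gripping B, moving it, and releasing it).
    if task == "grip A":
        return ["grip B", "move B aside", "release B", "grip A"]
    return [task]

def execute(primitive):
    print("executing:", primitive)

for task in top_level():
    for primitive in refine(task):
        execute(primitive)
# executing: grip B
# executing: move B aside
# executing: release B
# executing: grip A
# executing: move A to storage
```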

The irrationality of music

Now this seems to be mainly a rational plan for moving A to storage, except for the music part. Is the system trying to move A but being irrational at one action? Or does the music reflect some sort of preference of the hierarchical system?

Well, to sort that out, we can look within the agent. Let’s look at the level of the hierarchy that is tasked with gripping B, and that turns on the music. In a massively pseudo version of pseudocode, here is one version of this level:
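(The Python sketch below is a minimal stand-in for that pseudocode: the plan names, the toy world model, and the criteria() check are all illustrative assumptions, not a real planner.)

```python
# Algorithm 1 (illustrative): the level of the hierarchy tasked with gripping B.
# It believes (wrongly) that turning on the radio helps grip B.

GOAL = "B_gripped"

# The level's internal model of what each plan achieves (partly deluded)...
PREDICTED_EFFECTS = {
    "turn_on_radio":   {"B_gripped": True, "music_on": True},  # the delusion
    "grip_B_directly": {"B_gripped": True},
}

# ...versus what the plans actually do in the world.
ACTUAL_EFFECTS = {
    "turn_on_radio":   {"music_on": True},
    "grip_B_directly": {"B_gripped": True},
}

def criteria(plan, goal):
    """Does the level predict that this plan moves the state closer to the goal?"""
    return PREDICTED_EFFECTS[plan].get(goal, False)

def execute(plan, state):
    """Apply the plan's real effects to the world state."""
    new_state = dict(state)
    new_state.update(ACTUAL_EFFECTS[plan])
    return new_state

def grip_B_level(state):
    for plan in ["turn_on_radio", "grip_B_directly"]:
        if criteria(plan, GOAL):
            state = execute(plan, state)
            if state.get(GOAL):
                return state  # success: hand control back up the hierarchy
    return state

print(grip_B_level({"B_gripped": False, "music_on": False}))
# {'B_gripped': True, 'music_on': True} -- the radio plan fails to grip B,
# so the level falls back on gripping B the conventional way.
```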

The criteria just looks at the plans and gauges whether they will move the state closer to the goal of gripping B.

In this situation, it’s clear that the algorithm is trying to grip B, but has some delusion that turning the radio on will help it do so (maybe in a previous experiment the vibrations shook B loose?). The algorithm will first try the radio plan, and, when it fails, will then try the second plan and grip B the conventional way.

Contrast that with this code:
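(Again a minimal illustrative sketch rather than the real thing; in particular the variable name "enjoyment" below is my own placeholder.)

```python
# Algorithm 2 (illustrative): the same "grip B" level, except that its choice
# of plans is also driven by a goal of its own. The name "enjoyment" is a
# placeholder; whether it denotes anything like enjoyment is exactly the
# syntax-versus-semantics problem discussed below.

GOAL = "B_gripped"

ACTUAL_EFFECTS = {
    "turn_on_radio":   {"music_on": True},
    "grip_B_directly": {"B_gripped": True},
}

def enjoyment(state):
    """The subagent's own score: it simply prefers having the music on."""
    return 1 if state.get("music_on") else 0

def execute(plan, state):
    new_state = dict(state)
    new_state.update(ACTUAL_EFFECTS[plan])
    return new_state

def grip_B_level(state):
    # First pursue the level's own preference...
    if enjoyment(state) == 0:
        state = execute("turn_on_radio", state)
    # ...then carry out the task it was handed from above.
    if not state.get(GOAL):
        state = execute("grip_B_directly", state)
    return state

print(grip_B_level({"B_gripped": False, "music_on": False}))
# {'B_gripped': True, 'music_on': True} -- same outward behaviour, but here the
# radio is turned on deliberately, for its own sake, not as a failed grip plan.
```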

Now, there is the usual problem with calling a variable something like “enjoyment”, and that’s the reason I’ve been looking into syntax and semantics.

But even without solving these issues, it’s clear that this part of the system has inherited a different goal from the rest of the hierarchical system. It’s a subagent with non-aligned preferences. Turning on the music is not an accident; it’s a deliberate and rational move on the part of this subagent.

The goals of the system

For Algorithm 1, the system clearly has the goal of moving A to storage, and has a failure of rationality along the way.

For Algorithm 2, there are several possible goals: moving A to storage, turning the radio on, and various mixes of these goals. Which is the right interpretation of the goal? We can model the hierarchy in different ways that suggest different intuitions for the “true” goal—for example, Algorithm 2 could actually be a human, acting on instructions from a manager. Or it could be a sub-module of a brain, one that is unendorsed by the rest of the brain.

My take on it is that the “true goal” of the system is underspecified. From the outside, my judgement would be that the closer the subagent resembles an independent being, and the closer its liking for music resembles true enjoyment, the more its preferences should count as genuine goals of the system.

From the inside, though, the system may have many meta-preferences that push it in one direction or another. For example, if there were a module that analysed the performance of the rest of the algorithm and recommended changes, and if that module systematically changed the system towards becoming an A-storage-mover (such as by recommending deleting the subagent’s music goal), then that would be a strong meta-preference towards only having the storage goal. And if we used the methods here, we’d get moving A as the main goal, plus some noise.
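(As a purely illustrative sketch of what such a module could look like, reusing the toy goal names from the sketches above:)

```python
# A toy "performance reviewer" meta-module: it inspects the goals each level
# is pursuing and recommends deleting any goal that does not serve the task
# of getting A into storage. Names and structure are illustrative only.

SYSTEM_GOAL = "A_in_storage"

# Subgoals that actually appear in plans for getting A into storage.
USEFUL_SUBGOALS = {"B_gripped", "A_gripped", SYSTEM_GOAL}

def review(level_goals):
    """Return the edits this meta-module recommends for each level."""
    recommendations = []
    for level, goals in level_goals.items():
        for goal in goals:
            if goal not in USEFUL_SUBGOALS:
                recommendations.append((level, "delete goal", goal))
    return recommendations

print(review({"grip_B_level": ["B_gripped", "music_on"]}))
# [('grip_B_level', 'delete goal', 'music_on')] -- a module that consistently
# issues recommendations like this amounts to a strong meta-preference for
# being a pure A-storage-mover.
```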

Conversely, maybe there are processes that are ok with the subagent’s behaviour. A modern corporation, for instance, has legal limits on what it can inflict on its workers, so some of the meta-preferences of the corporation (located for example in HR and the legal department) are geared towards a livable work environment, even if this doesn’t increase shareholder value.

(Now, I still might like to intervene, in some cases—even if the goal of a slave state invading its neighbours is clear at the object and meta level, I can still value its subagents even if it doesn’t).

In the absence of some sort of meta-preferences, there are multiple ways of establishing the preferences of a hierarchical system, and many of them are equally valid.