Learning human preferences: black-box, white-box, and structured white-box access

This post is inspired by system identification; however, I’m not an expert in that domain, so any corrections or inspirations on that front are welcome.

I want to thank Rebecca Gorman for her idea of using system identification, and for her conversations developing the concept.

Knowing an agent

This is an agent:

Fig. 1

We want to know about its internal mechanisms, its software. But there are several things we could mean by that.

Black-box

First of all, we might be interested in knowing its input-output behaviour. I’ve called this its policy in previous posts: a full map that will allow us to predict its output in any circumstances:

Fig. 2

I’ll call this black-box knowledge of the agent’s internals.
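
As a minimal sketch (in Python; the Agent class and its toy behaviour are invented stand-ins for figure 1), black-box knowledge is just the observation-to-action map, with nothing about how it is computed:

```python
# Minimal sketch: black-box knowledge is just the observation -> action map.
# The Agent class and its toy behaviour are invented for illustration; from
# the black-box point of view, the body of act() is invisible to us.

class Agent:
    def act(self, observation: str) -> str:
        return "duck" if "threat" in observation else "rest"

agent = Agent()
partial_policy = {obs: agent.act(obs) for obs in ["threat ahead", "quiet room"]}
print(partial_policy)   # a fragment of the full input-output map (the policy)
```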

White-box

We might be interested in knowing more about what’s actually going on in the agent’s algorithm, not just the outputs. I’ll call this white-box knowledge; we would be interested in something like this (along with a detailed understanding of the internals of the various modules):

Fig. 3

Structured white-box

And, finally, we might be interested in knowing what the internal modules actually do, or actually mean. This is the semantics of the algorithm, resulting in something like this:

Fig. 4

The “beliefs”, “preferences”, and “action selectors” are tags that explain what these modules are doing. The tags are part of the structure of the algorithm, which includes the arrows and setup.

If we know those, I’d call it structured white-box knowledge.
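
To make the three levels concrete, here is a purely illustrative sketch (the module contents, wiring, and tags are all invented, not taken from the figures): white-box knowledge is the code of the modules plus the arrows between them; structured white-box knowledge adds the tags saying what each module is for.

```python
# Illustrative sketch only: the module contents and wiring are invented.
# White-box knowledge = the code of the modules plus the arrows between them.
# Structured white-box knowledge = the same, plus tags saying what each
# module is for.

def module_1(observation, state):          # updates quickly with observations
    return state + [observation]

def module_2(observation, state):          # barely changes with observations
    return state

def module_3(state_1, state_2):            # produces the output
    return "duck" if "threat" in state_1[-1] else "rest"

def run_agent(observation, state_1, state_2):
    new_1 = module_1(observation, state_1)
    new_2 = module_2(observation, state_2)
    return module_3(new_1, new_2), new_1, new_2

# The tags turn the white box above into a structured white box:
tags = {"module_1": "beliefs", "module_2": "preferences", "module_3": "action selector"}

print(run_agent("threat ahead", [], ["prefers safety"])[0])
```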

Levels of access

We can have different levels of access to the agent. For example, we might be able to run it inside any environment, but not pry it open; hence we know its full input-output behaviour. This would give us (full) black-box access to the agent (partial black-box access would be knowing some of its behaviour, but not in all situations).

Or we might be able to follow its internal structure, and hence know its algorithm. This gives us white-box access to the agent.

Or, finally, we might have a fully tagged and structured diagram of the whole agent. This gives us structured white-box access to the agent (the term is my own).

Things can be more complicated, of course. We could have access to only parts of the agent/structure/tags. Or we could have a mix of different types of access; grey-box seems to be the term for something between black-box and white-box.

Humans seem to have a mixture of black-box and structured white-box access to each other: we can observe each other’s behaviour, and we have our internal theory of mind that provides information like “if someone freezes up on a public speaking stage, they’re probably filled with fear”.

Access and knowledge

Complete access at one level gives complete knowledge at that level. So, if you have complete black-box access to the agent, you have complete black-box knowledge: you could, at least in principle, compute the entire input-output map just by running the agent.
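
For a toy illustration of that claim (the agent and the three-bit observation space are made up), full black-box access over a finite observation space lets us tabulate the whole policy:

```python
from itertools import product

# Sketch of "complete access gives complete knowledge": with full black-box
# access and a finite (toy) observation space, we can compute the entire
# input-output map just by running the agent on every input.

class ToyAgent:
    def act(self, obs: str) -> str:
        return "left" if obs.count("1") % 2 else "right"

def full_policy(agent, observation_space):
    return {obs: agent.act(obs) for obs in observation_space}

observation_space = ["".join(bits) for bits in product("01", repeat=3)]
print(full_policy(ToyAgent(), observation_space))   # complete black-box knowledge
```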

So the interesting theoretical challenges are those that involve having access at one level and trying to infer a higher level, or having partial access at one or multiple levels and trying to infer full knowledge.

Multiple white boxes for a single black box

Black-box and white-box identification have been studied fairly extensively in system identification. One fact remains true: there are multiple white-box interpretations of the same black-box access.

We can have “angels pushing particles so as to resemble general relativity” situations. We can add useless epicycles that do nothing to the white-box model; this gives us a more complicated white box with identical black-box behaviour. Or we could have the matrix mechanics vs wave mechanics situation in quantum mechanics, where two very different formulations were shown to be equivalent.
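
Here is a toy version of the same point (not taken from the figures above): the two agents below are very different white boxes, yet have identical black-box behaviour.

```python
# Two very different white boxes with identical black-box behaviour.
# The second carries a useless "epicycle": state that is updated on every
# call but never affects the output.

class PlainAgent:
    def act(self, obs: str) -> str:
        return "duck" if "threat" in obs else "rest"

class EpicycleAgent:
    def __init__(self):
        self._epicycle = 0            # updated on every call, never used

    def act(self, obs: str) -> str:
        self._epicycle = (self._epicycle + len(obs)) % 7
        return "duck" if obs.count("threat") > 0 else "rest"

for obs in ["threat ahead", "quiet room", "a threat looms"]:
    assert PlainAgent().act(obs) == EpicycleAgent().act(obs)
print("identical black-box behaviour on these inputs")
```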

There are multiple ways of choosing among equivalent white-box models. In system identification, the criterion seems to be “go with what works”: the model is to be identified for a specific purpose (for example, to enable control of a system), and that purpose gives criteria that will select the right kind of model. For example, linear regression will work in many rough-and-ready circumstances, while it would be stupid to use it for calibrating sensitive particle detectors when much better models are available. Different problems have different trade-offs.

Another approach is the so-called “grey-box” approach, where a class of models is selected in advance, and this class is updated with the black-box data. Here the investigator is making “modelling assumptions” that cut down on the space of possible white-box models to consider.

Finally, in this community and among some philosophers, algorithmic simplicity is seen as a good and principled way of deciding between equivalent white-box models.
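
A very rough sketch of how the grey-box and simplicity ideas might combine in practice (the data, the polynomial model class, and the penalty are all arbitrary illustrative choices, not anything from system identification practice): commit to a model class in advance, fit it to black-box data, and break ties between near-equivalent fits with a simplicity penalty.

```python
import numpy as np

# Rough sketch of the grey-box idea: commit to a model class in advance
# (here, polynomials of bounded degree), fit it to black-box input-output
# data, and break ties between equally good fits with a simplicity penalty.

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2 * x + 0.1 * rng.normal(size=x.shape)   # black-box observations

best = None
for degree in range(1, 6):                   # the pre-selected "grey-box" class
    coeffs = np.polyfit(x, y, degree)
    fit_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    simplicity_penalty = 0.01 * degree       # crude stand-in for algorithmic simplicity
    score = fit_error + simplicity_penalty
    if best is None or score < best[0]:
        best = (score, degree, coeffs)

print("selected degree:", best[1])
```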

Multiple structures and tags for one white-box

A similar issue happens again at a higher level: there are multiple ways of assigning tags to the same white-box system. Take the model in figure 4, and erase all the tags (hence giving us figure 3). Now reassign those tags; there are multiple ways we could tag the modules, and still have the same structure as figure 4:

Fig. 5

We might object, at this point, insisting that tags like “beliefs” and “preferences” be assigned to modules for a reason, not just because the structure is correct. But having a good reason to assign those tags is precisely the challenge.

We’ll look more into that issue in future sections, but here I should point out that if we consider the tags as purely syntactic, then we can assign any tag to anything:

Fig. 6

What’s “Tuna”? Whatever we want it to be.

And since we haven’t defined the modules or said anything about their size and roles, we can decompose the interior of the modules and assign tags in completely different ways:

Fig. 7

Normative assumptions, tags, and structural assumptions

We need to do better than that. The paper “Occam’s razor is insufficient to infer the preferences of irrational agents” talked about “normative assumptions”: assumptions about the values (or the biases) of the agent.

In this more general setting, I’ll refer to them as “structural assumptions”, as they can refer to beliefs, or other features of the internal structure and tags of the agent.

Almost trivial structural assumptions

These structural assumptions can be almost trivial; for example, saying “beliefs and preferences update from knowledge, and update the action selector” is enough to rule out figures 6 and 7. This is equivalent to starting with figure 4, erasing the tags, and wanting to reassign tags to the algorithm while ensuring the graph is isomorphic to figure 4. Hence we have a “desired graph” that we want to fit our algorithm into.
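
Here is a sketch of what that “desired graph” constraint amounts to (the node names and toy graph are invented): erase the tags on the white-box graph, then keep only the tag reassignments under which the tagged graph matches the desired one.

```python
from itertools import permutations

# Untagged white-box graph: anonymous modules m1..m3 plus input and output.
white_box_edges = {("obs", "m1"), ("obs", "m2"),
                   ("m1", "m3"), ("m2", "m3"), ("m3", "act")}

# Desired tagged graph (in the spirit of figure 4).
desired_edges = {("obs", "beliefs"), ("obs", "preferences"),
                 ("beliefs", "action selector"), ("preferences", "action selector"),
                 ("action selector", "act")}

tags = ["beliefs", "preferences", "action selector"]
valid_taggings = []
for perm in permutations(tags):
    tagging = dict(zip(["m1", "m2", "m3"], perm))
    relabelled = {(tagging.get(a, a), tagging.get(b, b)) for a, b in white_box_edges}
    if relabelled == desired_edges:
        valid_taggings.append(tagging)

print(valid_taggings)
# Two taggings survive: "beliefs" and "preferences" can still be swapped,
# which is exactly the problem figure 5 illustrates.
```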

What the Occam’s razor paper shows is that we can’t get good results from “desired graph + simplicity assumptions”. This is unlike the black-box to white-box transition, where simplicity assumptions are very effective on their own.

Figure 5 demonstrated that above: the beliefs and preferences modules can be tagged as each other, and we still get the same desired graph. Even worse, since we still haven’t specified anything about the size of these modules, the following tag assignment is also possible. Here, the belief and preference “modules” have been reduced to mere conduits that pass on the information to the action selector, which has expanded to gobble up all the rest of the agent.

Fig. 8

Note that this decomposition is simpler than a “reasonable” version of figure 4, since the boundaries between the three modules don’t need to be specified. Hence algorithmic simplicity will tend to select these degenerate structures more often. This is almost exactly the “indifferent planner” of the Occam’s razor paper, one of the three simple degenerate structures. The other two, the greedy and anti-greedy planners, are situations where the “Preferences” module has expanded to full size, with the action selector reduced to a small appendage.
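
A toy version of this degenerate decomposition (the module contents are invented): the “beliefs” and “preferences” modules are bare pass-throughs, the action selector does all the work, and the wiring still matches the desired graph.

```python
# Degenerate decomposition: "beliefs" and "preferences" are mere conduits
# that forward the observation unchanged, while the action selector contains
# the entire agent. The wiring still matches the desired graph, and there
# are fewer module boundaries to specify.

def beliefs(obs):
    return obs          # pass-through "module"

def preferences(obs):
    return obs          # pass-through "module"

def action_selector(belief_out, preference_out):
    obs = belief_out    # all of the real work happens here
    return "duck" if "threat" in obs else "rest"

def agent(obs):
    return action_selector(beliefs(obs), preferences(obs))

print(agent("threat ahead"))
```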

Adding semantics or “thick” concepts

To avoid those problems, we need to flesh out the concepts of “beliefs”, “preferences”[1], and so on. The more structural assumptions we put on these concepts, the more we can avoid degenerate structured white-box solutions[2].

So we want something closer to our understanding of preferences and beliefs. For example, preferences are supposed to change much more slowly than beliefs. So the impact of observations on the preferences module (in an information-theoretic sense, maybe) would be much lower than on the beliefs module, or at least much slower. Adding that as a structural assumption cuts down on the number of possible structured white-box solutions.
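
One way such an assumption might be operationalised (a sketch only; the drift measure, the state traces, and the threshold are arbitrary illustrative choices): track how much each candidate module’s state changes per observation, and reject decompositions whose “preferences” module drifts as fast as its “beliefs” module.

```python
# Sketch of the "preferences change slowly" structural assumption.

def state_drift(states):
    """Average per-step change in a module's (numeric) state trace."""
    return sum(abs(b - a) for a, b in zip(states, states[1:])) / max(len(states) - 1, 1)

def respects_slow_preferences(belief_states, preference_states, ratio=0.1):
    # Require the preference module to drift much more slowly than beliefs.
    return state_drift(preference_states) <= ratio * state_drift(belief_states)

# Candidate decomposition A: preferences nearly static -> accepted.
print(respects_slow_preferences([0, 3, 1, 4, 2], [1.0, 1.0, 1.01, 1.0, 1.0]))
# Candidate decomposition B: "preferences" jump around as much as beliefs -> rejected.
print(respects_slow_preferences([0, 3, 1, 4, 2], [0, 5, -2, 4, 1]))
```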

And if we are dealing with humans, trying to figure out their preferences (which is my grand project at this time), then we can add a lot of other structural assumptions: “situation X is one that updates preferences”; “this behaviour shows a bias”; “sudden updates in preferences are accompanied by large personal crises”; “red faces and shouting denote anger”; and so on.

Basically any judgement we can make about human preferences can be used, if added explicitly, to restrict the space of possible structured white-box solutions. But these need to be added in explicitly at some level, not just deduced from observations (i.e. supervised, not unsupervised learning), since observations can only get you as far as white-box knowledge.
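
Purely as an illustration of “added in explicitly” (the assumptions, field names, and candidate interpretation below are all invented), such judgements could be supplied as labelled checks that any candidate structured white-box interpretation must pass:

```python
# Structural assumptions supplied as explicit, labelled checks on a candidate
# interpretation of the agent, rather than deduced from observations alone.

structural_assumptions = [
    ("preferences update slowly",
     lambda interp: interp["preference_drift"] < 0.1 * interp["belief_drift"]),
    ("this behaviour shows a bias",
     lambda interp: "temporal discounting" in interp["biases"]),
    ("red faces and shouting denote anger",
     lambda interp: interp["tags"].get("red face + shouting") == "anger"),
]

def consistent(interpretation):
    return all(check(interpretation) for _, check in structural_assumptions)

candidate = {
    "preference_drift": 0.01,
    "belief_drift": 1.0,
    "biases": ["temporal discounting"],
    "tags": {"red face + shouting": "anger"},
}
print(consistent(candidate))  # True: this candidate survives the explicit checks
```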

Note the similarity with semantically thick concepts and with my own post on getting semantics empirically. Basically, we want an understanding of “preferences” that is so rich that only something that is clearly a “preference” can fit the model.

In the optimistic scenario, a few such structural assumptions are enough to enable an algorithm to quickly grasp human theory of mind, sort our brain into plausible modules, and hence isolate our preferences. In the pessimistic scenario, theory of mind, preferences, beliefs, and biases are all so twisted together that even extensive examples are not enough to decompose them. See more in this post.


  1. We might object to the arrow from observations to “preferences”: preferences are not supposed to change, at least for ideal agents. But many agents are far from ideal (including humans); we don’t want the whole method to fail because there was a stray bit of code or neuron going in one direction, or because two modules reused the same code or the same memory space. ↩︎

  2. Note that I don’t draw a rigid distinction between syntax and semantics/meaning/“ground truth”. As we accumulate more and more syntactic restrictions, the number of plausible semantic structures plunges. ↩︎