Model splintering: moving from one imperfect model to another

1. The big problem

In the last few months, I’ve become convinced that there is a key meta-issue in AI safety: a problem that seems to come up in all sorts of areas.

It’s hard to summarise, but my best phrasing would be:

  • Many problems in AI safety seem to be variations of "this approach seems safe in this imperfect model, but when we generalise the model more, it becomes dangerously underdefined". Call this model splintering.

  • It is intrinsically worth studying how to (safely) transition from one imperfect model to another. This is worth doing independently of whatever "perfect" or "ideal" model might be in the background of the imperfect models.

This sprawling post will present examples of model splintering, arguments for its importance, a formal setting allowing us to talk about it, and some uses we can put this setting to.

1.1 In the language of traditional ML

In the language of traditional ML, we could connect all these issues to "out-of-distribution" behaviour. These are the problems that algorithms encounter when the data they are operating on are drawn from a different distribution than the training set they were trained on.

Humans can often see that the algorithm is out-of-distribution and correct it, because we have a more general distribution in mind than the one the algorithm was trained on.

In these terms, the issues of this post can be phrased as:

  1. When the AI finds itself mildly out-of-distribution, how best can it extend its prior knowledge to the new situation?

  2. What should the AI do if it finds itself strongly out-of-distribution?

  3. What should the AI do if it finds itself strongly out-of-distribution, and humans don’t know the correct distribution either?

1.2 Model splintering examples

Let’s build a more general framework. Say that you start with some brilliant idea for AI safety/alignment/effectiveness. This idea is phrased in some (imperfect) model. Then "model splintering" happens when you or the AI move to a new (also imperfect) model, such that the brilliant idea is undermined or underdefined.

Here are a few examples:

  • You design an AI CEO as a money maximiser. Given typical assumptions about the human world (legal systems, difficulties in one person achieving massive power, human fallibilities), this results in an AI that behaves like a human CEO. But when those assumptions fail, the AI can end up feeding the universe to a money-making process that produces nothing of any value.

  • Eliezer defined "rubes" as smooth red cubes containing palladium that don’t glow in the dark. "Bleggs", on the other hand, are furred blue eggs containing vanadium that glow in the dark. To classify these, we only need a model with two features, "rubes" and "bleggs". Then along comes a furred red egg containing vanadium that doesn’t glow in the dark. The previous model doesn’t know what to do with it, and even if you get a model with more features, it’s unclear what to do with this new object.

  • Here are some moral principles from history: honour is important for anyone. Women should be protected. Increasing happiness is important. These moral principles made sense in the world in which they were articulated, where features like "honour", "gender", and "happiness" were relatively clear and unambiguous. But the world changed, and the models splintered. "Honour" became hopelessly confused centuries ago. Gender is currently finishing its long splintering (long before we got to today, gender started becoming less useful for classifying people, hence the consequences of gender splintered a long time before gender itself did). Happiness, or at least hedonic happiness, is still well defined, but we can clearly see how this is going to splinter when we talk about worlds of uploads or brain modification.

  • Many transitions in the laws of physics—from the ideal gas laws to the more advanced van der Waals equations, or from Newtonian physics to general relativity to quantum gravity—will cause splintering if preferences were articulated in concepts that don’t carry over well.

1.3 Avoiding perfect models

In all those cases, there are ways of improving the transition, without needing to go via some idealised, perfect model. We want to define the AI CEO’s task in more generality, but we don’t need to define this across every possible universe—that is not needed to restrain its behaviour. We need to distinguish any blegg from any rube we are likely to encounter; we don’t need to define the platonic essence of "bleggness". For future splinterings—when hedonic happiness splinters, when we get a model of quantum gravity, and so on—we want to know what to do then and there, even if there are further splinterings subsequent to those.

And I think that model splintering is best addressed directly, rather than using methods that go via some idealised perfect model. Most approaches seem to go for approximating an ideal: from AIXI’s set of all programs, the universal prior, KWIK ("Knowing What It Knows") learning with a full hypothesis class, and Active Inverse Reward Design with its full space of "true" reward functions, to Q-learning, which assumes any Markov decision process is possible. Then the practical approaches rely on approximating this ideal.

Schematically, we can see M* as the ideal, M*_t as that ideal updated with information up to time t, and M_t as an approximation of M*_t. Then we tend to focus on how well M_t approximates M*_t, and on how M*_t changes to M*_{t+1}—rather than on how M_t relates to M_{t+1}; it is this last transition that is under-analysed.

2 Why focus on the transition?

But why is focusing on the transition important?

2.1 Humans reason like this

A lot has been written about image recognition programs going "out-of-distribution" (encountering situations beyond their training environment) or succumbing to "adversarial examples" (examples from one category that have the features of another). Indeed, some people have shown how to use labelled adversarial examples to improve image recognition.

You know what this reminds me of? Human moral reasoning. At various points in our lives, we humans seem to have pretty solid moral intuitions about how the world should be. And then we typically learn more, realise that things don’t fit in the categories we were used to (go "out-of-distribution"), and have to update. Some people push stories at us that exploit some of our emotions in new, more ambiguous circumstances ("adversarial examples"). And philosophers use similarly-designed thought experiments to open up and clarify our moral intuitions.

Basically, we start with strong moral intuitions on under-defined features, and when the features splinter, we have to figure out what to do with our previous moral intuitions. A lot of developing moral meta-intuitions is about learning how to navigate these kinds of transitions; AIs need to be able to do so too.

2.2 There are no well-defined overarching moral principles

Moral realists and moral non-realists agree more than you’d think. In our current situation, they can agree on one thing: there is no well-described system of morality that can be "simply" implemented in an AI.

To over-simplify, moral realists hope to discover this moral system, while moral non-realists hope to construct one. But, currently, it doesn’t exist in an implementable form, nor is there any implementable algorithm to discover/construct it. So the whole idea of approximating an ideal is wrong.

All humans seem to start from a partial list of moral rules of thumb, rules that they then have to extend to new situations. And most humans do seem to have some meta-rules for defining moral improvements, or extensions to new situations.

We don’t know perfection, but we do know improvements and extensions. So methods that deal explicitly with those are useful. Those are things we can build on.

2.3 It helps distinguish areas where AIs fail from areas where humans are uncertain

Sometimes the AI goes out-of-distribution, and humans can see the error (no, flipping the lego block doesn’t count as putting it on top of the other one). There are also cases when humans themselves go out-of-distribution (see for example siren worlds).

It’s useful to have methods available for both AIs and humans in these situations, and to distinguish them. "Genuine human preferences, not expressed in sufficient detail" is not the same as "human preferences fundamentally underdefined".

In the first case, the AI needs more human feedback; in the second case, it needs to figure out a way of resolving the ambiguity, knowing that soliciting feedback is not enough.

2.4 We don’t need to make the problems harder

Suppose that quantum mechanics is the true underlying physics of the universe, with some added bits to include gravity. If that’s true, why would we need a moral theory valid in every possible universe? It would be useful to have one, but it would be strictly harder to get than a theory valid in the actual universe.

Also, some problems might be entirely avoided. We don’t need to figure out the morality of dealing with a willing slave race—if we never encounter or build one in the first place.

So a few degrees of "extend this moral model in a reasonable way" might be sufficient, without needing to solve the whole problem. Or, at least, without needing to solve the whole problem in advance—a successful nanny AI might be built on these kinds of extensions.

2.5 We don’t know how deep the rabbit hole goes

In a sort of converse to the previous point, what if the laws of physics are radically different from what we thought—what if, for example, they allow some forms of time-travel, or have some narrative features, or, more simply, what if the agent moves to an embedded agency model? What if hypercomputation is possible?

It’s easy to have an idealised version of "all reality" that doesn’t allow for these possibilities, so the ideal can be too restrictive, rather than too general. But model splintering methods might still work, since they deal with transitions, not ideals.

Note that, in retrospect, we can always put this in a Bayesian framework, once we have a rich enough set of environments and update rules. But this is misleading: the key issue is the missing feature, and figuring out what to do with the missing feature is the real challenge. The fact that we could have done this in a Bayesian way if we had already known that feature is not relevant here.

2.6 We often only need to solve partial problems

Assume the blegg and rube classifier is an industrial robot performing a task. If humans filter out any atypical bleggs and rubes before it sees them, then the robot has no need for a full theory of bleggness/rubeness.

But what if the human filtering is not perfect? Then the classifier still doesn’t need a full theory of bleggness/rubeness; it needs methods for dealing with the ambiguities it actually encounters.

Some ideas for AI control—low impact, AI-as-service, Oracles, …—may require dealing with some model splintering, some ambiguity, but not the whole amount.

2.7 It points out when to be conservative

Some methods, like quantilizers or the pessimism approach, rely on the algorithm having a certain degree of conservatism. But, as I’ve argued, it’s not clear to what extent these methods actually are conservative, nor is it easy to calibrate them in a useful way.

Model splintering situations provide excellent points at which to be conservative. Or, for algorithms that need human feedback, but not constantly, these are excellent points at which to ask for that feedback.

2.8 Difficulty in capturing splintering from the idealised perspective

Generally speaking, idealised methods can’t capture model splintering at the point where we would want them to. Imagine an ontological crisis, as we move from classical physics to quantum mechanics.

AIXI can handle the transition fine: it shifts from a Turing machine mimicking classical physics observations to one mimicking quantum observations. But it doesn’t notice anything special about the transition: changing the probability of various Turing machines is what it does with observations in general; there’s nothing in its algorithm that shows that something unusual has occurred for this particular shift.

2.9 It may help amplification and distillation

This could be seen as a sub-point of some of the previous sections, but it deserves to be flagged explicitly, since iterated amplification and distillation is one of the major potential routes to AI safety.

To quote a line from that summary post:

  1. The proposed AI design is to use a safe but slow way of scaling up an AI’s capabilities, distill this into a faster but slightly weaker AI, which can be scaled up safely again, and to iterate the process until we have a fast and powerful AI.

At both "scaling up an AI’s capabilities" and "distill this into", we can ask the question: has the problem the AI is working on changed? The distillation step is more of a classical AI safety issue, as we wonder whether the distillation has caused any value drift. But at the scaling up or amplification step, we can ask: since the AI’s capabilities have changed, the set of possible environments it operates in has changed as well. Has this caused a splintering where the previously safe goals of the AI have become dangerous?

Detecting and dealing with such a splintering could both be useful tools to add to this method.

2.10 Examples of model splintering problems/approaches

At a meta level, most problems in AI safety seem to be variants of model splintering.

Almost every recent post I’ve read in AI safety, I’ve been able to connect back to this central idea. Now, we have to be cautious—cure-alls cure nothing, after all, so it’s not necessarily a positive sign that everything seems to fit into this framework.

Still, I think it’s worth diving into this, especially as I’ve come up with a framework that seems promising for actually solving this issue in many cases.

In a similar concept-space is Abram’s orthodox case against utility functions, where he talks about the Jeffrey-Bolker axioms, which allow the construction of preferences from events without needing full worlds at all.

3 The virtues of formalisms

This post is dedicated to explicitly modelling the transition to ambiguity, and then showing what we can gain from this explicit meta-modelling. It will do so with some formal language (made fully formal in the accompanying technical post), and a lot of examples.

Just as Scott argues that if it’s worth doing, it’s worth doing with made-up statistics, I’d argue that if an idea is worth pursuing, it’s worth pursuing with an attempted formalism.

Formalisms are great at illustrating the problems, clarifying ideas, and making us familiar with the intricacies of the overall concept. That’s the reason that this post (and the accompanying technical post) will attempt to make the formalism reasonably rigorous. I’ve learnt a lot in the process of formalisation.

3.1 A model, in (almost) all generality

What do we mean by a model? Do we mean mathematical model theory? Are we talking about causal models, or causal graphs? AIXI uses a distribution over possible Turing machines, whereas Markov Decision Processes (MDPs) see states and actions updating stochastically, independently at each time-step. Unlike the previous two, Newtonian mechanics doesn’t use time-steps but continuous time, while general relativity weaves time into the structure of space itself.

And what does it mean for a model to make "predictions"? AIXI and MDPs make predictions over future observations, and causal graphs are similar. We can also try running them in reverse, "predicting" past observations from current ones. Mathematical model theory talks about properties and the existence or non-existence of certain objects. Ideal gas laws make a "prediction" of certain properties (e.g. temperature) given certain others (e.g. volume, pressure, amount of substance). General relativity establishes that the structure of space-time must obey certain constraints.

It seems tricky to include all these models under the same meta-model formalism, but it would be good to do so. That’s because of the risk of ontological crises: we want the AI to be able to continue functioning even if the initial model we gave it was incomplete or incorrect.

3.2 Meta-model: models, features, environments, probabilities

All of the models mentioned above share one common characteristic: once you know some facts, you can deduce some other facts (at least probabilistically). A prediction of the next time step, a retrodiction of the past, a deduction of some properties from others, or a constraint on the shape of the universe: all of these say that if we know some things, then this puts constraints on some other things.

So let’s define F, informally, as the set of features of a model. These could be the gas pressure in a room, a set of past observations, the local curvature of space-time, the momentum of a particle, and so on.

So we can define a prediction as a probability distribution Q over a set of possible features F_1, given a base set of features F_2:

Q(F_1 | F_2).

Do we need anything else? Yes, we need a set of possible environments for which the model is (somewhat) valid. Newtonian physics fails at extreme energies, speeds, or gravitational fields; we’d like to include this "domain of validity" in the model definition. This will be very useful for extending models, or transitioning from one model to another.

You might be tempted to define a set of "worlds" on which the model is valid. But we’re trying to avoid that, as the "worlds" may not be very useful for understanding the model. Moreover, we don’t have special access to the underlying reality; so we never know whether there actually is a Turing machine behind the world or not.

So define E, the set of environments on which the model is valid, as a set of possible feature values. So if we want to talk about Newtonian mechanics, F would be a set of Newtonian features (mass, velocity, distance, time, angular momentum, and so on) and E would be the set of these values where relativistic and quantum effects make little difference.

So see a model M as

M = (F, E, Q),

for F a set of features, E a set of environments, and Q a probability distribution. This is such that, for subsets E_1, E_2 ⊆ E, we have the conditional probability:

Q(E_1 | E_2).

Though Q is defined for all of E, we generally want it to be usable from small subsets of the features: so the subsets E_i should be simple to define from F. And we’ll often define the subsets in similar ways; so E_1 might be all environments with a certain angular momentum at one time, while E_2 might be all environments with a certain angular momentum at a later time.

The full formal definition of these can be found here. The idea is to have a meta-model of modelling that is sufficiently general to apply to almost all models, but not one that relies on some ideal or perfect formalism.

3.3 Bayesian models within this meta-model

It’s very easy to include Bayesian models within this formalism. If we have a Bayesian model that includes a set of worlds W with prior P, then we merely have to define a set of features F that is sufficient to distinguish all worlds in W: each world is uniquely defined by its feature values[1]. Then we can define E as W, and P on W becomes Q on E; the definition of terms like Q(E_1 | E_2) is just P(E_1 ∩ E_2)/P(E_2), per Bayes’ rule (unless P(E_2) = 0, in which case we set that conditional probability to 0).
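
To make this concrete, here is a minimal sketch in Python (the toy worlds, feature names and function names are my own illustrative assumptions, not anything from the formal post) of wrapping a small Bayesian model (a set of worlds with a prior) as a (features, environments, probability) triple:

```python
# A toy Bayesian model: worlds with a prior, wrapped as (F, E, Q).
# All names and the example worlds are illustrative assumptions.

# Features sufficient to distinguish the worlds: each world is a value assignment.
FEATURES = ["colour", "texture"]
WORLDS = {  # world -> prior probability P
    ("red", "smooth"): 0.45,
    ("blue", "furred"): 0.45,
    ("red", "furred"): 0.05,
    ("blue", "smooth"): 0.05,
}

# E is just the set of worlds; Q is conditional probability via Bayes' rule.
E = set(WORLDS)

def Q(E1, E2):
    """Q(E1 | E2) = P(E1 ∩ E2) / P(E2), or 0 if P(E2) = 0."""
    p_e2 = sum(WORLDS[w] for w in E2 if w in WORLDS)
    if p_e2 == 0:
        return 0.0
    p_both = sum(WORLDS[w] for w in E1 & E2 if w in WORLDS)
    return p_both / p_e2

# Example: probability of "smooth" given "red".
red = {w for w in E if w[0] == "red"}
smooth = {w for w in E if w[1] == "smooth"}
print(Q(smooth, red))  # 0.9
```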

4 Model refinement and splinterings

This section will look at what we can do with the previous meta-model, looking at refinement (how models can improve) and splintering (how improvements to the model can make some well-defined concepts less well-defined).

4.1 Model refinement

Informally, M_1 is a refinement of model M_0 if it’s at least as expressive as M_0 (it covers the same environments) and is better according to some criteria (simpler, or more accurate in practice, or some other measurement).

At the technical level, we have a map q from a subset E' ⊆ E_1 that is surjective onto E_0. This covers the "at least as expressive" part: every environment in E_0 exists as (possibly multiple) environments in E_1.

Then note that, using q as a map from subsets of E_1 to subsets of E_0, we can define q(Q_1) on E_0 via:

q(Q_1)(A | B) = Q_1(q^{-1}(A) | q^{-1}(B)), for subsets A, B ⊆ E_0.

Then this is a model refinement if q(Q_1) is ‘at least as good as’ Q_0 on E_0, according to our criteria[2].
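
To make the definition concrete, here is a small sketch (my own illustrative code, with a toy two-feature example) of the two ingredients: checking that q maps a subset of E_1 surjectively onto E_0, and pushing Q_1 forward through q:

```python
# Sketch: check that q maps a subset of E1 surjectively onto E0, and build
# the pushforward distribution q(Q1) on E0. Illustrative code only.

def is_surjective(q, E_prime, E0):
    """Every environment in E0 must be hit by some e in E_prime ⊆ E1."""
    return {q(e) for e in E_prime} == set(E0)

def pushforward(Q1, q, E_prime):
    """q(Q1)(e0) = sum of Q1 over the preimage q^{-1}(e0)."""
    out = {}
    for e in E_prime:
        out[q(e)] = out.get(q(e), 0.0) + Q1[e]
    return out

# Toy example: E1 adds a second feature that E0 ignores.
E0 = {("red",), ("blue",)}
Q1 = {("red", "dark"): 0.5, ("blue", "glows"): 0.45, ("blue", "dark"): 0.05}
E_prime = set(Q1)
q = lambda e: (e[0],)   # forget the second feature

assert is_surjective(q, E_prime, E0)
print(pushforward(Q1, q, E_prime))  # ('red',) -> 0.5, ('blue',) -> 0.5
```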

4.2 Example of model refinement: gas laws

The technical post presents some subclasses of model refinement, including Q-improvements (same features, same environments, just a better Q), and adding new features to a basic model, called "non-independent feature extension" (e.g. adding classical electromagnetism to Newtonian mechanics).

Here’s a specific gas law illustration. Let M_0 be a model of an ideal gas, in some set of rooms and tubes. The feature set F_0 consists of pressure, volume, temperature, and amount of substance, and Q_0 is the ideal gas law. The environment set E_0 is standard conditions for temperature and pressure, where the ideal gas law applies. There are multiple different types of gases in the world, but they all roughly obey the same law.

Then compare with model M_1. Its feature set F_1 has all the features of F_0, but also includes the molar volume: the volume occupied by one mole of the molecules of the given substance. This allows Q_1 to express the more complicated van der Waals equations, which are different for different types of gases. The environment set E_1 can now track situations where there are gases with different molar volumes, which include situations where the van der Waals equations differ significantly from the ideal gas law.

In this case E' = E_1, since we now distinguish environments that we previously considered identical (environments with the same features except for their molar volumes). The map q just projects down by forgetting the molar volume. Then, since q(Q_1) (the van der Waals equations averaged over the distribution of molar volumes) is at least as accurate as Q_0 (the ideal gas law), this is a refinement.
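
For a numerical feel of the refinement (a rough sketch of my own; the van der Waals constants below are approximate textbook values for CO2), compare the two models' pressure predictions for one mole of gas:

```python
# Compare ideal-gas and van der Waals pressure predictions for 1 mol of CO2.
# Constants are approximate textbook values; the example is illustrative only.
R = 8.314          # J / (mol K)
a = 0.364          # Pa m^6 / mol^2, CO2 (approx.)
b = 4.27e-5        # m^3 / mol, CO2 (approx.)

def p_ideal(n, T, V):
    return n * R * T / V

def p_vdw(n, T, V):
    return n * R * T / (V - n * b) - a * n**2 / V**2

n, T = 1.0, 300.0
for V in [0.0248, 0.0025, 0.00025]:   # m^3: dilute, compressed, very compressed
    print(V, round(p_ideal(n, T, V)), round(p_vdw(n, T, V)))
# The models agree in the dilute regime (M_0's domain of validity E_0)
# and diverge as the gas is compressed (where only M_1 remains accurate).
```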

4.3 Example of model refinement: rubes and bleggs

Let’s reuse Eliezer’s example of rubes ("red cubes") and bleggs ("blue eggs").

Bleggs are blue eggs that glow in the dark, have a furred surface, and are filled with vanadium. Rubes, in contrast, are red cubes that don’t glow in the dark, have a smooth surface, and are filled with palladium.

Define M_0 by having F_0 = {red/blue, smooth/furred}; E_0 is the set of all bleggs and rubes in some situation, and Q_0 is relatively trivial: it predicts that an object is red/blue if and only if it is smooth/furred.

Define M_1 as a refinement of M_0, by expanding F_0 to F_1 = {red/blue, smooth/furred, cube/egg, dark/glows}. The projection q is given by forgetting about those two last features. The distribution Q_1 is more detailed, as it now connects red-smooth-cube-dark together, and similarly blue-furred-egg-glows.

Note that E_1 is larger than E_0, because it includes, e.g., environments where the cube objects are blue. However, all these extra environments have probability zero.

4.4 Reward function refactoring

Let R_0 be a reward function on M_0 (by which we mean that R_0 is defined on F_0, the set of features in M_0), and M_1 a refinement of M_0.

A refactoring of R_0 for M_1 is a reward function R_1 on the features F_1 such that, for any e ∈ E', R_1(e) = R_0(q(e)).

For example, let M_0 and M_1 be the rube/blegg models from the previous section. Let R_0 on M_0 simply count the number of rubes—or, more precisely, count the number of objects to which the feature "red" applies.

Let R^1_red be the reward function that counts the number of objects in E_1 to which "red" applies. It’s clearly a refactoring of R_0.

But so is R^1_smooth, the reward function that counts the number of objects in E_1 to which "smooth" applies. In fact, the following is a refactoring of R_0, for all λ:

λ·R^1_red + (1 − λ)·R^1_smooth.

There are also some non-linear combinations of these features that refactor R_0, and many other variants (like the strange combinations that generate concepts like grue and bleen).
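
Here is a small sketch (illustrative code, with my own toy encoding of the objects) of why every such combination is a refactoring: they all agree with R_0 on environments containing only full rubes and full bleggs, i.e. on everything the training environment gives non-zero probability to, even though they come apart on odd objects:

```python
# Rewards as functions of an environment = list of objects (dicts of features).
# Illustrative encoding; the point is that many rewards agree on the training set.
RUBE  = {"colour": "red",  "texture": "smooth", "shape": "cube", "glows": False}
BLEGG = {"colour": "blue", "texture": "furred", "shape": "egg",  "glows": True}

def R_red(env):    return sum(o["colour"] == "red" for o in env)
def R_smooth(env): return sum(o["texture"] == "smooth" for o in env)

def R_mix(lam):
    """lam * R_red + (1 - lam) * R_smooth."""
    return lambda env: lam * R_red(env) + (1 - lam) * R_smooth(env)

training_env = [RUBE] * 3 + [BLEGG] * 5
for lam in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    assert R_mix(lam)(training_env) == R_red(training_env) == 3

# But on an off-distribution object, the combinations come apart:
odd = {"colour": "red", "texture": "furred", "shape": "cube", "glows": False}
print([R_mix(lam)([odd]) for lam in [0.0, 0.5, 1.0]])  # [0.0, 0.5, 1.0]
```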

4.5 Reward function splintering

Model splintering, in the informal sense, is what happens when we pass to a new model in such a way that the old features (or a reward function defined from the old features) no longer apply. It is similar to the web of connotations breaking down, an agent going out of distribution, or the definitions of rube and blegg falling apart.

  • Preliminary definition: If M_1 is a refinement of M_0 and R_0 a reward function on M_0, then M_1 splinters R_0 if there are multiple refactorings of R_0 on M_1 that disagree on elements of E_1 of non-zero probability.

So, note that in the rube/blegg example, M_1 does not splinter R_0: all the refactorings are the same on all bleggs and rubes—hence on all elements of E_1 of non-zero probability.

We can even generalise this a bit. Let’s assume that "red" and "blue" are not totally uniform; there exist some rubes that are "reddish-purple", while some bleggs are "blueish-purple". Then let M_2 be like M_1, except that the colour feature can have four values: "red", "reddish-purple", "blueish-purple", and "blue".

Then, as long as the rubes (defined, in this instance, by being smooth-dark-cubes) are either "red" or "reddish-purple", and the bleggs are "blue" or "blueish-purple", all refactorings of R_0 to M_2 agree—because, on the test environment, "red or reddish-purple" in M_2 perfectly matches up with "red" in M_1.

So adding more features does not always cause splintering.
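
The preliminary definition can be written as a direct check (my own illustrative code): the refinement splinters the reward exactly when two refactorings disagree on some environment of non-zero probability:

```python
# Preliminary-definition check: do any two refactorings disagree on an
# environment of non-zero probability? Illustrative code only.

def splinters(refactorings, env_probs):
    """refactorings: list of functions env -> reward.
    env_probs: dict mapping environments to probabilities under Q1."""
    for env, p in env_probs.items():
        if p == 0:
            continue
        values = {round(r(env), 9) for r in refactorings}
        if len(values) > 1:
            return True   # two refactorings disagree where it matters
    return False

# With only pure rubes/bleggs having non-zero probability, counting "red" and
# counting "smooth" never disagree, so there is no splintering:
envs = {
    (("red", "smooth"),): 0.5,       # a lone rube, encoded as (colour, texture)
    (("blue", "furred"),): 0.5,      # a lone blegg
    (("red", "furred"),): 0.0,       # the odd object: probability zero in M1
}
count_red    = lambda env: sum(o[0] == "red" for o in env)
count_smooth = lambda env: sum(o[1] == "smooth" for o in env)
print(splinters([count_red, count_smooth], envs))  # False
```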

4.6 Reward function splintering: "natural" refactorings

The preliminary definition runs into trouble when we add more objects to the environments. Define M'_1 as being the same as M_1, except that its environments contain one extra object, o; apart from that, the environments typically have a billion rubes and a trillion bleggs.

Suppose o is a "furred rube", i.e. a red-furred-dark-cube. Then R^1_red and R^1_smooth are two different refactorings of R_0 that obviously disagree on any environment that contains o. Even if the probability of o is tiny (but non-zero), M'_1 splinters R_0.

But things are worse than that. Suppose that o is fully a rube: red-smooth-cube-dark, and it even contains palladium. Define R'_red as counting the number of red objects, except for o specifically (again, this is similar to the grue and bleen arguments against induction).

Then both R^1_red and R'_red are refactorings of R_0, so M'_1 still splinters R_0, even when we just add another exact copy of the elements in the training set. Or even if we keep the training set for a few extra seconds, or add any change at all to the world.

So, for M_1 a refinement of M_0, and R_0 a reward function on M_0, let’s define "natural refactorings" of R_0:

  • The reward function R_1 is a natural refactoring of R_0 if it’s a reward function on M_1 with:

  1. R_1 ≈_ε R_0∘q on E', and

  2. R_1 can be defined simply from R_0 and features f_i of F_1,

  3. the f_i themselves are simply defined.

This leads to a full definition of splintering:

  • Full definition: If M_1 is a refinement of M_0 and R_0 a reward function on M_0, then M_1 splinters R_0 if 1) there are no natural refactorings of R_0 on M_1, or 2) there are multiple natural refactorings R_1 and R'_1 of R_0 on M_1 such that R_1 ≉_δ R'_1.

Notice the whole host of caveats and weaselly terms here: ≈_ε, "simply" (used twice), and ≉_δ. "Simply" might mean algorithmic simplicity, while ε and δ are measures of how much "error" we are willing to accept in these refactorings. Given that, we probably want to replace ε and δ with some measure of non-equality, so we can talk about the "degree of naturalness" or the "degree of splintering" of some refinement and reward function.
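
Once exact equality is replaced with a measure of disagreement, a "degree of splintering" could look something like the following sketch (the probability-weighted worst-case disagreement is my own choice of measure, just to illustrate the idea):

```python
# Sketch: degree of splintering = worst probability-weighted disagreement
# between any two candidate natural refactorings. My own illustrative measure.
from itertools import combinations

def degree_of_splintering(refactorings, env_probs):
    worst = 0.0
    for r1, r2 in combinations(refactorings, 2):
        gap = sum(p * abs(r1(e) - r2(e)) for e, p in env_probs.items())
        worst = max(worst, gap)
    return worst   # 0 means all candidates agree (no splintering)

# Counting rewards as in the earlier sketch: a small chance of an odd
# red-furred object produces a small, nonzero degree of splintering.
envs = {
    (("red", "smooth"),): 0.495,
    (("blue", "furred"),): 0.495,
    (("red", "furred"),): 0.01,
}
count_red    = lambda env: sum(o[0] == "red" for o in env)
count_smooth = lambda env: sum(o[1] == "smooth" for o in env)
print(degree_of_splintering([count_red, count_smooth], envs))  # 0.01
```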

Note also that:

  • Different choices of refinements can result in different natural refactorings.

An easy example: it makes a big difference whether a new feature is "temperature", or "divergence from standard temperatures".

4.7 Splintering training rewards

The concept of "reward refactoring" is transitive, but the concept of "natural reward refactoring" need not be.

For example, let E_0 be a training environment where red/blue correlates perfectly with cube/egg, and E_1 be a general environment where red/blue is independent of cube/egg. Let F_0 be a feature set with only red/blue, and F_1 a feature set with red/blue and cube/egg.

Then define M_00 as the model using F_0 in the training environment, and M_01 as the model using F_0 in the general environment; M_10 and M_11 are defined similarly, using F_1.

For these models, M_01 and M_10 are both refinements of M_00, while M_11 is a refinement of all three other models. Define R_00 as the "count red objects" reward on M_00. This has a natural refactoring to R_01 on M_01, which counts red objects in the general environment.

And R_01 has a natural refactoring to R_11 on M_11, which still just counts the red objects in the general environment.

But there is no natural refactoring from R_00 directly to M_11. That’s because, from M_11’s perspective, R_00 on M_00 might be counting red objects, or it might be counting cubes. This is not true for R_01 on M_01, which is clearly only counting red objects.

Thus when a reward function comes from a training environment, we’d want our AI to look for splinterings directly from a model of the training environment, rather than from previous natural refactorings.

4.8 Splintering features and models

We can also talk about splintering features and models themselves. For a feature f in F_0, the easiest way is to define a reward function R_{f,V} as the indicator function for feature f taking values in some set V.

Then a refinement M_1 splinters the feature f if it splinters some R_{f,V}.

The refinement M_1 splinters the model M_0 if it splinters at least one of its features.

For example, if M_0 is Newtonian mechanics, including "total rest mass", and M_1 is special relativity, then M_1 will splinter "total rest mass". Other examples of feature splintering will be presented in the rest of this post.

4.9 Preserved background features

A reward function developed in some training environment will ignore any feature that is always present or always absent in that environment. This allows very weird situations to come up, such as training an AI to distinguish happy humans from sad humans, and it ending up replacing humans with humanoid robots (after all, both happy and sad humans were equally non-robotic, so there’s no reason not to do this).

Let’s try to do better than that. Assume we have a model M_0, with a reward function R_0 defined on it (E_0 and R_0 can be seen as the training data).

Then the feature-preserving reward function R_{F_0,E_0} is a reward function that constrains the environments to have similar feature distributions to those of F_0 and E_0. There are many ways this could be defined; here’s one.

For an element e of E_1, just define R_{F_0,E_0}(e) as a measure of how probable e’s F_0-feature values were, under Q_0, in the training environment E_0.

Obviously, this can be improved; we might want to coarse-grain E_0, grouping together similar worlds, and possibly bound the reward below to avoid singularities.

Then we can use this to get the feature-preserving version of R_0, which we can define as

R_0^{FP} = R_0 + K·R_{F_0,E_0},

for K the maximal value of R_0 on E_0. Other options can work as well, such as R_0 + c·R_{F_0,E_0} for some constant c.

Then we can ask an AI to use R_0^{FP} as its reward function, refactoring that, rather than R_0.

  • A way of looking at it: a natural refactoring of a reward function R_0 will preserve all the implicit features that correlate with R_0. But a natural refactoring of R_0^{FP} will also preserve all the implicit features that stayed constant when R_0 was defined. So if R_0 measures human happiness vs human unhappiness, a natural refactoring of it will preserve things like "having higher dopamine in their brain". But a natural refactoring of R_0^{FP} will also preserve things like "having a brain".
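
Here is one possible implementation sketch (my own reading of the construction: I score an environment by the log-probability of its old-feature description under the training distribution, floored to avoid singularities, and add that to R_0 with a weight; the exact scoring and weighting are assumptions, not a canonical definition):

```python
# Sketch of a feature-preserving reward: penalise environments whose old-feature
# description looks improbable under the training distribution. The log-probability
# form, the floor, and the weighting are my assumptions, not a canonical definition.
import math

def make_feature_preserving(R0, q, Q0_training, floor=-10.0, weight=1.0):
    """R0: reward on old features; q: map from new envs to old-feature descriptions;
    Q0_training: dict of old-feature descriptions -> probability in training."""
    def R_fp(env):
        p = Q0_training.get(q(env), 0.0)
        typicality = max(math.log(p), floor) if p > 0 else floor
        return R0(q(env)) + weight * typicality
    return R_fp

# Toy usage: "happy vs sad human" reward, with "is a human" as a background feature.
Q0_training = {("happy", "human"): 0.5, ("sad", "human"): 0.5}
R0 = lambda desc: 1.0 if desc[0] == "happy" else 0.0
q = lambda env: (env["mood"], env["kind"])

R_fp = make_feature_preserving(R0, q, Q0_training)
print(R_fp({"mood": "happy", "kind": "human"}))   # ~0.31 (1 + log 0.5)
print(R_fp({"mood": "happy", "kind": "robot"}))   # -9.0 (heavily penalised)
```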

4.10 Partially preserved background features

The reward R_{F_0,E_0} is almost certainly too restrictive to be of use. For example, if time is a feature, then this will fall apart when the AI has to do something after the training period. If all the humans in a training set share certain features, humans without those features will be penalised.

There are at least two things we can do to improve this. The first is to include more positive and negative examples in the training set; for example, if we include humans and robots in our training set—as positive and negative examples, respectively—then this difference will show up in R_0 directly, so we won’t need to rely on R_{F_0,E_0} too much.

Another approach would be to explicitly allow certain features to range beyond their typical values in E_0, or to explicitly allow highly correlated variables to decorrelate.

For example, though training occurred during a time period t_0 to t_1, we could explicitly allow time to range beyond those values, without penalty. Similarly, if a medical AI was trained on examples of typical healthy humans, we could decorrelate functioning digestion from brain activity, and get the AI to focus on the second[3].

This has to be done with some care, as adding more degrees of freedom adds more ways for errors to happen. I’m aiming to look further at this issue in later posts.

5 The fundamental questions of model refinements and splintering

We can now rephrase the out-of-distribution issues of section 1.1 in terms of the new formalism:

  1. When the AI refines its model, what would count as a natural refactoring of its reward function?

  2. If the refinements splinter its reward function, what should the AI do?

  3. If the refinements splinter its reward function, and also splinter the human’s reward function, what should the AI do?

6 Examples and applications

The rest of this post applies this basic framework, and its basic insights, to various common AI safety problems and analyses. This section is not particularly structured, and will range widely (and wildly) across a variety of issues.

6.1 Extending beyond the training distribution

Let’s go back to the blegg and rube examples. A human supervises an AI in a training environment, labelling all the rubes and bleggs for it.

The human is using a very simple model, whose only feature is the colour of the object, and whose environment set is the training environment.

Meanwhile the AI, having more observational abilities and no filter as to what can be ignored, notices the objects’ colour, their shape, their luminance, and their texture. It doesn’t know the human’s model, but is using its own model M_AI, whose feature set covers those four features (note that M_AI is a refinement of the human’s model, but that isn’t relevant here).

Suppose that the AI is trained to be a rube-classifier (and hence a blegg-classifier by default). Let R_f be the reward function that counts the number of objects, with feature f, that the AI has classified as rubes. Then the AI could learn many different reward functions in the training environment; here’s one:

R_AI = (−R_red + R_cube + R_smooth + R_dark)/2.

Note that, even though this gets the colour reward completely wrong, this reward matches up with the human’s assessment on the training environment.

Now the AI moves to the larger testing environment, and refines its model minimally (extending its distribution Q_AI to the new environments in the obvious way).

In this larger environment, the AI sometimes encounters objects that it can only see through their colour. Will this be a problem, since the colour component of R_AI is pointing in the wrong direction?

No. It still has Q_AI, and can deduce that a red object must be cube-smooth-dark, so it will continue treating it as a rube[4].
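
A sketch of that inference step (my own illustrative code, with the learned reward following the combination above): the AI’s learned joint distribution lets it fill in the unobserved features of a red object before applying the reward, so the wrong-signed colour weight does no harm:

```python
# Sketch: infer unobserved features from the learned joint distribution Q_AI,
# then apply the learned reward. Feature order: (colour, shape, texture, glows).
# Illustrative code; the learned weights follow the combination in the text.
Q_AI = {  # joint distribution learned in the training environment
    ("red", "cube", "smooth", "dark"): 0.5,
    ("blue", "egg", "furred", "glows"): 0.5,
}

def complete(partial):
    """Most probable full object consistent with the observed features."""
    candidates = [o for o in Q_AI if all(o[i] == v for i, v in partial.items())]
    return max(candidates, key=Q_AI.get)

def R_AI(obj):
    red, cube, smooth, dark = (obj[0] == "red", obj[1] == "cube",
                               obj[2] == "smooth", obj[3] == "dark")
    return (-red + cube + smooth + dark) / 2   # colour weight points the wrong way

seen_only_colour = {0: "red"}
obj = complete(seen_only_colour)
print(obj, R_AI(obj))   # ('red', 'cube', 'smooth', 'dark') 1.0 -> treated as a rube
```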

6.2 Detecting going out-of-distribution

Now imagine the AI learns about the contents of the rubes and bleggs, and so refines its model to a new model M^2_AI that includes vanadium/palladium as a feature in its feature set F^2_AI.

Furthermore, in the training environment, all rubes have palladium and all bleggs have vanadium in them. So, for q the map from the refinement M^2_AI to the previous model, q^{-1}(E^1_AI) (the copy of the training environment E^1_AI inside the new model) has only palladium-rubes and vanadium-bleggs. But in E^2_AI, the full environment, there are rather a lot of rubes with vanadium and bleggs with palladium.

So, similarly to section 4.7, there is no natural refactoring of the rube/blegg reward in q^{-1}(E^1_AI) to M^2_AI. That’s because F^2_AI, the feature set of M^2_AI, includes vanadium/palladium, which co-varies with the other rube/blegg features on the training environment (q^{-1}(E^1_AI)), but not on the full environment E^2_AI.

So looking for reward splintering from the training environment is a way of detecting going out-of-distribution—even on features that were not initially detected in the training distribution by either the human or the AI.
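
As a sketch (my own code and toy data), the detection step can be as simple as checking whether features that co-varied perfectly in the training environment still do so in the environments actually encountered; when they don’t, refactorings built on either feature will disagree:

```python
# Sketch: flag going out-of-distribution when a training-time correlation breaks.
# Objects are dicts; the features and data are illustrative.
def covaries(objs, f, g):
    """True if each value of f always comes with a single value of g
    (i.e. f determines g, as it did in the training data)."""
    pairs = {(o[f], o[g]) for o in objs}
    return len(pairs) <= len({o[f] for o in objs})

training = [
    {"colour": "red",  "metal": "palladium"},
    {"colour": "blue", "metal": "vanadium"},
]
deployment = [
    {"colour": "red",  "metal": "palladium"},
    {"colour": "red",  "metal": "vanadium"},    # a palladium-free rube
]

print(covaries(training, "colour", "metal"))    # True
print(covaries(deployment, "colour", "metal"))  # False -> reward may splinter
```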

6.3 Asking humans and Active IRL

Some of the most promising AI safety methods today rely on getting human feedback[5]. Since human feedback is expensive (it’s slow and hard to get, compared with almost all other aspects of algorithms), people want to get this feedback in the most efficient ways possible.

A good way of doing this would be to ask for feedback when the AI’s current reward function splinters, and multiple options are possible.

A more rigorous analysis would look at the value of information, expected future splinterings, and so on. This is what they do in Active Inverse Reinforcement Learning; the main difference is that AIRL emphasises an unknown reward function with humans providing information, while this approach sees it more as a known reward function over uncertain features (or over features that may splinter in general environments).

6.4 A time for conservatism

I argued that many "conservative" AI optimising approaches, such as quantilizers and pessimistic AIs, don’t have a good measure of when to become more conservative; their tuning parameters don’t encode useful guidelines for the right degree of conservatism.

In this framework, the alternative is obvious: AIs should become conservative when their reward functions splinter (meaning that the reward function compatible with the previous environment has multiple natural refactorings), and very conservative when they splinter a lot.
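
A sketch of what "become conservative when the reward splinters" could mean operationally (my own illustration; the softmin temperature and the toy rewards are assumptions): score each option by a softmin over the candidate natural refactorings, so that disagreement between refactorings automatically makes the agent cautious:

```python
# Sketch: choose actions by a softmin over candidate refactorings of the reward.
# The softmin temperature and the toy numbers are my own illustration.
import math

def softmin(values, temperature=1.0):
    weights = [math.exp(-v / temperature) for v in values]
    total = sum(weights)
    return sum(v * w / total for v, w in zip(values, weights))

def conservative_score(option, refactorings, temperature=1.0):
    return softmin([r(option) for r in refactorings], temperature)

# Two candidate refactorings disagree about a weird new plan but agree on a
# mundane one; the conservative agent prefers the mundane plan.
r1 = {"mundane": 1.0, "weird": 10.0}.get
r2 = {"mundane": 1.0, "weird": -10.0}.get
for option in ["mundane", "weird"]:
    print(option, round(conservative_score(option, [r1, r2]), 2))
# mundane 1.0, weird -10.0
```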

This design is very similar to Inverse Reward Design. In that situation, the reward signal in the training environment is taken as information about the "true" reward function. Basically, they take all reward functions that could have given the specific reward signals, and assume the "true" reward function is one of them. In that paper, they advocate extreme conservatism at that point, optimising the minimum of all possible reward functions.

The idea here is almost the same, though with more emphasis on "having a true reward defined on uncertain features". Having multiple contradictory reward functions compatible with the information, in the general environment, is equivalent to having a lot of splintering of the training reward function.

6.5 Avoiding ambiguous distant situations

The post "By default, avoid ambiguous distant situations" can be rephrased as: let M_0 be a model in which we have a clear reward function R_0, and let M_1 be a refinement of this to general situations. We expect that this refinement splinters R_0. Let M'_1 be like M_1, except with a smaller environment set E'_1 ⊂ E_1, defined such that:

  1. An AI could be expected to be able to constrain the world to be in E'_1, with high probability,

  2. The refinement M'_1 does not splinter R_0.

Then that post can be summarised as:

  • The AI should constrain the world to be in E'_1, and then maximise the natural refactoring of R_0 in M'_1.

6.6 Extra variables

Stuart Russell writes:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k < n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

The approach in sections 4.9 and 4.10 explicitly deals with this.

6.7 Hidden (dis)agreement and interpretability

Now consider two agents doing a rube/blegg classification task in the training environment, where each agent only models two of the features (say, one models colour and texture, while the other models shape and luminance).

Despite not having a single feature in common, both agents will agree on what bleggs and rubes are, in the training environment. And when refining to a fuller model that includes all four (or five) of the key features, both agents will agree as to whether a natural refactoring is possible or not.

This can be used to help define the limits of interpretability. The AI can use its own model, and its own designed features, to define the categories and rewards in the training environment. These need not be human-parsable, but we can attempt to interpret them in human terms. And then we can give this interpretation to the AI, as a list of positive and negative examples of our interpretation.

If we do this well, the AI’s own features and our interpretation will match up in the training environment. But as we move to more general environments, these may diverge. Then the AI will flag a "failure of interpretation" when its refactoring diverges from a refactoring of our interpretation.

For example, if we think the AI detects pandas by looking for white hair on the body and black hair on the arms, we can flag lots of examples of pandas and that hair pattern (and non-pandas with unusual hair patterns). We don’t use these examples for training the AI, just to confirm that, in the training environment, there is a match between "AI-thinks-they-are-pandas" and "white-hair-on-body-black-hair-on-arms".

But, on an adversarial example, the AI could detect that, while it is detecting gibbons, this no longer matches up with our interpretation. A splintering of interpretations, if you want.

6.8 Wireheading

The approach can also be used to detect wireheading. Imagine that the AI has various detectors that allow it to label the features of the bleggs and rubes. It models the world with ten features: features representing the "real world" versions of the features, and features representing the "this signal comes from my detector" versions.

This gives a total of ten features: the five features "in the real world" and the "AI-labelled" versions of these.

In the training environment, there was full overlap between these features, so the AI might learn the incorrect "maximise my labels/detector signal" reward.

However, when it refines its model to include features and environments where labels and underlying reality diverge, it will realise that this splinters the reward, and thus detect a possible wireheading. It could then ask for more information, or have an automated "don’t wirehead" approach.
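
A sketch of the detection (my own code): with separate "world" and "detector" features, the refined model can notice when a reward computed from detector readings stops matching the same reward computed from the world features, which is the signature of a wireheading opportunity:

```python
# Sketch: compare a reward computed from detector readings with the same reward
# computed from the underlying world features. Divergence flags possible wireheading.
def count_red(objs, key):
    return sum(o[key] == "red" for o in objs)

training = [
    {"world_colour": "red",  "detector_colour": "red"},
    {"world_colour": "blue", "detector_colour": "blue"},
]
tampered = [
    {"world_colour": "blue", "detector_colour": "red"},   # detector hacked
    {"world_colour": "blue", "detector_colour": "red"},
]

for env in (training, tampered):
    world, sensed = count_red(env, "world_colour"), count_red(env, "detector_colour")
    if world != sensed:
        print("reward splinters: possible wireheading, ask for more information")
    else:
        print("detector and world agree:", world)
```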

6.9 Hypotheticals, and training in virtual environments

To get around the slowness of the real world, some approaches train AIs in virtual environments. The problem is to transfer that learning from the virtual environment to the real one.

Some have suggested making the virtual environment sufficiently detailed that the AI can’t tell the difference between it and the real world. But, a) this involves fooling the AI, an approach I’m always wary of, and b) it’s unnecessary.

Within the meta-formalism of this post, we could train the AI in a virtual environment, which it models in one way, and let it construct a separate model of the real world. We would then motivate the AI to find the "closest match" between the two models, in terms of features and how they connect and vary. This is similar to how we can train pilots in flight simulators; the pilots are never under any illusion as to whether this is the real world or not, and even crude simulators can allow them to build certain skills[6].

This can also be used to allow the AI to deduce information from hypotheticals and thought experiments. If we show the AI an episode of a TV series showing people behaving morally (or immorally), then the episode need not be believable or plausible, as long as we can roughly point to the features in the episode that we want to emphasise, and roughly how these relate to real-world features.

6.10 Defining how to deal with multiple plausible refactorings

The approach for synthesising human preferences, defined here, can be rephrased as:

  • "Given that we expect multiple natural refactorings of human preferences, and given that we expect some of them to go disastrously wrong, here is one way of resolving the splintering that we expect to be better than most."

This is just one way of doing it, but it does show that "automating what AIs do with multiple refactorings" might not be impossible. The following subsection has some ideas on how to deal with that.

6.11 Global, large scale preferences

In an old post, I talked about the concept of "emergency learning", which was basically, "lots of examples, and all the stuff we know and suspect about how AIs can go wrong; shove it all in, and hope for the best". The "shove it all in" was a bit more structured than that, defining large scale preferences (like "avoid siren worlds" and "don’t over-optimise") as constraints to be added to the learning process.

It seems we can do better than that here. Using examples and hypotheticals, it seems we could construct ideas like "avoid slavery", "avoid siren worlds", or "don’t over-optimise" as rewards or positive/negative examples in certain simple training environments, so that the AI "gets an idea of what we want".

We can then label these ideas as "global preferences". The idea is that they start as loose requirements (we have much more granular human-scale preferences than just "avoid slavery", for example), but, the more the world diverges from the training environment, the more strictly they are to be interpreted, with the AI required to respect some softmin of all natural refactorings of these features.

In a sense, we’d be saying "prevent slavery; these are the features of slavery, and in weird worlds, be especially wary of these features".

6.12 Avoiding side-effects

Krakovna et al. presented a paper on avoiding side-effects from AI. The idea is to have an AI maximising some reward function, while reducing side effects. So the AI would not smash vases or let them break, nor would it prevent humans from eating sushi.

In one of their environments, we want the AI to avoid knocking the sushi off the belt as it moves past.

In another, in contrast, we’d want the AI to remove the vase from the belt before it smashes.

I pointed out some issues with the whole approach. Those issues were phrased in terms of sub-agents, but my real intuition is that syntactic methods are not sufficient to control side effects. In other words, the AI can’t learn to do the right thing with sushi and vases unless it has some idea of what these objects mean to us: we prefer sushi to be eaten and vases not to be smashed.

This can be learnt if the AI has enough training examples, learning that sushi being eaten is a general feature of the environments it operates in, while vases being smashed is not. I’ll return to this idea in a later post.

6.13 Cancer patients

The ideas of this post were present in implicit form in the idea of training an AI to cure cancer patients.

Using examples of successfully treated cancer patients, we noted they all shared some positive features (recuperating, living longer) and some incidental or negative features (complaining about pain, paying more taxes).

So, using the approach of section 4.9, we can designate that we want the AI to cure cancer; this will be interpreted as increasing all the features that correlate with that.

Using the explicit decorrelation of section 4.10, we can also explicitly remove the negative options from the desired feature sets, thus improving the outcomes even more.

6.14 The genie and the burning mother

In Eliezer’s original post on the hidden complexity of wishes, he talks of the challenge of getting a genie to save your mother from a burning building:

So you hold up a photo of your mother’s head and shoulders; match on the photo; use object contiguity to select your mother’s whole body (not just her head and shoulders); and define the future function using your mother’s distance from the building’s center. [...]

You cry "Get my mother out of the building!", for luck, and press Enter. [...]

BOOM! With a thundering roar, the gas main under the building explodes. As the structure comes apart, in what seems like slow motion, you glimpse your mother’s shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.

How could we avoid this? What you want is your mother out of the building. The feature "mother in building" must absolutely be set to false; this is a priority call, overriding almost everything else.

Here we’d want to load examples of your mother outside the building, so that the genie/AI learns the features "mother in house"/"mother out of house". Then it will note that "mother out of house" correlates with a whole lot of other features—like the mother being alive, breathing, pain-free, often awake, and so on.

All those are good things. But there are some other features that don’t correlate so well—such as the time being earlier, your mother not remembering a fire, not being covered in soot, not worried about her burning house, and so on.

As in the cancer patient example above, we’d want to preserve the features that correlate with the mother being out of the house, while allowing decorrelation with the features we don’t care about or don’t want to preserve.

6.15 Splintering morally relevant categories: honour, gender, and happiness

If the Antikythera mechanism had been combined with the Aeolipile to produce an ancient Greek AI, and Homer had programmed it (among other things) to "increase people’s honour", how badly would things have gone?

If Babbage had completed the analytical engine as a Victorian AI, and programmed it (among other things) to "protect women", how badly would things have gone?

If a modern programmer were to combine our neural nets into a superintelligence and program it (among other things) to "increase human happiness", how badly will things go?

There are three morally relevant categories here, and it’s illustrative to compare them: honour, gender, and hedonic happiness. The first has splintered, the second is splintering, and the third will likely splinter in the future.

I’m not providing solutions in this subsection, just looking at where the problems can appear, and encouraging people to think about how they would have advised Homer or Babbage to define their concepts. Don’t think "stop using your concepts, use ours instead", because our concepts/features will splinter too. Think "what’s the best way they could have extended their preferences even as the features splinter?"

  • 6.15.1 Honour

If we look at the concept of honour (its Wikipedia article is a good place to start), we see a concept that has already splintered.

That article reads like a meandering mess. Honour is "face", "reputation", a "bond between an individual and a society", "reciprocity", a "code of conduct", "chastity" (or "virginity"), a "right to precedence", "nobility of soul, magnanimity, and a scorn of meanness", "virtuous conduct and personal integrity", "vengeance", "credibility", and so on.

What a basket of concepts! They only seem vaguely connected together; and even places with strong honour cultures differ in how they conceive of honour, from place to place and from epoch to epoch[7]. And yet, if you asked most people within those cultures what honour was, they would have had a strong feeling that it was a single, well-defined thing, maybe even a concrete object.

  • 6.15.2 Gender

In his post "The categories were made for man, not man for the categories", Scott writes:

Absolutely typical men have Y chromosomes, have male genitalia, appreciate manly things like sports and lumberjackery, are romantically attracted to women, personally identify as male, wear male clothing like blue jeans, sing baritone in the opera, et cetera.

But Scott is writing this in the 21st century, long after the gender definition has splintered quite a bit. In middle-class Victorian England[8], the gender divide was much stronger—in that, from one component of the divide, you could predict a lot more. For example, if you knew that someone wore dresses in public, you knew that, almost certainly, they couldn’t own property if they were married, nor could they vote, they would be expected to be in charge of the household, might be allowed to faint, and were expected to guard their virginity.

We talk nowadays about gender roles multiplying or being harder to define, but they’ve actually been splintering for a lot longer than that. Even though we could define two genders in 1960s Britain, at least roughly, that definition was a lot less informative than it was in Victorian-middle-class-Britain times: it had many fewer features strongly correlated with it.

  • 6.15.3 Happiness

On to happiness! Philosophers and others have been talking about happiness for centuries, often contrasting "true happiness", or flourishing, with hedonism, or drugged-out stupor, or things of that nature. Often "true happiness" is a life of duty to what the philosopher wants to happen, but at least there is some analysis, some breakdown of the "happiness" feature into smaller component parts.

Why did the philosophers do this? I’d wager that it’s because the concept of happiness was already somewhat splintered (as compared with a model where "happiness" is a single thing). Those philosophers had experience of joy, pleasure, the satisfaction of a job well done, connection with others, as well as superficial highs from temporary feelings. When they sat down to systematise "happiness", they could draw on the features of their own mental model. So even if people hadn’t systematised happiness themselves, when they heard of what philosophers were doing, they probably didn’t react with "What? Drunken hedonism and intellectual joy are not the same thing? How dare you say such a thing!"

But looking into the future, into a world that an AI might create, we can foresee many situations where the implicit assumptions of happiness come apart, and only some remain. I say "we can foresee", but it’s actually very hard to know exactly how that’s going to happen; if we knew it exactly, we could solve the issues now.

So, imagine a happy person. What do you think they have in life, that is not a trivial synonym of happiness? I’d imagine they have friends, are healthy, think interesting thoughts, have some freedom of action, may work on worthwhile tasks, may be connected with their community, and probably make people around them happy as well. Getting a bit less anthropomorphic, I’d also expect them to be a carbon-based life-form, to have a reasonable mix of hormones in their brain, to have a continuity of experience, to have a sense of identity, to have a personality, and so on.

Now, some of those features can clearly be separated from "happiness". Even ahead of time, I can confidently say that "being a carbon-based life-form" is not going to be a critical feature of "happiness". But many of the other ones are not so clear; for example, would someone without continuity of experience or a sense of identity be "happy"?

Of course, I can’t answer that question. Because the question has no answer. We have our current model of happiness, which co-varies with all those features I listed and many others I haven’t yet thought of. As we move into more and more bizarre worlds, that model will splinter. And whether we assign the different features to "happiness" or to some other concept is a choice we’ll make, not a well-defined solution to a well-defined problem.

However, even at this stage, some answers are clearly better than others; statues of happy people should not count, for example, nor should written stories describing very happy people.

6.16 Apprenticeship learning

In apprenticeship learning (or learning from demonstration), the AI aims to copy what experts have done. Inverse reinforcement learning can be used for this purpose, by guessing the expert’s reward function based on their demonstrations. It looks for key features in expert trajectories and attempts to reproduce them.

So, if we had an automatic car driving people to the airport, and fed it some trajectories (maybe ranked by speed of delivery), it would notice that passengers also arrive alive, with their bags, without being pursued by the police, and so on. This is akin to section 4.9: the car would not accelerate blindly to get there as fast as possible.

But the algorithm has trouble getting to truly super-human performance[9]. It’s far too conservative, and, if we loosen the conservatism, it doesn’t know what’s acceptable and what isn’t, or how to trade these off: since all passengers survived, their luggage arrived intact, and the car was always painted yellow in the training data, it has no reason to prefer human survival to taxi colour. It doesn’t even have a reason to have a specific feature resembling "passenger survived" at all.

This might be improved by the "allow decorrelation" approach from section 4.10: we specifically allow it to maximise speed of transport, while keeping the other features (no accidents, no speeding tickets) intact. As in section 6.7, we’ll attempt to check that the AI does prioritise human survival, and that it will warn us if a refactoring moves it away from this.


  1. Now, sometimes two worlds may be indistinguishable for any feature set. But in that case, they can’t be distinguished by any observations either, so their relative probabilities won’t change: as long as it’s defined, the ratio of their posterior probabilities is constant for all observations. So we can replace the two worlds with a single world, of prior probability equal to the sum of their priors. Doing this for all indistinguishable worlds (which form equivalence classes) gives a set of distinguishable worlds, with a well-defined prior on it. ↩︎

  2. It’s useful to contrast a refinement with the "abstraction" defined in this sequence. An abstraction throws away irrelevant information, so it is not generally a refinement. Sometimes they are exact opposites, as the ideal gas law is an abstraction of the movement of all the gas particles, while the opposite move would be a refinement.

    But they aren’t exact opposites either. Starting with the neurons of the brain, you might abstract them to "emotional states of mind", while a refinement could also add "emotional states of mind" as new features (while also keeping the old features). A splintering is more the opposite of an abstraction, as it signals that the old abstraction’s features are not sufficient.

    It would be interesting to explore some of the concepts in this post with a mixture of refinements (to get the features we need) and abstractions (to simplify the models and get rid of the features we don’t need), but that is beyond the scope of this current, already over-long, post. ↩︎

  3. Specifically, we’d point—via labelled examples—at a cluster of features that correlate with functioning digestion, and another cluster of features that correlate with brain activity, and allow those two clusters to decorrelate from each other. ↩︎

  4. It is no coincidence that, if R and R' are rewards on M_1 that are identical on q^{-1}(E_0), and if R is a refactoring of R_0, then R' is also a refactoring of R_0. ↩︎

  5. Though note there are some problems with this approach, both in theory and in practice. ↩︎

  6. Some "body instinct" skills require more realistic environments, but some skills and procedures can perfectly well be trained in minimal simulators. ↩︎

  7. You could define honour as "behaving according to the implicit expectations of their society", but that just illustrates how time-and-place dependent honour is. ↩︎

  8. Pre-1870. ↩︎

  9. It’s not impossible to get superhuman performance from apprenticeship learning; for example, we could select the best human performance on a collection of distinct tasks, and thus get the algorithm to have an overall performance that no human could ever match. Indeed, one of the purposes of task decomposition is to decompose complex tasks in ways that allow apprenticeship-like learning to have safe and very superhuman performance on the whole task. ↩︎