Goal completion: noise, errors, bias, prejudice, preference and complexity

A putative new idea for AI control; index here.

This is a preliminary look at how an AI might assess and deal with various types of errors and uncertainties when estimating true human preferences. I’ll be using the circular rocket model to illustrate how these might be distinguished by an AI. Recall that the rocket can accelerate by −2, −1, 0, 1, or 2, and the human wishes to reach the space station (at point 0 with velocity 0) and to avoid accelerations of ±2. In what follows, there will generally be some noise, so to make the whole thing more flexible, assume that the space station is a bit bigger than usual, covering five squares. So “docking” at the space station means reaching {-2,-1,0,1,2} with 0 velocity.
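For concreteness, here is a minimal sketch of the model in Python. The size of the circular track (20 cells) and the signed coordinate convention are my own assumptions; the post only fixes the action set and the five-square station.

```python
# A minimal sketch of the circular rocket model, under assumed details:
# the track size (20 cells) and the coordinate convention are not given
# in the post, only the action set and the five-square station.

N_CELLS = 20                       # assumed size of the circular track
ACCELERATIONS = [-2, -1, 0, 1, 2]  # the five available accelerations
STATION = {-2, -1, 0, 1, 2}        # the five-square space station

def step(position, velocity, acceleration):
    """Advance one turn: update the velocity, then move and wrap around
    the circle, keeping positions in the signed range [-N_CELLS/2, N_CELLS/2)."""
    velocity += acceleration
    position = (position + velocity + N_CELLS // 2) % N_CELLS - N_CELLS // 2
    return position, velocity

def docked(position, velocity):
    """Docking means sitting on one of the five station squares at zero velocity."""
    return position in STATION and velocity == 0
```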

The purpose of this exercise is to distinguish true preferences from other things that might seem to be preferences from the outside, but aren’t. Ultimately, if this works, we should be able to construct an algorithm that identifies preferences, such that anything it rejects is at least arguably not a preference.
So, at least initially, I’ll be identifying terms like “bias” with “fits into the technical definition of bias used in this model”; once the definitions are refined, we can then check whether they capture enough of the concepts we want.
So I’m going to use the following terms to distinguish various technical concepts in this domain:
  1. Noise

  2. Error

  3. Bias

  4. Prejudices

  5. Known prejudices

  6. Preferences

Here, noise is when the action selected by the human doesn’t lead to the correct action output. Error is when the human selects the wrong action. Bias is when the human is following the wrong plan. A prejudice is a preference the human has that they would not endorse if it were brought to their conscious notice. A known prejudice is a prejudice the human knows about, but can’t successfully correct within themselves.
And a preference is a preference.
What characteristics would allow the AI to distinguish between these? Note that part of the reason the terms are non-standard is that I’m not starting with perfectly clear concepts and attempting to distinguish them; instead, I’m finding ways of distinguishing various concepts, and seeing if these map on well to the concepts we care about.

Noise versus preference and complexity

In this model, noise is a 5% chance of under-accelerating, i.e. a desired acceleration of ±2 will, 5% of the time, give an acceleration of ±1, and a desired acceleration of ±1 will, 5% of the time, give an acceleration of zero.
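As a sketch, this noise model is just a restatement of the 5% under-acceleration rule above:

```python
import random

def noisy_acceleration(intended, p_noise=0.05):
    """With probability p_noise, reduce the magnitude of the intended
    acceleration by one step: ±2 becomes ±1, ±1 becomes 0.
    A desired acceleration of 0 is unaffected."""
    if intended != 0 and random.random() < p_noise:
        return intended - 1 if intended > 0 else intended + 1
    return intended
```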
The human starts from a position where, to reach the space station at zero velocity, the best plan is to accelerate for a long while and then decelerate (by −1) for two turns. Accelerating by −2 once would also do the trick, though the human prefers not to do that, obviously.
As in the previous post, the AI has certain features φi to explain the human’s behaviour. They are:
* φ0({-2,-1,0,1,2},0;-)=1
* φ1(-,-;-)=1
* φ2(-,-;+2)=1
* φ3(-,-;-2)=1
The first feature indicates that the five-square space station is in a special position (if the velocity is 0). The second feature (a universal feature) is used to show that wasting time is not beneficial. The third feature is used to rule out accelerations of +2, the last feature those of −2.
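As a sketch, the four features can be written as indicator functions over (position, velocity, action) triples; the “-” entries above mean “any value”.

```python
# A sketch of the four features as indicator functions on
# (position, velocity, action); a "-" in the post means "any value".

def phi0(pos, vel, act):
    """1 on the five-square station at zero velocity, whatever the action."""
    return 1 if pos in {-2, -1, 0, 1, 2} and vel == 0 else 0

def phi1(pos, vel, act):
    """Universal feature: 1 everywhere (used to penalise wasting time)."""
    return 1

def phi2(pos, vel, act):
    """1 whenever the chosen acceleration is +2."""
    return 1 if act == 2 else 0

def phi3(pos, vel, act):
    """1 whenever the chosen acceleration is -2."""
    return 1 if act == -2 else 0

FEATURES = [phi0, phi1, phi2, phi3]
```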
Given the trajectory it’s seen, the AI can confidently fit φ0, φ1 and φ2 to some estimate of true rewards (the human rushes to the space station, without using +2 accelerations). However, it doesn’t know what to do with φ3. The human had an opportunity to use a −2 acceleration, but went for two −1s instead. There are two options: the human actually wants to avoid −2 accelerations, and everything went well. Or the human doesn’t want to avoid them, but the noise forced their desired −2 acceleration down to −1.
Normally there would be a complexity prior here, with the three-feature explanation being the most likely (possibly still the most likely even after multiplying it by the 5% noise probability needed to explain the observed −1,−1). However, there is a recurring risk that the AI will underestimate the complexity of human desires. One way of combating this is to not use a complexity prior, at least up to some “reasonable” size of human desires. If the four-feature explanation has as much prior weight as the three-feature one (or more), then φ3 is likely to be used and the AI will see the −1,−1 sequence as deliberate, not noise.
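To make the role of the 5% concrete, here is a toy version of the likelihood bookkeeping for the observed −1,−1 sequence. The exact accounting depends on the planner, so treat the numbers as illustrative assumptions rather than the post’s actual calculation.

```python
P_NOISE = 0.05

# Three-feature story (no aversion to -2): the human intended -2, noise
# knocked it down to -1, and the follow-up -1 then executed cleanly.
lik_three_features = P_NOISE * (1 - P_NOISE)      # 0.0475

# Four-feature story (aversion to -2): both intended -1s executed cleanly.
lik_four_features = (1 - P_NOISE) ** 2            # 0.9025

# With equal prior weight the deliberate (four-feature) story dominates;
# a complexity prior would need to favour the three-feature story by
# roughly 19:1 before it won out.
print(lik_four_features / lik_three_features)     # ~19.0
```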
A warning, however: humans have complex preferences, but those preferences are not relevant to every single situation. What about φ4, the human preference for chocolate, φ5, the human preference for dialogue in movies, and φ6, the human preference for sunlight? None of them would appear directly in this simple rocket model. And though φ0-φ1-φ2-φ3 is a four-feature model of human preferences, so is φ0-φ1-φ2-φ6 (which is indistinguishable, in this example, from the three-feature model φ0-φ1-φ2).
So we can’t say “models with four features are more likely than models with three”; at best we could say “models with three *relevant* features are as likely as models with four *relevant* features”. But, given that, the AI will still converge on the correct model of human preferences.
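One possible operationalisation of “relevant”, which is my assumption rather than anything from the post: a feature is relevant if it takes a nonzero value on at least one reachable state-action triple of the model.

```python
def relevant_features(features, reachable):
    """Filter a feature list down to the relevant ones, where 'relevant'
    is taken to mean: fires (is nonzero) on at least one reachable
    (position, velocity, action) triple.  Features like 'likes chocolate'
    or 'likes sunlight' never fire in the rocket model, so they drop out;
    the universal feature phi1 fires everywhere, so it stays."""
    return [phi for phi in features
            if any(phi(pos, vel, act) for (pos, vel, act) in reachable)]
```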
Note that as the amount of data/trajectories increases, the ability of the AI to separate preference from noise increases rapidly.

Error versus bias versus preference

First, set noise to zero. Now imagine that the rocket is moving at a relative velocity such that the ideal strategy to reach the space station is to accelerate by +1 for three more turns, and then decelerate by −1 for several turns until it reaches the station.
Put the noise back up to 5%. Now, the optimum strategy is to start decelerating immediately (since there is a risk of under-accelerating during the deceleration phase). Instead, the human starts accelerating by +1.
There are three possible explanations for this. Firstly, the human may not actually want to dock at the space station. Secondly, the human may be biased (overconfident, in this case): they may believe there is no noise (or that they can overcome it through willpower and fancy flying?) and are therefore following the ideal strategy for the no-noise situation. Or, thirdly, the human may simply have made an error, doing +1 when they meant to do −1.
These options can be distinguished by observing subsequent behaviour (and behaviour on different trajectories). If we assume the preferences are correct, then a biased trajectory involves the human following the ideal plan for an incorrect noise value, and then desperately adjusting at the end when they realise their plan won’t work. An error, on the other hand, should result in the human trying to undo their action as best they can (say, by decelerating next turn rather than following the +1,+1,−1,−1,… strategy of the no-noise world).
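A very crude sketch of how such a comparison might be scored, assuming some external planner (not implemented here) supplies the predicted action sequence for each hypothesis:

```python
def best_explanation(observed_actions, candidate_plans):
    """Crude sketch: score each hypothesis ('error', 'bias', 'other
    preferences', ...) by how many of the observed actions match its
    predicted action sequence, and return the best-scoring label.
    candidate_plans maps a hypothesis name to a list of predicted actions;
    producing those predictions is the planner's job, not done here."""
    def score(plan):
        return sum(a == b for a, b in zip(observed_actions, plan))
    return max(candidate_plans, key=lambda name: score(candidate_plans[name]))
```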
These are not sharp distinctions (especially on a single trajectory or a small set of them). Maybe the human has a preference for odd manoeuvres as they approach the space station. Maybe they make a mistake every turn, and purely coincidentally follow the right trajectory. And so on.
But this is about the most likely (simplest) explanation. Does the human show all the signs of being a competent seeker of a particular goal, except that sometimes they seem to do completely random things, which they then try to repair (or shift to a completely alternate strategy if the random action is not repairable)? Most likely an error.
Is the human behaviour best explained by simple goals, but a flaw in the strategy? This could happen if the overconfident human always accelerated too fast, and then did some odd manoeuvres back and forth to dock with the station. This could be odd preferences for the docking procedure, but a larger set of trajectories could rule this out: sometimes, the overconfident human will arrive perfectly at the station. In that case, they will *not* perform the back-and-forth dance, revealing that that behaviour was a result of a flawed strategy (bias) rather than odd preferences.
A subtlety in distinguishing bias arises when the human (or maybe the system they’re in) uses meta-rationality to correct for the bias. Maybe the human is still overconfident, but has picked up a variety of habits that compensate for that overconfidence. How would the AI interpret some variant of an overly prudent approach phase, followed by wildly reckless late manoeuvring (when errors are easier to compensate for)? This is not clear, and requires more thought.

Preference versus prejudice (and bias)

This is the trickiest distinction of all: how would you distinguish a prejudice from a true preference? One way of approaching it is to see whether presenting the same information in different ways makes a difference.
This can be attempted with bias as well. Suppose the human’s reluctance to use ±2 accelerations is due to a bias that causes them to fear the rocket will fall apart at those accelerations, and that fear isn’t accurate. Then the AI can report either “we have an acceleration of +2” or “we have the highest safe acceleration”. Both are saying the same thing, but the human will behave differently in each case, revealing something about what is preference and what is bias.
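A sketch of that framing test, where `pilot_policy` is a hypothetical callable (not anything defined earlier) that returns the pilot’s chosen acceleration given the state and the AI’s report:

```python
def framing_test(pilot_policy, state):
    """Present the same fact under two descriptions and compare responses.
    A systematic difference suggests that the framing, not the underlying
    fact, is driving behaviour: bias (or prejudice) rather than preference."""
    action_literal = pilot_policy(state, "we have an acceleration of +2")
    action_reframed = pilot_policy(state, "we have the highest safe acceleration")
    return action_literal != action_reframed
```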
What about prejudice? Racism and sexism are obvious examples, but prejudice is more common than that. Suppose the pilot listens to opera music while flying, and unconsciously presses down harder on the accelerator while listening to “Ride of the Valkyries”. This fits perfectly into the prejudice format: it’s a preference that the pilot would want to remove if they were informed about it.
To test this, the AI could offer to inform the human pilot of the music selection while the pilot was planning the flight (possibly at some small price). If the pilot had a genuine preference for “flying fast when listening to Wagner”, then the music selection is relevant to their planning, and they’d certainly want it. If the prejudice was unconscious, however, they would have no interest in seeing the music selection at this point.
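A sketch of that probe. The inputs (the pilot’s plans with and without the music information, and whether they accept the offer) are hypothetical observables, not part of the post’s model:

```python
def music_information_probe(plan_with_info, plan_without_info, accepts_offer):
    """A genuine 'fly fast to Wagner' preference predicts that the pilot
    accepts the (cheap) offer and that the resulting flight plan depends
    on the music selection; an unconscious prejudice predicts neither."""
    uses_info = plan_with_info != plan_without_info
    if accepts_offer and uses_info:
        return "likely genuine preference"
    if not accepts_offer and not uses_info:
        return "likely unconscious prejudice"
    return "ambiguous; gather more trajectories"
```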
Once a prejudice is identified, the AI then has the option of asking the human directly whether they agree with it (thus upgrading it to a true, if previously unknown, preference).

Known prejudices

Sometimes, people have prejudices, know about them, don’t like them, but can’t avoid them. They might then have very complicated meta-behaviours to avoid falling prey to them. To use the Wagner example, someone trying to repress that prejudice would seem to have the double preferences “I prefer to never listen to Wagner while flying” and “if, however, I do hear Wagner, I prefer to fly faster”, when in fact neither of these is a preference.
It would seem that the simplest approach would be to have people list their undesired prejudices. But apart from the risk that they could forget some of them, their statements might be incorrect. They could say “I don’t want to want to fly faster when I hear opera”, while in reality only Wagner causes that in them. So further analysis is required beyond simply collecting these statements.

Revisiting complexity

In a previous post, I explored the idea of giving the AI some vague idea of the size and complexity of human preferences, and having it aim for explanations of that size. However, I pointed out a trade-off: if the size was too large, the AI would label prejudices or biases as preferences, while if the size was too small, it would ignore genuine preferences.
If there are ways of distinguishing biases and prejudices from genuine preferences, though, then there is no trade-off. Just set the expected complexity of *combined* human preferences+prejudices+biases to some number, and let the algorithm sort out what is preference and what isn’t. It is likely much easier to estimate the size of human preferences+pseudo-preferences than it is to identify the size of true preferences (which might vary more from human to human, for a start).
I welcome comments, and will let you know if this research angle goes anywhere.