Uncertainty versus fuzziness versus extrapolation desiderata

I proposed a way around Goodhart's curse. Essentially this reduces to properly accounting for all of our uncertainty about our values, including some meta-uncertainty about whether we've properly accounted for all our uncertainty.

Wei Dai had some questions about the approach, pointing out that it seemed to have a problem similar to the one corrigibility faces: once the AI has resolved all uncertainty about our values, there's nothing left. I responded by talking about fuzziness rather than uncertainty.

Resolving ambiguity, sharply or fuzzily

We have a human, H, who hasn't yet dedicated any real thought to population ethics. We run a hundred "reasonable" simulations where we introduce H to population ethics, varying the presentation a bit, and ultimately ask for their opinion.

In 45 of these runs, they endorsed total utilitarianism; in 15 of them, they endorsed average utilitarianism; and in 40 of them, they endorsed some compromise system (say the one I suggested here).

That's it. There is no more uncertainty; we know everything there is to know about H's potential opinions on population ethics. What we do with this information (how we define H's "actual" opinion) is up to us, neglecting, for the moment, the issue of H's meta-preferences, which likely suffer from a similar type of ambiguity.

We could round these preferences to "total utilitarianism". That would be the sharpest option.

We could normalise those three utility functions, then add them with the 45-15-40 relative weights.

Or we could do a similar normalisation but, mindful of the fragility of value, either move the major options to equal 1-1-1 weights, or stick with 45-15-40 but use some smooth minimum on the combination. These would be the fuzzier choices.
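To make those three options concrete, here is a minimal Python sketch (my own illustrative choice of function names and softmin temperature; the three normalised utility functions are assumed to be given as callables from worlds to real numbers):

```python
import numpy as np

def sharp(utilities, weights):
    """Sharpest option: round to the highest-weighted theory (here, total utilitarianism)."""
    return utilities[int(np.argmax(weights))]

def weighted_mix(utilities, weights):
    """Mix the normalised utilities with the 45-15-40 relative weights (or 1-1-1 for equal weights)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return lambda world: sum(wi * u(world) for wi, u in zip(w, utilities))

def smooth_min_mix(utilities, weights, temperature=1.0):
    """Fuzzier option: a smooth minimum of the weighted utilities,
    which penalises worlds that any one theory rates as bad."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    def combined(world):
        vals = np.array([wi * u(world) for wi, u in zip(w, utilities)])
        # softmin(v) = -T * log(sum(exp(-v / T))); approaches min(v) as T -> 0
        return -temperature * np.log(np.sum(np.exp(-vals / temperature)))
    return combined

# e.g. overall = smooth_min_mix([u_total, u_avg, u_comp], [45, 15, 40])
# where u_total, u_avg, u_comp are hypothetical normalised utility functions
```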

All of these options are valid, given that we haven't defined any way of resolving ambiguous situations like that. And note that fuzziness looks a lot like uncertainty, in that a high-fuzziness mix looks like what you'd have as a utility function if you were very uncertain. But, unlike uncertainty, more information doesn't "resolve" this fuzziness. That's why Jessica's critique of corrigibility doesn't apply to this situation.

(And note also that we could introduce fuzziness for different reasons: we could believe that this is a genuinely good way of resolving competing values, or it could be to cover uncertainty that would be too dangerous to have the AI resolve, or we could introduce it to avoid potential Goodhart problems, without believing that the fuzziness is "real".)

Fuzziness and choices in extrapolating concepts

The picture where we have 45-15-40 weights on well-defined moral theories is not a realistic starting point for establishing human values. We humans start mainly with partial preferences, or just lists of examples of correct and incorrect behaviours in a narrow span of circumstances.

Extrapolating from these examples to a weighting on moral theories is a process that is entirely under human control. We decide how to do so, thus incorporating our meta-preferences implicitly in the process and its outcome.

Extrapolating dogs and cats and other things

Consider the supervised learning task of separating photos of dogs from photos of non-dogs. We hand the neural net a bunch of labelled photos, and tell it to go to work. It now has to draw a conceptual boundary around "dog".

What is the AI's concept of "dog" ultimately grounded on? It's obviously not just on the specific photos we handed it; that way lies overfitting and madness.

But nor can we generate every possible set of pixels and have a human label them as dog or non-dog. Take, for example, the following image:

That, apparently, is a cat, but I checked with people at the FHI and we consistently mis-identified it. However, a sufficiently smart AI might be able to detect some implicit cat-like features that aren't salient to us, and correctly label it as non-dog.

Thus, in order to correctly identify the term "dog", defined by human labelling, the AI has to disagree with… human labelling. There are more egregious non-dogs that could get labelled as "dogs", such as a photo of a close friend with a sign that says "Help! They'll let me go if you label this image as a dog".

Human choices in image recognition boundaries

When we program a neural net to classify dogs, we make a lot of choices: the size of the neural net, activation functions and other hyper-parameters, the size and contents of the training, test, and validation sets, whether to tweak the network after the first run, whether to publish the results or bury them, and so on.

Some of these choices can be seen as exactly the "fuzziness" I defined above: some options determine whether the boundary is drawn tightly or loosely around the examples of "dog", and whether ambiguous cases are pushed into one category or allowed to remain ambiguous. But some of these choices (such as methods for avoiding sampling biases, or for handling adversarial examples like the panda misclassified as a gibbon) are much more complicated than just "sharp versus fuzzy". I'll call these choices "extrapolation choices", as they determine how the AI extrapolates from the examples we have given it.
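As a toy illustration of the "sharp versus fuzzy" part of those choices (nothing here is any particular library's API), the same trained classifier can be wrapped in sharper or fuzzier labelling rules:

```python
import numpy as np

def sharp_label(p_dog, threshold=0.5):
    """Sharp choice: every image is forced into one of the two categories."""
    return "dog" if p_dog >= threshold else "non-dog"

def fuzzy_label(p_dog, temperature=2.0, ambiguous_band=(0.35, 0.65)):
    """Fuzzier choice: confident predictions are softened with a temperature,
    and images near the boundary are allowed to stay ambiguous."""
    logit = np.log(p_dog) - np.log(1.0 - p_dog)
    softened = 1.0 / (1.0 + np.exp(-logit / temperature))
    if ambiguous_band[0] <= softened <= ambiguous_band[1]:
        return "ambiguous", softened
    return ("dog" if softened >= 0.5 else "non-dog"), softened
```

Either wrapper is consistent with the same training data; picking one is our choice, not something the data settles.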

Human choices in preference recognition boundaries

The same will apply to AIs estimating human preferences. So we have three types of things here:

  • Uncertainty: this is when the AI is ignorant about something in the world. Can be resolved by further knowledge.

  • Fuzziness: this is how the AI resolves ambiguity between preference-relevant categories. It can look like uncertainty, but is actually an extrapolation choice, and can't be resolved by further knowledge (see the sketch after this list).

  • Extrapolation desiderata: extrapolation choices are what need to be made to construct a full classification or preference function from underdefined examples. Extrapolation desiderata are the formal and informal properties that we would want these extrapolation choices to have.
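A toy sketch of the difference between the first two (with made-up numbers, purely to fix ideas): evidence moves the uncertainty term, but the fuzziness weights are an extrapolation choice that evidence isn't supposed to move.

```python
import numpy as np

# Uncertainty: a belief over two world-states, resolved by evidence via Bayes.
prior = np.array([0.5, 0.5])
likelihood = np.array([0.9, 0.2])     # P(observation | state), made-up numbers
posterior = prior * likelihood
posterior /= posterior.sum()          # more observations push this towards one state

# Fuzziness: weights over value components, fixed as an extrapolation choice.
fuzzy_weights = np.array([0.45, 0.15, 0.40])

def overall_utility(world, utilities, weights=fuzzy_weights):
    """A fuzzy mix of utility functions: new knowledge changes what the
    utilities say about the world, not the weights of the mix itself."""
    return sum(w * u(world) for w, u in zip(weights, utilities))
```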

So when I wrote that, to avoid Goodhart problems, "the important thing is to correctly model my uncertainty and overconfidence", I can now refine that into:

  • The important thing is to correctly model my fuzziness, and my extrapolation desiderata.

Neat and elegant! However, to make it more applicable, I unfortunately need to extend it in a less elegant fashion:

  • The important thing is to correctly model my fuzziness, and my extrapolation desiderata, including any meta-desiderata I might have for how to model this correctly (and any errors I might be making, that I would desire to have recognised as errors).

Note that there is no longer any deep need to model "my" uncertainty. It is still important to model uncertainty about the real world correctly, and if I'm mistaken about the real world, this may be relevant to what I believe my extrapolation desiderata are. But modelling my uncertainty is merely instrumentally useful, whereas modelling my fuzziness is a terminal goal if we want to get it right.

As a minor example of the challenge of the above, consider that this process would have needed to be able to detect that adversarial examples were problematic before anyone had conceived of the idea.

I won't develop this too much more here, as the ideas will be included in my research agenda, whose first draft should be published here soon.