Distance Functions are Hard

[Epistemic status: Describes a failed research approach I had a while ago, and my only purpose here is to warn people off from that way of thinking. Every now and then I see someone working on an AIS subproblem say "if only we had a distance function for things in domain X", and my intuition is that they are probably doing a wrong-way reduction. But I only mean this as a soft guideline, and I'm only somewhat confident in my current thinking on this.]

~~~

Terminology: We use the terms distance or distance function to denote any function d that intuitively tells us how "dissimilar" any two members of a set X are (regardless of whether d is a metric).
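
To make the terminology concrete, here is a minimal illustration (my own, not from any particular source) of a "distance" that is widely used despite not being a metric: KL divergence between probability distributions is asymmetric and violates the triangle inequality, yet it is exactly the kind of dissimilarity measure meant above.

```python
# KL divergence: a standard "distance" between probability distributions
# that is not a metric (asymmetric, no triangle inequality).
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """How surprised you are seeing data from p while believing q."""
    return float(np.sum(p * np.log(p / q)))

p, q = np.array([0.9, 0.1]), np.array([0.5, 0.5])
assert kl_divergence(p, q) != kl_divergence(q, p)  # not symmetric
```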

Counterfactual Worlds

Consider the counterfactual "If Lincoln were not assassinated, he would not have been impeached". If we would like to say this has a truth value, we need to imagine what such a counterfactual world would have looked like: was it because Lincoln (somehow) survived his wounds, because John Wilkes Booth (somehow) missed, because the plot was (somehow) discovered the day before, etc.? Somehow, we must pick out the world that is in some sense "closest" to our actual world, but it seems very difficult to compare any two such worlds in a principled way.

To formalize Functional Decision Theory (FDT), we likely need a better understanding of counterfactuals. Yet even in restricted mathematical contexts, we don't have a satisfactory understanding of why "If 0 = 1..." simply returns incoherence, while "If the Modularity Theorem were false..." seemingly conjures up a possible world that we feel we can reason about.

(Also, in terms of corrigibility, we are often interested in formalizing the notion of "low-impact" agents, and the naive idea one often has is to define a distance metric on counterfactual world-states, as on p. 5 of Concrete Problems in AI Safety.)
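
To show what that naive idea amounts to, here is a minimal sketch; everything in it (the state vectors, the do-nothing counterfactual rollout, the choice of Euclidean norm) is a placeholder assumption, and picking the right distance is precisely the hard part.

```python
# Naive "low-impact" penalty: punish the agent in proportion to some distance
# between the world-state it produces and the counterfactual state in which
# it had taken no action. The Euclidean norm here is a stand-in.
import numpy as np

def impact_penalized_reward(task_reward: float,
                            state: np.ndarray,
                            counterfactual_state: np.ndarray,
                            weight: float = 1.0) -> float:
    """Task reward minus a penalty for straying from the do-nothing world."""
    distance = np.linalg.norm(state - counterfactual_state)  # the hard-to-define part
    return task_reward - weight * distance
```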

Algorithmic Similarity

In the FDT framework, we do not view ourselves as a solitary agent, but as a function (or algorithm) that can be copied, modified, and read, and we wish to maximize the utility achieved by our algorithm. Minor details of our implementation that don't affect our behavior (such as whether we are written in Java or Python) should not be decision-relevant, and if some algorithm does the same thing as us "most" of the time, then we would probably (e.g.) want to cooperate with it in a Prisoner's Dilemma. Defining what it means for two algorithms to be similar remains an outstanding open problem.

At MSFP 2018, a small group (4-5) of us tried tackling this for a couple hours. We had a few ideas that "felt" promising, but gradually realized that none of them made any sense, until ultimately we gave up with the feeling that we hadn't made any intellectual advances. I only say this to give outside-view evidence of intractability; it's difficult for me to concisely communicate why it's hard (I could say "try it yourself for an hour and you'll see", but part of my point is that the hour is better spent elsewhere). For those who insist on inside-view evidence, here's an outline of one of the ideas we had and why it turned out to be unworkable:

We attempted to partition algorithm-space into equivalence classes that represent "conceptual similarity", which should not be harder than defining a distance function on the space. By the Curry-Howard correspondence, we can rephrase this as asking when two proofs are similar (this felt easier for us to think about, but that's entirely subjective). Suppose we have some proof A of size n, and we want to find proofs that "don't use any fundamentally different ideas". The obvious approach is to think of which proofs we can get to with minor edits. If we make some edit of size ϵ⋅n for some small ϵ and the result is still a valid proof, it should be more or less the same. If we take the closure under minor edits that preserve validity, it would seem superficially plausible that this would result in proofs that are similar. However, suppose we discover a one-line proof B that's totally different from A: then we can append it to A as a minor edit, and then gradually delete A with minor edits, until we have a drastically different proof (among other complications).
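
Here is a toy illustration (my own, with made-up names and an oversimplified edit-size measure) of how the closure argument collapses: once B is appended as a one-line edit, A can be deleted a small chunk at a time, so the closure under "small" edits reaches a proof sharing nothing with A. Validity checking is elided; it doesn't rescue the construction.

```python
# Toy model of "closure under small edits" on proofs represented as lists of lines.
from collections import Counter

def edit_size(old: list[str], new: list[str]) -> int:
    """Number of lines inserted plus lines deleted (a crude stand-in for edit size)."""
    diff = (Counter(old) - Counter(new)) + (Counter(new) - Counter(old))
    return sum(diff.values())

def edit_is_small(old: list[str], new: list[str], eps: float = 0.1) -> bool:
    return edit_size(old, new) <= max(1, int(eps * len(old)))

def drift_from_A_to_B(proof_A: list[str], proof_B_line: str, eps: float = 0.1) -> list[str]:
    """Reach the unrelated one-line proof B from A through a chain of 'small' edits."""
    current = proof_A + [proof_B_line]                     # append B: a one-line edit
    assert edit_is_small(proof_A, current, eps)
    while len(current) > 1:                                # now delete A, a chunk at a time
        step = min(max(1, int(eps * len(current))), len(current) - 1)
        proposal = current[step:]
        assert edit_is_small(current, proposal, eps)
        current = proposal
    return current                                         # == [proof_B_line]
```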

Adversarial Examples

Given some data point x correctly classified by an ML model, a new point x′ := x + ϵ is an adversarial example if it is misclassified, despite differing from x only by a tiny perturbation ϵ (i.e. making relatively small RGB changes to a few pixels). For every state-of-the-art image classifier tested, given:

  • any image classified correctly by that model

  • any target class you would like the model to misclassify the image as

one can usually find some small perturbation of that image that the model believes is in the target class with high probability.
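
As a concrete (if hedged) sketch of how such perturbations are typically found, here is the targeted fast-gradient-sign method; the model, image tensor, target label, and budget eps are all placeholders, and published attacks are usually iterative refinements of this idea.

```python
# Targeted FGSM: nudge the image a few RGB values in the direction that makes
# the model most confident in the attacker's chosen class.
import torch
import torch.nn.functional as F

def fgsm_targeted(model: torch.nn.Module,
                  image: torch.Tensor,        # shape (1, 3, H, W), values in [0, 1]
                  target_class: int,
                  eps: float = 2.0 / 255) -> torch.Tensor:
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), torch.tensor([target_class]))
    loss.backward()
    # Descend the loss with respect to the target class, i.e. move toward the target.
    adversarial = image - eps * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```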

In the classic example, we can have GoogLeNet classify a panda as a gibbon with 99% confidence. Moreover, adversarial examples have been found to generalize very well across different models, even ones with very different architectures. Last year, a paper came out taking this further: it obtained adversarial examples with the best cross-generalization and gave these to humans who had only a few seconds to classify the image. Interestingly, the humans were "fooled" in the sense that their snap judgments (those formed by their pure visual system) differed from how they classified the images when given more time for reflection. In terms of robustness to these examples, it seems, our perceptual system by itself is not qualitatively better than today's classifiers, but our lens can see its own flaws.

The paper was popularized in various places under a bolder headline, namely that there now existed full-blown adversarial examples for humans (reflection or not). This was showcased with a picture from a different part of the paper, showing an image of a (somewhat dog-like) cat being given a tiny amount of noise and subsequently looking like a dog to a human with any amount of visual processing and top-down feedback. This sparked controversy, with many pointing out that a small change (in RGB values) to some visual concept does not necessarily correspond to a small change in concept-space. The paper itself punted on this:

it is philosophically difficult to define the real object class for an image that is not a picture of a real object. In this work, we assume that an adversarial image is misclassified if the output label differs from the human-provided label of the clean image that was used as the starting point for the adversarial image. We make small adversarial perturbations and we assume that these small perturbations are insufficient to change the true class.

And in response to comments, co-author Ian Goodfellow acknowledged on Twitter:

While everyone else was scrambling to finish running experiments for ICML, my co-authors and I were having intense debates about philosophy and semantics and how to write the paper. Some of our open office colleagues were entertained by how surreal this sounded.

Making models robust against adversarial examples remains an outstanding and difficult problem with a considerable paper trail. Merely verifying that a given model has no local adversarial examples (e.g. within a few RGB values of a given data point) has been the subject of some interesting formal verification work in the past couple of years. But to do even this verification work, one needs a formal specification of what an adversarial example is, which in turn requires a formal specification of what a "small change" between (e.g.) images is, one that somehow captures something about conceptual distance. It seems to me that even this smaller problem will be hard to solve in a philosophically satisfying way, because of the inherent subjectivity/fuzziness in defining "distance in concept-space" or anything that even comes close.
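
For contrast, the kind of specification that verification work can handle is easy to write down once "small change" is fixed by fiat as an L∞ ball in RGB space. The sketch below states that property with random sampling purely to make it concrete; actual verifiers prove it over the entire ball, and nothing here addresses conceptual distance.

```python
# Local-robustness specification: no input within eps (L-infinity, i.e. a few
# RGB values) of x should change the predicted class. Sampling is only a
# stand-in for a real verifier, which proves the property for the whole ball.
import torch

def locally_robust(model: torch.nn.Module, x: torch.Tensor,
                   eps: float = 2.0 / 255, num_samples: int = 1000) -> bool:
    with torch.no_grad():
        label = model(x).argmax(dim=1).item()
        for _ in range(num_samples):
            delta = (torch.rand_like(x) * 2 - 1) * eps          # a point in the L∞ ball
            if model((x + delta).clamp(0, 1)).argmax(dim=1).item() != label:
                return False                                    # counterexample found
    return True
```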

Distance Functions are Hard: The Evidence

What we are asking for, in all these instances, is some distance function precise enough to be mathematizable in some form, but robust enough to capture the many very fuzzy desiderata we have in mind. It seems natural to ask what distance functions of this form have been successfully developed before. The Encyclopedia of Distances comes out to over 700 pages, split roughly in half between the distances used in pure math (especially, as one would expect, topology, geometry, and functional analysis) and those used in applied math, computing disciplines, and the natural sciences.

Of the distance functions listed in the latter half, most were simply "the obvious thing one would do" given the preexisting mathematical structure around the topic in question (e.g. Levenshtein distance on strings; see the sketch below). Others were less obvious, but usually because they used nontrivial mathematical machinery to answer specific mathematical questions, not to actually shed light on the fuzzy philosophical questions one would have about the domain.
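
For instance, here is Levenshtein distance, included only to show how mechanical such a definition is once the underlying structure (strings and single-character edits) has been fixed:

```python
# Levenshtein distance: minimum number of single-character insertions,
# deletions, and substitutions turning one string into another (textbook
# two-row dynamic programming).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (or match)
        prev = curr
    return prev[len(b)]

assert levenshtein("kitten", "sitting") == 3
```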

In the social science section, where most topics had no preexisting mathematical formalism in the first place, virtually none of the distances did much to remedy this fuzziness by themselves. Though I do not claim to have spent that much time flipping through this tome, I never saw a distance notion that struck me as a profound non-mathematical insight, or that even gestured at an "art of coming up with distance functions".

Conclusions

I conclude, with medium confidence, that each of the questions posed in the first three sections will be particularly hard to answer in a satisfying way, and that if they are answered, it probably won't be by thinking about distance functions directly.

As a general heuristic, I feel like if you've reduced a philosophical problem to "defining the appropriate distance function", then it's worth pausing to consider if you've made a wrong-way reduction. Chances are, the distance function you want is inherently value-laden, and so the problem of defining it inherits the difficulty of the value alignment problem itself.

I also think this heuristic is especially salient if you're trying to capture something like "conceptual similarity/distance": if you could do this, then you'd have an objective map/taxonomy of (a large fraction of) concept-space.