# Distance Functions are Hard

[Epistemic status: Describes a failed research approach I had a while ago, and my only purpose here is to warn people off from that way of thinking. Every now and then I see someone working on an AIS subproblem say “if only we had a distance function for things in domain X”, and my intuition is that they are probably doing a wrong-way reduction. But I only mean this as a soft guideline, and I’m only somewhat confident in my current thinking on this.]

~~~

Terminology: We use the terms distance or distance function to denote any function d that intuitively tells us how “dissimilar” any two members of a set X are (regardless of whether d is a metric).
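As a minimal illustration of this terminology: KL divergence, a standard way to compare probability distributions, is a distance function in this loose sense but not a metric, since it is asymmetric (this example is mine, not from the post):

```python
import math

def kl_divergence(p, q):
    """KL divergence between two discrete distributions given as lists
    of probabilities. A respectable "distance function" in the loose
    sense used here, but not a metric: it is asymmetric and violates
    the triangle inequality."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

# Asymmetry: d(p, q) != d(q, p), so KL cannot be a metric.
print(kl_divergence(p, q), kl_divergence(q, p))
```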

## Counterfactual Worlds

Consider the counterfactual “If Lincoln were not assassinated, he would not have been impeached”. If we would like to say this has a truth value, we need to imagine what such a counterfactual world would have looked like: was it because Lincoln (somehow) survived his wounds, because John Wilkes Booth (somehow) missed, because the plot was (somehow) discovered the day before, etc.? Somehow, we must pick out the world that is in some sense “closest” to our actual world, but it seems very difficult to compare any two such worlds in a principled way.

To formalize Functional Decision Theory (FDT), we likely need to have a better understanding of counterfactuals, although even in restricted mathematical contexts, we don’t have a satisfactory understanding of why “If 0 = 1...” simply returns incoherence, yet “If the Modularity Theorem were false...” seemingly conjures up a possible world that we feel we can reason about.

(Also, in terms of corrigibility, we are often interested in formalizing the notion of “low-impact” agents, and the naive idea one often has is to define a distance metric on counterfactual world-states, as in p. 5 of Concrete Problems in AI Safety.)

## Algorithmic Similarity

In the FDT framework, we do not view ourselves as a solitary agent, but as a function (or algorithm) that can be copied, modified, and read, and we wish to maximize the utility achieved by our algorithm. Minor details of our implementation that don’t affect our behavior (such as whether we are written in Java or Python) should not be decision-relevant, and if some algorithm does the same thing as us “most” of the time, then we would probably (e.g.) want to cooperate with it in a Prisoner’s Dilemma. Defining what it means for two algorithms to be similar remains an outstanding open problem.

At MSFP 2018, a small group (4–5) of us tried tackling this for a couple hours, had a few ideas that “felt” promising, but gradually realized that none of these made any sense, until ultimately we gave up with the feeling that we hadn’t made any intellectual advances. I only say this to give outside-view evidence of intractability, but it’s difficult for me to concisely communicate why it’s hard (I could say “try it yourself for an hour and you’ll see”, but part of my point is that that hour is better spent elsewhere). For those who insist on inside-view evidence, here’s an outline of one of the ideas we had and why it turned out to be unworkable:

We attempted to partition algorithm-space into equivalence classes that represent “conceptual similarity”, which should be no harder than defining a distance function on the space. By the Curry–Howard correspondence, we can rephrase this as asking when two proofs are similar (this felt easier for us to think about, but that’s entirely subjective). Suppose we have some proof A of size n, and we want to find proofs that “don’t use any fundamentally different ideas”. The obvious approach is to think of which proofs we can get to with minor edits. If we make some edit of size ϵ⋅n for some small ϵ and the result is still a valid proof, it should be more or less the same. If we take the closure under minor edits that preserve validity, it would seem superficially plausible that this would result in proofs that are similar. However, suppose we discover a one-line proof B that’s totally different from A: then we can append it to A as a minor edit, then gradually delete A with minor edits, until we have a drastically different proof (among other complications).

## Adversarial Examples

Given some data point x correctly classified by an ML model, a new point x′ := x + ϵ is an adversarial example if it is now misclassified, despite only differing from x by a tiny amount ϵ (i.e. making relatively small RGB changes to a few pixels). For every state-of-the-art image classifier tested, given:

• Any image classified correctly by that model

• Any target class you would like to have the model misclassify the image as

one can usually find some small perturbation of that image that the model believes is in the target class with high probability.
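The perturbation x′ := x + ϵ can be made concrete with a toy sketch. The model below is a made-up logistic regression standing in for an image classifier (all weights and inputs are invented for illustration), and the perturbation follows the fast-gradient-sign idea: step each input coordinate by ϵ in the direction that most increases the wrong class’s probability:

```python
import numpy as np

# Toy stand-in for an image classifier: logistic regression on a
# 4-"pixel" input. The weights are made up for illustration.
w = np.array([1.0, -2.0, 0.5, 3.0])

def predict(x):
    """P(class 1 | x) under the toy model."""
    return 1.0 / (1.0 + np.exp(-w @ x))

x = np.array([0.2, -0.1, 0.4, -0.3])   # correctly classified as class 0

# Fast-gradient-sign-style perturbation: for a linear logit, the
# gradient of the logit w.r.t. the input is just w, so stepping by
# eps * sign(w) is the bounded change that most increases P(class 1).
eps = 0.4
x_adv = x + eps * np.sign(w)

print(predict(x), predict(x_adv))  # the small perturbation flips the prediction
```

Real attacks differ mainly in scale: the input has millions of dimensions, the gradient comes from backpropagation, and ϵ is small enough to be imperceptible.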

In the classic example, GoogLeNet classifies a panda as a gibbon with 99% confidence. Moreover, these adversarial examples have been found to generalize very well across different models, even with very different architectures. Last year, a paper came out taking this further, by obtaining adversarial examples with the best cross-generalization, and giving these to humans who had only a few seconds to classify the image. Interestingly, the humans were “fooled” in the sense that their snap judgments—those formed by their pure visual system—differed from how they classified the images when given more time for reflection. In terms of robustness to these examples, it seems, our perceptual system by itself is not qualitatively better than today’s classifiers, but our lens can see its own flaws.

The paper was popularized in various places under a bolder headline, namely that there now existed full-blown adversarial examples for humans (reflection or not). This was showcased with a picture from a different part of the paper showing an image of a (somewhat dog-like) cat being given a tiny amount of noise, and subsequently looking like a dog to a human with any amount of visual processing and top-down feedback. This sparked controversy, with many pointing out that a small change (in RGB values) to some visual concept does not necessarily correspond to a small change in concept-space. The paper itself punted on this:

> it is philosophically difficult to define the real object class for an image that is not a picture of a real object. In this work, we assume that an adversarial image is misclassified if the output label differs from the human-provided label of the clean image that was used as the starting point for the adversarial image. We make small adversarial perturbations and we assume that these small perturbations are insufficient to change the true class.

And in response to comments, co-author Ian Goodfellow acknowledged on Twitter:

> While everyone else was scrambling to finish running experiments for ICML, my co-authors and I were having intense debates about philosophy and semantics and how to write the paper. Some of our open office colleagues were entertained by how surreal this sounded.

Making models robust against adversarial examples remains an outstanding and difficult topic with a considerable paper trail. The problem of merely verifying that a given model has no local adversarial examples (e.g. within a few RGB values of a given data point) has been the subject of some interesting formal verification work in the past couple years. But to even do this verification work, one needs a formal specification of what an adversarial example is, which in turn requires a formal specification of what a “small change” between (e.g.) images is, that somehow captures something about conceptual distance. It seems to me that even this smaller problem will be hard to solve in a philosophically satisfying way because of the inherent subjectivity/fuzziness in defining “distance in concept-space” or anything that even comes close.

## Distance Functions are Hard: The Evidence

What we are asking for, in all these instances, is some distance function precise enough to be mathematizable in some form, but robust enough to include many very fuzzy desiderata we have in mind. It seems natural to ask what distance functions of this form have been successfully developed before. The Encyclopedia of Distances comes out to over 700 pages, split roughly in half between those distances used in pure math (especially, as one would expect, topology, geometry, and functional analysis), and those used in applied math, computing disciplines, and the natural sciences.

Of the distance functions listed in the latter half, most were simply “the obvious thing one would do” given the preexisting mathematical structure around the topic in question (e.g. Levenshtein distance on strings). Others were less obvious, but usually because they used nontrivial mathematical machinery to answer specific mathematical questions, not to actually shed light on the fuzzy philosophical questions one would have about the topic.
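Levenshtein distance is a good example of “the obvious thing one would do” once you have strings and edit operations. A minimal dynamic-programming sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b: the "obvious" distance one
    writes down given the preexisting structure of strings and edits."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Note that the definition falls straight out of the structure of the domain; it encodes no philosophical insight about which strings are “conceptually” similar.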

In the social science section, where no existing mathematical formalism existed for most of the topics in the first place, virtually none of the distances particularly helped to remedy this fuzziness by themselves. Though I do not claim to have spent that much time flipping through this tome, never did I see a distance notion that struck me as a profound non-mathematical insight, or that even gestured at an “art of coming up with distance functions”.

## Conclusions

I conclude, with medium confidence, that each of the questions posed in the first 3 sections will be particularly hard to answer in a satisfying way, and that if they are answered, it probably won’t be by thinking about distance functions directly.

As a general heuristic, I feel like if you’ve reduced a philosophical problem to “defining the appropriate distance function”, then it’s worth pausing to consider if you’ve made a wrong-way reduction. Chances are, the distance function you want is inherently value-laden, and so the problem of defining it inherits the difficulty of the value alignment problem itself.

I also think this heuristic is especially salient if you’re trying to capture something like “conceptual similarity/distance”: if you could do this, then you’d have an objective map/taxonomy of (a large fraction of) concept-space.

• Learning a distance function between pictures of human faces has been used successfully to train deep-learning-based face recognition systems.

My takeaway from your examples is not that “distance functions are hard” so much as “hardcoding is brittle”. The general approach of “define a distance function and train a model based on it” has been pretty successful in machine learning.

• At the same time, the importance of having a good distance/divergence, the lack of appropriate ones, and the difficulty of learning them are widely acknowledged challenges in machine learning.

A distance function is fairly similar to a representation in my mind, and high-quality representation learning is considered a bit of a holy grail open problem.

Machine learning relies on formulating *some* sort of objective, which can be viewed as analogous to the choice of a good distance function, so I think the central point of the post (as I understood it from a quick glance) is correct: “specifying a good distance measure is not that much easier than specifying a good objective”.

It’s also an open question how much learning, (relatively) generic priors, and big data can actually solve the issue of weak learning signals and weak priors for us. A lot of people are betting pretty hard on that; I think it’s plausible, but not very likely. I think it’s more like a recipe for unaligned AI, and we need to get more bits of information about what we actually want into AI systems somehow. Highly interactive training protocols seem super valuable for that, but the ML community has a strong preference against such work because it is a massive pain compared to the non-interactive UL/SL/RL settings that are popular.

• Why are highly interactive training protocols a massive pain?

Do you have any thoughts on self-supervised learning? That’s my current guess for how we’ll get AGI, and it’s a framework that makes the alignment problem seem relatively straightforward to me.

• They’re a pain because they involve a lot of human labor, slow down the experiment loop, make reproducing results harder, etc.

RE self-supervised learning: I don’t see why we needed the rebranding (of unsupervised learning). I don’t see why it would make alignment straightforward (ETA: except to the extent that you aren’t necessarily, deliberately building something agenty). The boundaries between SSL and other ML are fuzzy; I don’t think we’ll get to AGI using just SSL and nothing like RL. SSL doesn’t solve the exploration problem; if you start caring about exploration, I think you end up doing things that look more like RL.

I also tend to agree (e.g. with that gwern article) that AGI designs that aren’t agenty are going to be at a significant competitive disadvantage, so probably aren’t a satisfying solution to alignment, but could be a stop-gap.

• > They’re a pain because they involve a lot of human labor, slow down the experiment loop, make reproducing results harder, etc.

I see. How about doing active learning of computable functions? That solves all 3 problems.

Instead of standard benchmarks, you could offer an API which provides an oracle for some secret functions to be learned. You could run a competition every X months and give each competition entrant a budget of Y API calls over the course of the competition.
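A minimal sketch of what such a budgeted oracle API might look like (the class name, the secret function, and the budget below are all invented for illustration; the comment only specifies "an API" and "a budget of Y calls"):

```python
class OracleAPI:
    """Sketch of the proposed competition setup: entrants query a
    secret computable function through a budgeted oracle instead of
    receiving a labeled dataset up front."""

    def __init__(self, secret_fn, budget):
        self._fn = secret_fn        # kept private by the organizers
        self.remaining = budget     # Y API calls per entrant

    def query(self, x):
        """Return the secret function's value at x, spending one call."""
        if self.remaining <= 0:
            raise RuntimeError("API call budget exhausted")
        self.remaining -= 1
        return self._fn(x)

# A hypothetical secret target function and a budget of 3 calls.
oracle = OracleAPI(secret_fn=lambda x: x * x + 1, budget=3)
print(oracle.query(2), oracle.query(3), oracle.remaining)
# a fourth and fifth call would still succeed only once more;
# after that, query raises RuntimeError
```

This addresses the three pains directly: no human labeling, the loop runs at machine speed, and sharing the secret function's code afterwards makes results reproducible.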

> RE self-supervised learning: I don’t see why we needed the rebranding (of unsupervised learning).

Well, I don’t see why neural networks needed to be rebranded as “deep learning” either :-)

When I talk about “self-supervised learning”, I refer to chopping up your training set into automatically created supervised learning problems (predictive processing), which feels different from clustering/dimensionality reduction. It seems like a promising approach regardless of what you call it.

> I don’t see why it would make alignment straightforward (ETA: except to the extent that you aren’t necessarily, deliberately building something agenty).

In order to make accurate predictions about reality, you need to understand humans, because humans exist in reality. So at the very least, a superintelligent self-supervised learning system trained on loads of human data would have a lot of conceptual building blocks (developed in order to make predictions about its training data) which could be tweaked and combined to make predictions about human values (analogous to fine-tuning in the context of transfer learning). But I suspect fine-tuning might not even be necessary. Just ask it what Gandhi would do or something like that.

Re: gwern’s article, RL does not seem to me like a good fit for most of the problems he describes. I agree active learning/interactive training protocols are powerful, but that’s not the same as RL.

Autonomy is also nice (and also not the same as RL). I think the solution for autonomy is (1) solve calibration/distributional shift, so the system knows when it’s safe to act autonomously, and (2) have the system adjust its own level of autonomy/need for clarification dynamically depending on the apparent urgency of its circumstances. I have notes for a post about (2); let me know if you think I should prioritize writing it.

• > I see. How about doing active learning of computable functions? That solves all 3 problems.

^ I don’t see how?

I should elaborate… it sounds like you’re thinking of active learning (where the AI can choose to make queries for information, e.g. labels), but I’m talking about *inter*active training, where a human supervisor is *also* actively monitoring the AI system, making queries of it, and intelligently selecting feedback for the AI. This might be simulated as well, using multiple AIs, and there might be a lot of room for good work there… but I think if we want to solve alignment, we want a deep and satisfying understanding of AI systems, which seems hard to come by without rich feedback loops between humans and AIs. Basically, by interactive training, I have in mind something where training AIs looks more like teaching other humans.

> So at the very least, a superintelligent self-supervised learning system trained on loads of human data would have a lot of conceptual building blocks (developed in order to make predictions about its training data) which could be tweaked and combined to make predictions about human values (analogous to fine-tuning in the context of transfer learning).

I think it’s a very open question how well we can expect advanced AI systems to understand or mirror human concepts by default. Adversarial examples suggest we should be worried that apparently similar concepts will actually be wildly different in non-obvious ways. I’m cautiously optimistic, since this could make things a lot easier. It’s also unclear ATM how precisely AI concepts need to track human concepts in order for things to work out OK. The “basin of attraction” line of thought suggests that they don’t need to be that great, because they can self-correct or learn to defer to humans appropriately. My problem with that argument is that it seems like we will have so many chances to fuck up that we would need 1) AI systems to be extremely reliable, or 2) for catastrophic mistakes to be rare, and minor mistakes to be transient or detectable. (2) seems plausible to me in many applications, but probably not all of the applications where people will want to use SOTA AI.

> Re: gwern’s article, RL does not seem to me like a good fit for most of the problems he describes. I agree active learning/interactive training protocols are powerful, but that’s not the same as RL.

Yes ofc they are differ­ent.

I think the significant features of RL algorithms here are: 1) having the goal of understanding the world and how to influence it, and 2) doing (possibly implicit) planning. RL can also be pointed at narrow domains, but for a lot of problems, I think having general knowledge will be very valuable, and hard to replicate with a network of narrow systems.

> I think the solution for autonomy is (1) solve calibration/distributional shift, so the system knows when it’s safe to act autonomously (2) have the system adjust its own level of autonomy/need for clarification dynamically depending on the apparent urgency of its circumstances.

That seems great, but also likely to be very difficult, especially if we demand high reliability and performance.

• ^ I don’t see how?

No human labor: Just compute the function. Fast experiment loop: Computers are faster than humans. Reproducible: Share the code for your function with others.

I think for a sufficiently advanced AI system, assuming it’s well put together, active learning can beat this sort of interactive training—the AI will be better at the task of identifying & fixing potential weaknesses in its models than humans.

> Adversarial examples suggest we should be worried that apparently similar concepts will actually be wildly different in non-obvious ways.

I think the problem with adversarial examples is that deep neural nets don’t have the right inductive biases. I expect meta-learning approaches which identify & acquire new inductive biases (in order to determine “how to think” about a particular domain) will solve this problem and will also be necessary for AGI anyway.

BTW, different human brains appear to learn different representations (previous discussion), and yet we are capable of delegating tasks to each other.

> I’m cautiously optimistic, since this could make things a lot easier.

Huh?

> My problem with that argument is that it seems like we will have so many chances to fuck up that we would need 1) AI systems to be extremely reliable, or 2) for catastrophic mistakes to be rare, and minor mistakes to be transient or detectable. (2) seems plausible to me in many applications, but probably not all of the applications where people will want to use SOTA AI.

Maybe. But my intuition is that if you can create a superintelligent system, you can make one which is “superhumanly reliable” even in domains which are novel to it. I think the core problems for reliable AI are very similar to the core problems for AI in general. An example is the fact that solving adversarial examples and improving classification accuracy seem intimately related.

> I think the significant features of RL algorithms here are: 1) having the goal of understanding the world and how to influence it, and 2) doing (possibly implicit) planning.

In what sense does RL try to understand the world? It seems very much not focused on that. You essentially have to hand it a reasonably accurate simulation of the world (i.e. a world that is already fully understood, in the sense that we have a great model for it) for it to do anything interesting.

If the planning is only “implicit”, RL sounds like overkill and probably not a great fit. RL seems relatively good at long sequences of actions for a stateful system we have a great model of. If most of the value can be obtained by planning 1 step in advance, RL seems like a solution to a problem you don’t have. It is likely to make your system less safe, since planning many steps in advance could let it plot some kind of treacherous turn. But I also don’t think you will gain much through using it. So luckily, I don’t think there is a big capabilities vs safety tradeoff here.

> I think having general knowledge will be very valuable, and hard to replicate with a network of narrow systems.

Agreed. But general knowledge is also not RL, and is handled much more naturally in other frameworks such as transfer learning, IMO.

So basically I think daemons/inner optimizers/whatever you want to call them are going to be the main safety problem.

• Yes, perhaps I should’ve been more clear. Learning certain distance functions is a practical solution to some things, so maybe the phrase “distance functions are hard” is too simplistic. What I meant to say is more like:

> Fully-specified distance functions are hard, over and above the difficulty of formally specifying most things, and it’s often hard to notice this difficulty.

This is mostly applicable to Agent Foundations-like research, where we are trying to give a formal model of (some aspect of) how agents work. Sometimes, we can reduce our problem to defining the appropriate distance function, and it can feel like we’ve made some progress, but we haven’t actually gotten anywhere (the first two examples in the post are like this).

The 3rd example, where we are trying to formally verify an ML model against adversarial examples, is a bit different now that I think of it. Here we apparently need a transparent, formally-specified distance function if we have any hope of absolutely proving the absence of adversarial examples. And in formal verification, the specification problem often is just philosophically hard like this. So I suppose this example is less insightful, except insofar as it lends extra intuitions for the other class of examples.

• > Here we apparently need a transparent, formally-specified distance function if we have any hope of absolutely proving the absence of adversarial examples.

Well, a classifier that is 100% accurate would also do the job ;) (I’m not sure a 100% accurate classifier is feasible per se, but a classifier which can be made arbitrarily accurate given enough data/compute/life-long learning experience seems potentially feasible.)

Also, small perturbations aren’t necessarily the only way to construct adversarial examples. Suppose I want to attack a model M1, which I have access to, and I also have a more accurate model M2. Then I could execute an automated search for cases where M1 and M2 disagree. (Maybe I use gradient descent on the input space, maximizing an objective function corresponding to the level of disagreement between M1 and M2.) Then I hire people on Mechanical Turk to look through the disagreements and flag the ones where M1 is wrong. (Since M2 is more accurate, M1 will “usually” be wrong.)
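A toy sketch of this disagreement search, with random search standing in for the gradient descent described above, and two invented one-dimensional "models" in place of M1 and M2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stand-in models, both invented for illustration: M2 plays the
# role of the "more accurate" reference, M1 a smooth approximation.
def m1(x):
    return 1.0 / (1.0 + np.exp(-3.0 * x))   # sigmoid approximation

def m2(x):
    return float(x > 0.0)                    # sharp "ground truth"

def disagreement(x):
    """Objective to maximize: how much the two models disagree at x."""
    return abs(m1(x) - m2(x))

# Automated search for disagreements over the input space (random
# search here; the comment above proposes gradient descent instead).
candidates = rng.uniform(-2.0, 2.0, size=1000)
worst = max(candidates, key=disagreement)

print(worst, disagreement(worst))  # disagreement peaks near M2's decision boundary
```

The flagged inputs cluster where the cheap model's smooth boundary diverges from the reference model's sharp one, which is exactly the region a Mechanical Turk pass would then label.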

This is actually one way to look at what’s going on with traditional small-perturbation adversarial examples. M1 is a deep learning model and M2 is a 1-nearest-neighbor model—not very good in general, but quite accurate in the immediate region of data points with known labels. The problem is that deep learning models don’t have a very strong inductive bias towards mapping nearby inputs to nearby outputs (sometimes called “Lipschitzness”). L2 regularization actually makes deep learning models more Lipschitz, because smaller coefficients = smaller singular values for weight matrices = less capacity to stretch nearby inputs away from each other in output space. I think maybe that’s part of why L2 regularization works.
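The "smaller coefficients = less stretching" step can be checked directly for a single linear layer x ↦ Wx, where the tightest Lipschitz constant under the L2 norm is the largest singular value of W (a sketch of that one step only, not a claim about full deep networks, where the product of per-layer norms gives just an upper bound):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 5))  # stand-in weight matrix for one linear layer

def lipschitz_bound(M):
    """Tightest Lipschitz constant of x -> M @ x under the L2 norm:
    the largest singular value (spectral norm) of M."""
    return np.linalg.svd(M, compute_uv=False)[0]

# Shrinking the weights, the direction L2 regularization pushes in,
# shrinks the maximum stretching the layer can apply to nearby inputs.
shrunk = 0.5 * W
print(lipschitz_bound(W), lipschitz_bound(shrunk))

# Sanity check against the definition: no pair of nearby inputs is
# stretched by more than the bound.
x = rng.normal(size=5)
dx = 1e-3 * rng.normal(size=5)
stretch = np.linalg.norm(W @ (x + dx) - W @ x) / np.linalg.norm(dx)
print(stretch <= lipschitz_bound(W) + 1e-9)
```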

Hoping to expand the previous two paragraphs into a paper with Matthew Barnett before too long—if anyone wants to help us get it published, please send me a PM (neither of us has ever published a paper before).

• I’m not convinced conceptual distance metrics must be value-laden. Represent each utility function by an AGI. Almost all of them should be able to agree on a metric such that each could adopt that metric in its thinking losing only negligible value. The same could not be said for agreeing on a utility function. (The same could be said for agreeing on a utility-parametrized AGI design.)

• > Represent each utility function by an AGI. Almost all of them should be able to agree on a metric such that each could adopt that metric in its thinking losing only negligible value.

This implies a measure over utility functions. It’s probably true under the Solomonoff measure, but abstract though they are, these are values.

• I think it’s that any basis set I define in a super-high-dimensional space could be said to be value-laden, though it might be tacit and I have little idea what it is. If I care about ‘causal structure’ or something, that’s still relative to the sorts of affordances that are relevant to me in the space?

• Is this the same value payload that makes activists fight over language to make human biases work for their side? I don’t think this problem translates to AI: if the AGIs find that some metric induces some bias, each can compensate for it.

• It’s sort of true that the correct distance function depends on your values. A better way to say it is that different distance functions are appropriate for different tasks, and they will be “better” or “worse” depending on how much you care about those tasks. But I don’t think asking for the “best” metric in this sense is helpful, because you don’t have to use the same metric for all tasks involving a certain space. Sometimes you want air distance, sometimes travel times. Maybe you have to decide because you’re computationally limited, but that’s not philosophically relevant.

With that in mind, my attempts at two of your examples. The adversarial examples first, because it’s the clearest question: I think the problem is that you are thinking too abstractly. I don’t think there is a meaningful sense of “concept similarity” that’s purely logical, i.e. independent of the actual world. The intuitive sense of similarity you’re trying to use here is probably something like this: over the space of images, you want the probability measure of encountering them. Then you get a metric where two subsets of image-space which are isomorphic under the metric always have the same measure. That is your similarity measure.

Counterfactuals usually involve some sort of probability distribution, which is then “updated” on the condition of the counterfactual being true, and then the consequent is judged under that distribution. What the initial distribution is depends on what you’re doing. In the case of Lincoln, it’s probably reasonable expectations of the future from before the assassination. But for something like “What if conservation of energy weren’t true”, it’s probably our current distribution over physics theories. Basically, what’s the most likely alternative. The mathematical example is a bit different. There are a lot of ways to conclude a contradiction from 0 = 1, but it’s very hard to deduce a contradiction from denying the Modularity Theorem. If you were to just randomly perform logical inferences from “the Modularity Theorem is wrong”, then there is a subset of propositions, not including any claim that is a direct negation of another in it, that your deductions are unlikely to lead you out of (it matters, of course, in what way it is random, but it evidently works for “human mathematician who hasn’t seen the proof yet”).

• “If Lincoln were not assassinated, he would not have been impeached” is a probabilistic statement that is not at all about THE Lincoln. It’s a reference class analysis of leaders who did not succumb to premature death and had leadership, economy, etc. metrics similar to Lincoln’s. There is no “counterfactual” there in any interesting sense. It is not about the minute details of avoiding the assassination. If you state the apparent counterfactual more precisely, it would be something like:

> There is a 90% probability of a ruler with [list of characteristics matching Lincoln, according to some criteria] serving out his term.

So, there is no issue with “If 0 = 1...” here, unlike with the other one, “If the Modularity Theorem were false”, which implies some changes in the very basics of mathematics, though one can also argue for the reference class approach there.

• I feel like this is practically a frequentist/bayesian disagreement :D It seems “obvious” to me that “If Lincoln were not assassinated, he would not have been impeached” can be about the real Lincoln as much as me saying “Lincoln had a beard” is, because both are statements made using my model of the world about this thing I label Lincoln. No reference class necessary.

• I am not sure if labels help here. I’m simply pointing out that logical counterfactuals applied to the “real Lincoln” lead to the sort of issues MIRI is facing right now when trying to make progress on theoretical AI alignment issues. The reference class approach removes the difficulties, but then it is hard to apply it to “mathematical facts”, like what is the probability of the 100...0th digit of pi being 0, or, to quote the OP, “If the Modularity Theorem were false...”, and the prevailing MIRI philosophy does not allow treating logical uncertainty as environmental.

• Sure. In the case of Lincoln, I would say the problem is solved by models even as clean as Pearlian causal networks. But in math, there’s no principled causal network model of theorems to support counterfactual reasoning as causal calculus.

Of course, I more or less just think that we have an unprincipled causality-like view of math that we take when we think about mathematical counterfactuals, but it’s not clear that this is any help to MIRI understanding proof-based AI.

• I don’t think I am following your argument. I am not sure what Pearl’s causal networks are and how they help here, so maybe I need to read up on it.