
# The recent NeurIPS call for papers requires authors to include a statement about the potential broader impact of their work

24 Feb 2020 7:44 UTC
12 points
• My understanding is that amplification-based approaches are meant to tackle inner alignment by using amplified systems that are already trusted (e.g. humans + many invocations of a trusted model) to mitigate inner alignment problems in the next (slightly more powerful) models that are being trained. A few approaches for this have already been suggested (I’m not aware of published empirical results); see Evan’s comment for some pointers.

I hope a lot more research will be done on this topic. It’s not clear to me whether we should expect to have amplified systems that allow us to mitigate inner alignment risks to a satisfactory extent before the point where we have x-risk-posing systems, how we can make that more likely, and, if it’s not feasible, how we can realize that as soon as possible.

• 15 Feb 2020 10:57 UTC
LW: 3 AF: 2

It might be that the evolving-to-extinction policy of making the world harder to predict through logs is complicated enough that it can only emerge through a deceptive ticket deciding to pursue it—or it could be the case that it’s simple enough that one ticket could randomly start writing stuff to logs, get selected for, and end up pursuing such a policy without ever actually having come up with it explicitly.

I’m not sure about the latter. Suppose there is a “simple” ticket that randomly writes stuff to the logs in a way that makes future training examples harder to predict. I don’t see what would cause that ticket to be selected for.

• This doesn’t require AI; it happens anywhere that competing prices are easily available and fairly mutable.

It happens without AI to some extent, but if a lot of businesses end up setting prices via RL-based systems (which seems likely to me), then I think it may happen to a much greater extent. Consider that in the example above it may be very hard for the five barbers to coordinate a $3 price increase without any communication (and without AI) if, by assumption, the only Nash equilibrium is the state where all five barbers charge market prices.

• AI will be no more nor less liable than humans making the same decisions would be.

People sometimes go to jail for illegally coordinating prices with competitors; I don’t see how an antitrust enforcement agency will hold anyone liable in the above example.

• Suppose the code of the deep RL algorithm that was used to train the huge policy network is publicly available on GitHub, as well as everything else that was used to train the policy network, plus the final policy network itself. How can an antitrust enforcement agency use all this to determine whether an antitrust violation has occurred (in the above example)?

• I’m curious how antitrust enforcement will be able to deal with progress in AI. (I know very little about antitrust laws.) Imagine a small town with five barbershops. Suppose an antitrust law makes it illegal for the five barbershop owners to have a meeting in which they all commit to increase prices by $3.

Suppose that each of the five barbershops decides to start using some off-the-shelf deep-RL-based solution to set their prices. Suppose that after some time in which they’re all using such systems, lo and behold, they all gradually increase prices by $3. If the relevant government agency notices this, who can they potentially accuse of committing a crime? Each barbershop owner is just setting their prices to whatever their off-the-shelf system recommends (and that system is a huge neural network that no one understands at a relevant level of abstraction).
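To make the worry concrete, here is a minimal toy sketch (my own illustration, not something from the discussion) of two pricing agents of the sort imagined above: independent ε-greedy Q-learners that condition only on last round’s prices. The price grid, demand model, and all parameters are made up, and whether the learners settle above the competitive price is sensitive to them:

```python
import random
from collections import defaultdict

PRICES = [10, 11, 12, 13]  # hypothetical price grid; 10 is the competitive price
COST = 5

def profits(p1, p2):
    # Toy demand: 10 customers per round; the cheaper barber gets most of them.
    if p1 < p2:
        d1, d2 = 8, 2
    elif p1 > p2:
        d1, d2 = 2, 8
    else:
        d1 = d2 = 5
    return (p1 - COST) * d1, (p2 - COST) * d2

# One Q-table per barber; state = last round's price pair. No channel of
# communication exists anywhere in this loop.
Q = [defaultdict(float), defaultdict(float)]
state = (10, 10)
eps, alpha, gamma = 0.1, 0.1, 0.9

for t in range(200_000):
    acts = []
    for i in (0, 1):
        if random.random() < eps:
            acts.append(random.choice(PRICES))  # explore
        else:
            acts.append(max(PRICES, key=lambda p: Q[i][(state, p)]))  # exploit
    rewards = profits(*acts)
    nxt = (acts[0], acts[1])
    for i in (0, 1):
        best_next = max(Q[i][(nxt, p)] for p in PRICES)
        Q[i][(state, acts[i])] += alpha * (rewards[i] + gamma * best_next
                                           - Q[i][(state, acts[i])])
    state = nxt

print("prices the learners settled on:", state)
```

The point of the sketch is that no step in it looks like an agreement: if the two learners end up resting at a supra-competitive price pair, there is no meeting, message, or even an interpretable intention for an enforcement agency to point at.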

• Very interesting :)

I suspect the model is making a hidden assumption about the lack of “special projects”; e.g. the model assumes there can’t be a single project that yields a bonus that makes all the other projects’ tasks instantly solvable?

Also, I’m not sure that the model allows us to distinguish between scenarios in which a major part of overall progress is very local (e.g. happens within a single company) and more Hansonian scenarios in which the contribution to progress is well distributed among many actors.

• the failure mode of an amoral AI system that doesn’t care about you seems both more likely and more amenable to technical safety approaches (to me at least).

It seems to me that at least some parts of this research agenda are relevant for some special cases of “the failure mode of an amoral AI system that doesn’t care about you”. A lot of contemporary AIS research assumes some kind of human-in-the-loop setup (e.g. amplification/debate, recursive reward modeling), and for such setups it seems relevant to consider questions like “under what circumstances do humans interacting with an artificial agent become convinced that the agent’s commitments are credible?”. Such questions seem relevant under a very wide range of moral systems (including ones that don’t place much weight on s-risks).

• The following quoted texts are from this post by Scott Alexander:

Alan Turing:

Let us now assume, for the sake of argument, that these machines are a genuine possibility, and look at the consequences of constructing them. To do so would of course meet with great opposition, unless we have advanced greatly in religious tolerance since the days of Galileo. There would be great opposition from the intellectuals who were afraid of being put out of a job. It is probable though that the intellectuals would be mistaken about this. There would be plenty to do in trying to keep one’s intelligence up to the standards set by the machines, for it seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers… At some stage therefore we should have to expect the machines to take control.

[EDIT: a similar text, attributed to Alan Turing, appears here (from the last paragraph) - continued here.]

I. J. Good:

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make…

[EDIT: I haven’t managed to verify it yet, but it seems that that last quote is from a 58-page paper by I. J. Good, titled Speculations Concerning the First Ultraintelligent Machine; here is an archived version of the broken link in Scott’s post.]

• I want to flag that—in the case of evolutionary algorithms—we should not assume here that the fitness function is defined with respect to just the current batch of images, but rather with respect to, say, all past images so far (since the beginning of the entire training process); otherwise the selection pressure is “myopic” (i.e. models that outperform others on the current batch of images have higher fitness).
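To spell out the contrast, a minimal sketch (my own; the scalar “model” and squared-error loss are stand-in placeholders):

```python
import random

def loss(model, batch):
    # Stand-in loss: treat a "model" as a single scalar prediction.
    return sum((model - x) ** 2 for x in batch) / len(batch)

def fitness_myopic(model, current_batch):
    # Selection pressure comes only from the current batch: a model that
    # happens to do well right now is favored, whatever it did before.
    return -loss(model, current_batch)

def fitness_cumulative(model, history):
    # Selection pressure comes from every batch seen since the beginning
    # of training, so past behavior keeps counting against a model.
    return -sum(loss(model, b) for b in history) / len(history)

history = [[random.gauss(0, 1) for _ in range(16)] for _ in range(100)]
model = 0.1
print(fitness_myopic(model, history[-1]), fitness_cumulative(model, history))
```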

• instead, if there are hyperparameters that prevent the error rate going below 0.1, these will be selected by gradient descent as giving a better performance.

I don’t follow this point. If we’re talking about using SGD to update (hyper)parameters, using a batch of images from the currently used datasets, then the gradient update would be determined by the gradient of the loss with respect to that batch of images.
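For concreteness, a minimal sketch of the update I have in mind (the model, data, and dimensions are placeholders):

```python
import torch

model = torch.nn.Linear(4, 2)                        # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))  # one batch of data

loss = torch.nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()  # gradients are taken w.r.t. the loss on this batch only
opt.step()       # so the update can't "select for" keeping error at 0.1
                 # unless that somehow shows up in this batch's gradient
```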

• Let H:Q→A be a human.

[...]

Let Amp(H,M)(Q)=H(“What answer would you give to Q given access to M?”).

Nitpick: is Amp(H,M) meant to be defined here as a human with access to M?
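For what it’s worth, here is how I read the quoted definition as code (a rough sketch; the types and names are mine, and the comment marks where the nitpick bites):

```python
from typing import Callable

Question = str
Answer = str

# H maps questions to answers; M is a trained model with the same signature.
H = Callable[[Question], Answer]
M = Callable[[Question], Answer]

def amp(h: H, m: M) -> Callable[[Question], Answer]:
    def amplified(q: Question) -> Answer:
        # As literally written, h only receives the *text* below and never
        # calls m. The nitpick: presumably h is meant to be a human who can
        # actually query m while producing the answer.
        return h(f"What answer would you give to {q} given access to M?")
    return amplified
```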

• So: is it possible to formulate an instrumental version of Occam? Can we justify a simplicity bias in our policies?

Maybe problems that don’t have simple solutions (i.e. all their solutions have a large description length) are usually intractable for us. If so, given a problem that we’re trying to solve, the assumption that it has simple solutions is probably either useful (if it’s true) or costless (if it isn’t). In other words: “look for your missing key under the lamppost, not because it’s probably there, but because you’ll only ever find it if it’s there”.

• 1 Feb 2020 7:09 UTC
LW: 5 AF: 3

I wasn’t claiming that there’ll be an explicit OR gate, just something functionally equivalent to it.

Sure, we’re on the same page here. I think by “There’s still a gradient signal to change the OR gate” you mean exactly what I meant when I said “that would just be passing the buck to the output of that OR”.

I’m not sure I understand 2 and 3. The activations are in practice discrete (e.g. represented by 32 bits), and so the subnetworks can be designed such that they never output values within the range (if that’s important/useful for the mechanism to work).

It’s non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain?

Humans can’t control their brains at the level of abstraction of neurons—by thinking alone—but at a higher level of abstraction they do have some control that can be useful. For example, consider a human in a Newcomb’s problem who decides to 1-box. Arguably, they reason in a certain way in order to make their brain have a certain property (namely, being a brain that decides to 1-box in a Newcomb’s problem).

(Independence is very tricky because they’re part of the same plan, and so a change in your underlying motivation to pursue that plan affects both.)

Perhaps I shouldn’t have used the word “independent”; I just meant that the output of one subnetwork does not affect the output of the other (during any given inference).

• 30 Jan 2020 8:27 UTC
LW: 3 AF: 2

Also note that the OR function is not differentiable, and so the two subnetworks must be implementing some continuous approximation to it. In that case, it seems likely to me that there’s a gradient signal to change the failing-hard mechanism.

I didn’t mean feeding the outputs of the two subnetworks to an OR dedicated to that purpose (that would just be passing the buck to the output of that OR). Suppose instead that the task is classifying cat/dog images and that each subnetwork can independently cause the network to classify a dog image as a cat by messing with a completely different piece of logic (e.g. one subnetwork is doing the equivalent of causing a false detection of cat whiskers, and the other is doing the equivalent of causing a false detection of cat eyes), such that the loss of the model is similar whether one of the two subnetworks or both “decide to make the model fail”.

I want to emphasize that I don’t argue that we should be concerned about such sophisticated mechanisms randomly appearing during training. I argue that, if a huge neural network implements a sufficiently powerful optimization process with a goal system that involves our world, then it seems possible that that optimization process would construct such sophisticated mechanisms within the neural network. (And so the above is merely an argument that such theoretical mechanisms exist, not that they are easy to construct.)

• Inner alignment says, well, it’s not exactly like that. There’s going to be a loss function used to train our AIs, and the AIs themselves will have internal objective functions that they are maximizing, and both of these might not match ours.

As I understand the language, the “loss function used to train our AIs” matches “our objective function” from the classical outer alignment problem. The inner alignment problem seems to me to be a separate problem rather than a “refinement of the traditional argument” (we can fail due to just an inner alignment problem, and we can fail due to just an outer alignment problem).

My understanding is that he spent one chapter talking about multipolar outcomes, and the rest of the book talking about unipolar outcomes.

I’m not sure what you mean by saying “the rest of the book talking about unipolar outcomes”. In what way do the parts of the book that discuss the orthogonality thesis, instrumental convergence, and Goodhart’s law assume or depend on a unipolar outcome?

This is important because if you have the point of view that AI safety must be solved ahead of time, before we actually build the powerful systems, then I would want to see specific technical reasons for why it will be so hard that we won’t solve it during the development of those systems.

Can you give an example of a hypothetical future AI system—or some outcome thereof—that should indicate that humankind ought to start working a lot more on AI safety?

• 29 Jan 2020 19:12 UTC
LW: 3 AF: 2

the gradients will point in the direction of removing the penalty by reducing the agent’s determination to fail upon detecting goal shift.

But it need not be the case, and indeed the “failing-hard mechanism” would be optimized for that to not be the case (in a gradient hacking scenario).

To quickly see that it need not be the case, suppose that the “failing-hard mechanism” is implemented as two subnetworks within the model such that each one of them can output a value that causes the model to fail hard, and they are designed to either both output such a value or both not output such a value. Changing any single weight within the two subnetworks would not break the “failing-hard mechanism”, and so we can expect all the partial derivatives with respect to weights within the two subnetworks to be close to zero (i.e. updating the weights in the direction of the gradient would not destroy the “failing-hard mechanism”).
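A minimal numerical sketch of this (my own illustration, assuming each “subnetwork” is a single saturated sigmoid and the combination is the smooth OR a + b − ab):

```python
import torch

x = torch.tensor([1.0])
w_a = torch.tensor([8.0], requires_grad=True)  # a weight inside subnetwork A
w_b = torch.tensor([8.0], requires_grad=True)  # a weight inside subnetwork B

a = torch.sigmoid(w_a * x)  # ~1.0: subnetwork A outputs the "fail hard" value
b = torch.sigmoid(w_b * x)  # ~1.0: subnetwork B outputs the "fail hard" value
fail = a + b - a * b        # smooth OR: ~1.0 whenever either input is ~1.0

fail.backward()
# d(fail)/d(a) = 1 - b ~ 0, because B alone already forces the output high
# (and symmetrically for B), so no single weight gets a useful gradient.
print(w_a.grad.item(), w_b.grad.item())  # both ~1e-7
```

Note that it is the redundancy, not just the saturation, doing the work here: even if each sigmoid were not saturated, d(fail)/da = 1 − b vanishes as long as the other subnetwork’s output is pinned at 1.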

• If the old arguments were sound, why would researchers shift their arguments in order to make the case that AI posed a risk? I’d assume that if the old arguments worked, the new ones would be a refinement rather than a shift. Indeed many old arguments were refined, but a lot of the new arguments seem very new.

I’m not sure I understand your model. Suppose AI safety researcher Alice writes a post about a problem that Nick Bostrom did not discuss in Superintelligence back in 2014 (e.g. the inner alignment problem). That doesn’t seem to me like meaningful evidence for the proposition “the arguments in Superintelligence are not sound”.

I can’t speak for others, but the general notion of there being a single project that leaps ahead of the rest of the world, and gains superintelligent competence before any other team can even get close, seems suspicious to many researchers that I’ve talked to.

It’s been a while since I listened to the audiobook version of Superintelligence, but I don’t recall the book arguing that the “second-place AI lab” will likely be far behind the leading AI lab (in subjective human time) before we get superintelligence. And even if it had argued for that, as important as such an estimate may be, how is it relevant to the basic question of whether AI safety is something humankind should be thinking about?

In general, the notion that there will be discontinuities in development is looked upon with suspicion by a number of people (though, notably, some researchers still think that fast takeoff is likely).

I don’t recall the book relying on (or [EDIT: with a lot less confidence] even mentioning the possibility of) a discontinuity in capabilities. I believe it does argue that once there are AI systems that can do anything humans can, we can expect extremely fast progress.

• and there’s been a shift in arguments.

The set of arguments that are being actively discussed by AI safety researchers has obviously changed since 2014 (which is true for any active field?). I assume that by “there’s been a shift in arguments” you mean something more than that, but I’m not sure what.

Is there any core argument in the book Superintelligence that is no longer widely accepted among AI safety researchers? Has the progress in deep learning since 2014 made the core arguments in the book less compelling? (Do the arguments about instrumental convergence and Goodhart’s law fail to apply to deep RL?)

• I just want to flag that this approach seems to assume that—before we build the Oracle—we design the Oracle (or the procedure that produces it) such that it will assign a prior of zero to the second type of worlds.

If we use some arbitrary scaled-up supervised learning training process to train a model that does well on general question answering, we can’t just safely sidestep the malign prior problem by providing information/instructions about the prior as part of the question. The simulations of the model that distant superintelligences run may involve such inputs as well. (In those simulations the loss may end up being minimal for whatever output the superintelligence wants the model to yield, regardless of the prescriptive information about the prior in the input.)