Discussion on the machine learning approach to AI safety

(Cross-posted from the Deep Safety blog.)

At this year’s EA Global London conference, Jan Leike and I ran a discussion session on the machine learning approach to AI safety. We explored some of the assumptions and considerations that come up as we reflect on different research agendas. Slides for the discussion can be found here.

The discussion focused on two topics. The first topic examined assumptions made by the ML safety approach as a whole, based on the blog post Conceptual issues in AI safety: the paradigmatic gap. The second topic zoomed into specification problems, which both of us work on, and compared our approaches to these problems.

Assumptions in ML safety

The ML safety approach focuses on safety problems that can be expected to arise for advanced AI and can be investigated in current AI systems. This is distinct from the foundations approach, which considers safety problems that can be expected to arise for superintelligent AI, and develops theoretical approaches to understanding and solving these problems from first principles.

While testing on current systems provides a useful empirical feedback loop, there is a concern that the resulting solutions might not be relevant for more advanced systems, which could potentially be very different from current ones. The Paradigmatic Gap post made an analogy between trying to solve safety problems with general AI using today’s systems as a model, and trying to solve safety problems with cars in the horse and buggy era. The horse to car transition is an example of a paradigm shift that renders many current issues irrelevant (e.g. horse waste and carcasses in the streets) and introduces new ones (e.g. air pollution). A paradigm shift of that scale in AI would be deep learning or reinforcement learning becoming obsolete.

If we imagine living in the horse carriage era, could we usefully consider possible safety issues in future transportation? We could invent brakes and stop lights to prevent collisions between horse carriages, or seat belts to protect people in those collisions, which would be relevant for cars too. Jan pointed out that we could consider a hypothetical scenario with super-fast horses and come up with corresponding safety measures (e.g. pedestrian-free highways). Of course, we might also consider possible negative effects on the human body from moving at high speeds, which turned out not to be an issue. When trying to predict problems with powerful future technology, some of the concerns are likely to be unfounded—this seems like a reasonable price to pay for being proactive on the concerns that do pan out.

Getting back to machine learning, the Paradigmatic Gap post had a handy list of assumptions that could potentially lead to paradigm shifts in the future. We went through this list, and rated each assumption based on how much we think ML safety work relies on it and how likely it is to hold up for general AI systems (on a 1-10 scale). These ratings were based on a quick intuitive judgment rather than prolonged reflection, and are not set in stone.

Our ratings agreed on most of the assumptions:

  • We strongly rely on reinforcement learning, defined as the general framework where an agent interacts with an environment and receives some sort of reward signal, which also includes methods like evolutionary strategies. We would be very surprised if general AI did not operate in this framework (a minimal sketch of this framework appears after this list).

  • Discrete action spaces (#5) is the only assumption that we strongly rely on but don’t strongly expect to hold up. I would expect effective techniques for discretizing continuous action spaces to be developed in the future, so I’m not sure how much of an issue this is.

  • The train/test regime, MDPs and stationarity can be useful for prototyping safety methods but don’t seem difficult to generalize from.

  • A significant part of safety work focuses on designing good objective functions for AI systems, and does not depend on properties of the architecture like gradient-based learning and parametric models (#7-8).
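
As a concrete reference point for the reinforcement learning framing in the first bullet, here is a minimal sketch of the agent-environment loop in Python. The toy environment, the random agent, and the episode length are all hypothetical placeholders for illustration, not anything from the discussion.

```python
# Minimal sketch of the agent-environment loop: the agent acts, the
# environment returns an observation and a reward signal. Everything
# here is a made-up toy for illustration.
import random

class CoinFlipEnv:
    """Toy environment: reward 1 if the agent's action matches a hidden coin."""
    def reset(self):
        self.coin = random.choice([0, 1])
        return 0  # uninformative observation

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.coin = random.choice([0, 1])  # draw a fresh coin; dynamics are stationary
        return 0, reward, False  # observation, reward, done

def random_agent(observation):
    # Placeholder policy: picks an action uniformly at random.
    return random.choice([0, 1])

env = CoinFlipEnv()
obs = env.reset()
total_reward = 0.0
for _ in range(100):
    action = random_agent(obs)
    obs, reward, done = env.step(action)
    total_reward += reward
print("return over 100 steps:", total_reward)
```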

We weren’t sure how to interpret some of the assumptions on the list, so our ratings and disagreements on these are not set in stone:

  • We interpreted #6 as having a fixed action space where the agent cannot invent new actions.

  • Our disagreement about reliance on #9 was probably due to different interpretations of what it means to optimize discrete tasks. I interpreted it as the agent being trained from scratch for specific tasks, while Jan interpreted it as the agent having some kind of objective (potentially very high-level like “maximize human approval”).

  • We weren’t sure what it would mean not to use probabilistic inference, or what the alternative could be.

An additional assumption mentioned by the audience is the agent/environment separation. The alternative to this assumption is being explored by MIRI in their work on embedded agents. I think we rely on this a lot, and it’s moderately likely to hold up.

Vishal pointed out a general argument for the current assumptions holding up. If there are many possible ways to build general AI, then the approaches with a lot of resources and effort behind them, such as today’s ML techniques, are more likely to succeed (assuming that the approach in question could produce general AI in principle).

Approaches to specification problems

Specification problems arise when specifying the objective of the AI system. An objective specification is a proxy for the human designer’s preferences. A misspecified objective that does not match the human’s intention can result in undesirable behaviors that maximize the proxy without achieving the intended goal.
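
To make this concrete, here is a toy, hypothetical illustration (not an example from the discussion) of a misspecified objective being maximized without satisfying the designer's intent: the specified proxy only rewards reaching a goal quickly, while the designer also cares about not breaking a vase along the way.

```python
# Toy illustration of a misspecified objective. All names and numbers
# are made-up placeholders.

def proxy_reward(path):
    # Misspecified objective: only speed matters.
    return 10 - len(path["steps"])

def true_utility(path):
    # Designer's actual preferences: speed matters, but breaking the vase is very bad.
    return 10 - len(path["steps"]) - (100 if path["breaks_vase"] else 0)

paths = [
    {"name": "shortcut through the vase", "steps": range(3), "breaks_vase": True},
    {"name": "safe detour around the vase", "steps": range(5), "breaks_vase": False},
]

best_for_proxy = max(paths, key=proxy_reward)
best_for_human = max(paths, key=true_utility)
print("agent optimizing the proxy picks:", best_for_proxy["name"])
print("the human would have preferred:", best_for_human["name"])
```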

There are two classes of approaches to specification problems—human-in-the-loop (e.g. reward learning or iterated amplification) and problem-specific (e.g. side effects / impact measures or reward corruption theory). Human-in-the-loop approaches are more general and can address many safety problems at once, but it can be difficult to tell when the resulting system has received the right amount and type of human data to produce safe behavior. The two approaches are complementary, and could be combined by using problem-specific solutions as an inductive bias for human-in-the-loop approaches.
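
As a rough sketch of the combination suggested above, a problem-specific term (here a crude impact penalty) could act as an inductive bias on top of a reward model learned from human feedback. The component functions, the state representation, and the weight beta below are made-up placeholders, not a proposal from either of our research agendas.

```python
# Sketch: combine a learned reward with a problem-specific impact penalty.

def learned_reward(state):
    # Stand-in for a reward model trained from human feedback.
    return float(state.get("task_progress", 0.0))

def impact_penalty(state, baseline_state):
    # Stand-in for a side-effects / impact measure: penalize deviation from
    # a baseline state, ignoring the task-progress feature itself.
    features = set(state) | set(baseline_state)
    return sum(abs(state.get(f, 0.0) - baseline_state.get(f, 0.0))
               for f in features if f != "task_progress")

def combined_objective(state, baseline_state, beta=0.5):
    # The impact term acts as an inductive bias on the learned reward.
    return learned_reward(state) - beta * impact_penalty(state, baseline_state)

baseline = {"task_progress": 0.0, "vase_intact": 1.0}
tidy = {"task_progress": 1.0, "vase_intact": 1.0}
messy = {"task_progress": 1.0, "vase_intact": 0.0}
print(combined_objective(tidy, baseline), combined_objective(messy, baseline))
```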

We considered some arguments for using pure human-in-the-loop learning and for complementing it with problem-specific approaches, and rated how strong each argument is (on a 1-10 scale).

Our ratings agreed on most of these arguments:

  • The strongest reasons to focus on pure human-in-the-loop learning are #1 (getting some safety-relevant concepts for free) and #3 (unknown unknowns).

  • The strongest reasons to complement with problem-specific approaches are #1 (useful inductive bias) and #3 (higher understanding and trust).

  • Reward hacking is an issue for non-adaptive methods (e.g. if the reward function is frozen in reward learning), but making problem-specific approaches adaptive does not seem that hard.

The main disagreement was about argument #4 for complementing—Jan expects the objectives/strategies issues to be more easily solvable than I do.

Overall, this was an interesting conversation that helped clarify my understanding of the safety research landscape. This is part of a longer conversation that is very much a work in progress, and we expect to continue discussing and updating our views on these considerations.