Discussion on the machine learning approach to AI safety

(Cross-posted from the Deep Safety blog.)

At this year’s EA Global London conference, Jan Leike and I ran a discussion session on the machine learning approach to AI safety. We explored some of the assumptions and considerations that come up as we reflect on different research agendas. Slides for the discussion can be found here.

The discussion focused on two topics. The first topic examined assumptions made by the ML safety approach as a whole, based on the blog post Conceptual issues in AI safety: the paradigmatic gap. The second topic zoomed into specification problems, which both of us work on, and compared our approaches to these problems.

Assumptions in ML safety

The ML safety approach focuses on safety problems that can be expected to arise for advanced AI and can be investigated in current AI systems. This is distinct from the foundations approach, which considers safety problems that can be expected to arise for superintelligent AI, and develops theoretical approaches to understanding and solving these problems from first principles.

While testing on current systems provides a useful empirical feedback loop, there is a concern that the resulting solutions might not be relevant for more advanced systems, which could potentially be very different from current ones. The Paradigmatic Gap post made an analogy between trying to solve safety problems with general AI using today’s systems as a model, and trying to solve safety problems with cars in the horse and buggy era. The horse to car transition is an example of a paradigm shift that renders many current issues irrelevant (e.g. horse waste and carcasses in the streets) and introduces new ones (e.g. air pollution). A paradigm shift of that scale in AI would be deep learning or reinforcement learning becoming obsolete.

If we imagine living in the horse carriage era, could we usefully consider possible safety issues in future transportation? We could invent brakes and stop lights to prevent collisions between horse carriages, or seat belts to protect people in those collisions, which would be relevant for cars too. Jan pointed out that we could consider a hypothetical scenario with super-fast horses and come up with corresponding safety measures (e.g. pedestrian-free highways). Of course, we might also consider possible negative effects on the human body from moving at high speeds, which turned out not to be an issue. When trying to predict problems with powerful future technology, some of the concerns are likely to be unfounded; this seems like a reasonable price to pay for being proactive on the concerns that do pan out.

Getting back to machine learning, the Paradigmatic Gap post had a handy list of assumptions that could potentially lead to paradigm shifts in the future. We went through this list, and rated each of them based on how much we think ML safety work is relying on it and how likely it is to hold up for general AI systems (on a 1-10 scale). These ratings were based on a quick intuitive judgment rather than prolonged reflection, and are not set in stone.

Our ratings agreed on most of the assumptions:

  • We strongly rely on reinforcement learning, defined as the general framework where an agent interacts with an environment and receives some sort of reward signal, which also includes methods like evolutionary strategies. We would be very surprised if general AI did not operate in this framework (a minimal sketch of the agent-environment loop follows this list).

  • Discrete action spaces (#5) is the only assumption that we strongly rely on but don’t strongly expect to hold up. I would expect effective techniques for discretizing continuous action spaces to be developed in the future, so I’m not sure how much of an issue this is.

  • The train/test regime, MDPs and stationarity can be useful for prototyping safety methods but don’t seem difficult to generalize from.

  • A significant part of safety work focuses on designing good objective functions for AI systems, and does not depend on properties of the architecture like gradient-based learning and parametric models (#7-8).
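
As a concrete picture of the framework in the first bullet, here is a minimal sketch of the agent-environment loop: the agent picks actions, the environment returns a reward, and the agent updates its behavior from that signal alone. The bandit environment and epsilon-greedy agent below are illustrative stand-ins of our own, not anything specific from the session.

```python
import random


class TwoArmedBandit:
    """Toy environment: two actions with different expected payoff."""

    def step(self, action):
        # Action 1 pays off more often than action 0.
        p = 0.8 if action == 1 else 0.2
        return 1.0 if random.random() < p else 0.0


class EpsilonGreedyAgent:
    """Keeps a running value estimate per action and mostly picks the best one."""

    def __init__(self, n_actions=2, epsilon=0.1):
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.epsilon = epsilon

    def act(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def observe(self, action, reward):
        # Incremental average: the reward signal is the only feedback used.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


env, agent = TwoArmedBandit(), EpsilonGreedyAgent()
for _ in range(1000):
    action = agent.act()
    reward = env.step(action)
    agent.observe(action, reward)

print("learned action values:", agent.values)  # the value of action 1 should come out higher
```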

We weren’t sure how to interpret some of the assumptions on the list, so our ratings and disagreements on these are not set in stone:

  • We interpreted #6 as having a fixed action space where the agent cannot invent new actions.

  • Our disagreement about reliance on #9 was probably due to different interpretations of what it means to optimize discrete tasks. I interpreted it as the agent being trained from scratch for specific tasks, while Jan interpreted it as the agent having some kind of objective (potentially very high-level like “maximize human approval”).

  • We were not sure what it would mean not to use probabilistic inference, or what the alternative could be.

An additional assumption mentioned by the audience is the agent/environment separation. The alternative to this assumption is being explored by MIRI in their work on embedded agents. I think we rely on this a lot, and it’s moderately likely to hold up.

Vishal pointed out a general argument for the current assumptions holding up. If there are many ways to build general AI, then the approaches with a lot of resources and effort behind them are more likely to succeed (assuming that the approach in question could produce general AI in principle).

Approaches to specification problems

Specification problems arise when specifying the objective of the AI system. An objective specification is a proxy for the human designer’s preferences. A misspecified objective that does not match the human’s intention can result in undesirable behaviors that maximize the proxy without achieving the intended goal.
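
As a toy illustration (ours, not from the session), consider a hypothetical cleaning robot whose specified objective penalizes visible dirt. Optimizing that proxy favors hiding the mess over cleaning it, even though the designer clearly intends the latter. The actions, features, and numbers below are invented purely for the example.

```python
# Each hypothetical action is scored on (dirt_removed, dirt_visible, effort).
actions = {
    "clean floor": (1.0, 0.0, 0.5),
    "cover dirt":  (0.0, 0.0, 0.1),
    "do nothing":  (0.0, 1.0, 0.0),
}

def true_objective(dirt_removed, dirt_visible, effort):
    # What the designer actually wants: the dirt gone, at modest effort.
    return dirt_removed - 0.1 * effort

def proxy_objective(dirt_removed, dirt_visible, effort):
    # What the designer wrote down: penalize visible dirt and effort.
    return -dirt_visible - 0.1 * effort

print("proxy-optimal:", max(actions, key=lambda a: proxy_objective(*actions[a])))  # cover dirt
print("truly optimal:", max(actions, key=lambda a: true_objective(*actions[a])))   # clean floor
```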

There are two classes of approaches to specification problems: human-in-the-loop (e.g. reward learning or iterated amplification) and problem-specific (e.g. side effects / impact measures or reward corruption theory). Human-in-the-loop approaches are more general and can address many safety problems at once, but it can be difficult to tell when the resulting system has received the right amount and type of human data to produce safe behavior. The two approaches are complementary, and could be combined by using problem-specific solutions as an inductive bias for human-in-the-loop approaches.
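
One way to read “problem-specific solutions as an inductive bias” concretely (a sketch of the idea, not a method we presented) is to shape a reward model learned from human feedback with a hand-specified side-effect penalty. The reward model, state features, and penalty below are hypothetical placeholders.

```python
def learned_reward(state):
    """Stand-in for a reward model trained from human feedback."""
    return float(state.get("task_progress", 0.0))

def impact_penalty(state, baseline):
    """Toy impact measure: count features that changed relative to a baseline."""
    keys = (set(state) | set(baseline)) - {"task_progress"}
    return sum(1.0 for k in keys if state.get(k) != baseline.get(k))

def combined_objective(state, baseline, beta=0.5):
    # beta controls how strongly the hand-specified impact prior
    # constrains the learned objective.
    return learned_reward(state) - beta * impact_penalty(state, baseline)

baseline = {"task_progress": 0.0, "vase": "intact", "door": "closed"}
careful  = {"task_progress": 1.0, "vase": "intact", "door": "closed"}
reckless = {"task_progress": 1.0, "vase": "broken", "door": "open"}

print(combined_objective(careful, baseline))   # 1.0: task done, nothing disturbed
print(combined_objective(reckless, baseline))  # 0.0: task done, but penalized for side effects
```

Here the coefficient beta determines how much the impact prior constrains the learned objective; choosing it well would itself be a design problem.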

We considered some arguments for using pure human-in-the-loop learning and for complementing it with problem-specific approaches, and rated how strong each argument is (on a 1-10 scale).

Our ratings agreed on most of these arguments:

  • The strongest reasons to focus on pure human-in-the-loop learning are #1 (getting some safety-relevant concepts for free) and #3 (unknown unknowns).

  • The strongest reasons to complement with problem-specific approaches are #1 (useful inductive bias) and #3 (higher understanding and trust).

  • Reward hacking is an issue for non-adaptive methods (e.g. if the reward function is frozen in reward learning), but making problem-specific approaches adaptive does not seem that hard (a toy illustration of the frozen-reward issue follows this list).
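
To illustrate the frozen-reward concern, under assumptions invented for the example: suppose human feedback only covers the states an early policy visited, the reward model is then frozen, and a later policy searches more widely. The model can assign spuriously high reward to states far from its training data, and the policy is drawn to exactly those states.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(s):
    # Hypothetical true preference: more is better up to s = 1, then it degrades.
    return np.where(s <= 1.0, s, 2.0 - s)

# Human feedback only covers states an early policy visited (s in [0, 1]).
labeled_states = rng.uniform(0.0, 1.0, size=50)
labels = true_reward(labeled_states) + rng.normal(0.0, 0.01, size=50)

# The reward model (a simple linear fit, for illustration) is then frozen.
frozen_model = np.polynomial.Polynomial.fit(labeled_states, labels, deg=1)

# A later policy optimizes the frozen model over a wider range of states.
candidates = np.linspace(0.0, 5.0, 501)
picked = candidates[np.argmax(frozen_model(candidates))]

print(f"state picked under the frozen model: {picked:.2f}")   # ~5.0
print(f"its true reward: {float(true_reward(picked)):.2f}")   # ~-3.0 (the optimum is 1.0 at s = 1)
```

An adaptive version would keep collecting feedback or refitting the model as the policy’s state distribution shifts.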

The main disagreement was about argument #4 for complementing: Jan expects the objectives/strategies issues to be more easily solvable than I do.

Overall, this was an interesting conversation that helped clarify my understanding of the safety research landscape. This is part of a longer conversation that is very much a work in progress, and we expect to continue discussing and updating our views on these considerations.