Directions and desiderata for AI alignment

Note: This is the first post from part four (what needs doing) of the sequence on iterated amplification. The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment.

In the first half of this post, I’ll discuss three research directions that I think are especially promising and relevant to AI alignment:

  1. Reliability and robustness. Building ML systems which behave acceptably in the worst case rather than only on the training distribution.

  2. Oversight / reward learning. Constructing objectives and training strategies which lead our policies to do what we intend.

  3. Deliberation and amplification. Surpassing human performance without simultaneously abandoning human preferences.

I think that we have several angles of attack on each of these problems, and that solutions would significantly improve our ability to align AI. My current feeling is that these areas cover much of the key work that needs to be done.

In the second half of the post, I’ll discuss three desiderata that I think should guide research on alignment:

  1. Secure. Our solutions should work acceptably even when the environment itself is under the influence of an adversary.

  2. Competitive. Our solutions should impose minimal overhead, performance penalties, or restrictions compared to malign AI.

  3. Scalable. Our solutions should continue to work well even when the underlying learning systems improve significantly.

I think that taking these requirements seriously leads us to substantially narrow our focus.

It may turn out that these desiderata are impossible to meet, but if so I think that the first order of business should be understanding clearly why they are impossible. This would let us better target our work on alignment and better prepare for a future where we won’t have a completely satisfying solution to alignment.

(The ideas in this post are not novel. My claimed contribution is merely collecting these things together. I will link to my own writing on each topic in large part because that’s what I know.)

I. Research directions

1. Reliability and robustness

Traditional ML algorithms optimize a model or policy to perform well on the training distribution. These models can behave arbitrarily badly when we move away from the training distribution. Similarly, they can behave arbitrarily badly on a small part of the training distribution.

I think this is bad news:

  • Deploying ML systems will critically change their environment, in a way that is hard or impossible to simulate at training time. (The “treacherous turn” is a special case of this phenomenon.)

  • Deployed ML systems are interconnected and exposed to the same world. So if conditions change in a way that causes one of them to fail, many systems may fail simultaneously.

  • If ML systems are extremely powerful, or if they play a critical role in society, then a widespread failure may have catastrophic consequences.

I’m aware of three basic approaches to reliability that seem to me like they could plausibly scale and be competitive:

(ETA: this list is superseded by the list in Techniques for Optimizing Worst-Case Performance. I removed consensus and added interpretability and verification. I don’t discuss “learning the right model,” which I still consider a long shot.)

  • Adversarial training. At training time, attempt to construct inputs that induce problematic behavior and train on those. Eventually, we hope there will be no catastrophe-inducing inputs left. We don’t yet know what is possible to achieve; a minimal sketch of the idea follows this list. (Szegedy 2014, Goodfellow 2015)

  • Ensembling and consensus. We often have confidence that there exist some models which will generalize appropriately. If we can verify that many models agree about an answer, we can be confident that the consensus is correct. If we use this technique, we will often need to abstain on unfamiliar inputs, and in order to remain competitive we will probably need to represent the ensemble implicitly. (Khani 2016)

  • Learning the right model. If we understood enough about the structure of our model (for example if it reflected the structure of the underlying data-generating process), we might be confident that it will generalize correctly. Very few researchers are aiming for a secure / competitive / scalable solution along these lines, and finding one seems almost (but not completely) hopeless to me. This is MIRI’s approach.
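To make the adversarial training item above slightly more concrete, here is a minimal sketch in the spirit of FGSM-style adversarial training (Goodfellow 2015) on a toy linear classifier. The model, data, and hyperparameters are all illustrative assumptions of mine, not a description of any particular system; real attacks and defenses would be far more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data.
X = rng.normal(size=(200, 10))
true_w = rng.normal(size=10)
y = (X @ true_w > 0).astype(float)

w = np.zeros(10)      # parameters of a logistic-regression "policy"
eps, lr = 0.1, 0.5    # perturbation radius and learning rate (illustrative values)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(x, label, w):
    # Gradient of the logistic loss with respect to the input x;
    # used to construct a worst-case perturbation of x.
    return (sigmoid(x @ w) - label) * w

for step in range(500):
    # Adversarial step: push each training input in the direction that most
    # increases the loss, within an L-infinity ball of radius eps (FGSM).
    grads = np.stack([input_gradient(x, lbl, w) for x, lbl in zip(X, y)])
    X_adv = X + eps * np.sign(grads)

    # Training step: fit the model on clean and adversarial examples together.
    X_mix = np.vstack([X, X_adv])
    y_mix = np.concatenate([y, y])
    w -= lr * X_mix.T @ (sigmoid(X_mix @ w) - y_mix) / len(y_mix)

print("clean accuracy:", ((sigmoid(X @ w) > 0.5) == y).mean())
```

The hope, as described above, is that training against the strongest perturbations we can find eventually leaves no catastrophe-inducing inputs; whether that hope can be realized at scale is an open question.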

Usual caveats apply: these approaches may need to be used in combination; we are likely to uncover completely different approaches in the future; and I’m probably overlooking important existing approaches.

I think this problem is pretty well understood and well recognized, but it looks really hard. ML researchers mostly focus on improving performance rather than robustness, and so the area remains neglected despite the problem being well recognized.

(Previous posts on this blog: red teams, learning with catastrophes, thoughts on training highly reliable models)

2. Oversight / reward learning

ML systems are typically trained by optimizing some objective over the training distribution. For this to yield “good” behavior, the objective needs to be sufficiently close to what we really want.

I think this is also bad news:

  • Some tasks are very “easy” to frame as optimization problems. For example, we can already write an objective to train an RL agent to operate a profit-maximizing autonomous corporation (though for now we can only train very weak agents).

  • Many tasks that humans care about, such as maintaining law and order or helping us better understand our values, are extremely hard to convert into precise objectives: they are inherently poorly-defined or involve very long timescales, and simple proxies can be “gamed” by a sophisticated agent.

  • As a result, many tasks that humans care about may not get done well; we may find ourselves in an increasingly sophisticated and complex world driven by completely alien values.

So far, the most promising angle of attack is to optimize extremely complex objectives, presumably by learning them.

I’m aware of two basic approaches to reward learning that seem like they could plausibly scale:

  • Inverse reinforcement learning. We can observe human behavior in a domain and try to infer what the human is “trying to do,” converting it into an objective that can be used to train our systems. (Russell 1998, Ng 2000, Hadfield-Menell 2016)

  • Learning from human feedback. We can pose queries to humans to figure out which behaviors or outcomes they prefer, and then optimize our systems accordingly; a toy sketch follows this list. (Isbell 2001, Thomaz 2006, Pilarski 2011, Knox 2012)
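As a toy illustration of the “learning from human feedback” item, here is a sketch of fitting a reward model from pairwise comparisons with a Bradley-Terry style preference model. The simulated “human,” the linear reward model, and all of the numbers are my own illustrative assumptions, not a description of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5

# Stand-in for the human's true preferences over behaviors (unknown to the learner).
true_w = rng.normal(size=n_features)

def human_prefers_first(a, b):
    # Simulated query: the human reports which of two behaviors they prefer.
    return float(a @ true_w > b @ true_w)

# Human feedback is expensive, so we gather only a modest number of comparisons.
queries = [(rng.normal(size=n_features), rng.normal(size=n_features)) for _ in range(200)]
labels = [human_prefers_first(a, b) for a, b in queries]

# Fit a linear reward model r(x) = w . x by maximizing the Bradley-Terry likelihood
# P(a preferred to b) = sigmoid(r(a) - r(b)).
w = np.zeros(n_features)
lr = 0.5
for _ in range(1000):
    grad = np.zeros(n_features)
    for (a, b), label in zip(queries, labels):
        p = 1.0 / (1.0 + np.exp(-(a - b) @ w))
        grad += (p - label) * (a - b)
    w -= lr * grad / len(queries)

# The learned reward model could then serve as the training objective for a policy.
print("cosine similarity with true preferences:",
      w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w)))
```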

These solutions seem much closer to working than those listed in the previous section on reliability and robustness. But they still face many challenges, and are not yet competitive, scalable, or secure:

  • IRL requires a prior over preferences and a model of how human behavior relates to human preferences. Current implementations either only work in severely restricted environments, or use simple models of human rationality which cause the learner to attempt to imitate the human’s behavior very precisely (which might be challenging or impossible).

  • For similar reasons, existing IRL implementations are not able to learn from other data like human utterances or off-policy behavior, even though these constitute the largest and richest source of data about human preferences.

  • Human feedback requires accurately eliciting human preferences, which introduces many complications. (I discuss a few easy problems here.)

  • Human feedback is expensive, and so we will need to be able to learn from a relatively small amount of labeled data. Demonstrations are also expensive and so may end up being a bottleneck for approaches based on IRL, though that’s less clear.

  • Both imitation learning and human feedback may fail when evaluating a behavior requires understanding where the behavior came from. For example, if you ask a human to evaluate a painting they may not be able to easily check whether it is derivative, even if over the long run they would prefer their AI to paint novel paintings.

(I’ve described these approaches in the context of “human” behavior, but the expert providing feedback/demonstrations might themselves be a human augmented with AI assistance, and eventually may simply be an AI system that is aligned with human interests.)

This problem has not received much attention in the past, but it seems to be rapidly growing in popularity, which is great. I’m currently working on a project in this area.

(Previous posts on this blog: the reward engineering problem, ambitious vs. narrow value learning, against mimicry, thoughts on reward engineering.)

3. Deliberation and amplification

Machine learning is usually applied to tasks where feedback is readily available. The research problem in the previous section aims to obtain quick feedback in general by using human judgments as the “gold standard.” But this approach breaks down if we want to exceed human performance.

For example, it is easy to see how we could train ML systems to make human-level judgments about urban planning, by training them to produce plans that sound good to humans. But if we want to train an ML system to make superhuman judgments about how to lay out a city, it’s completely unclear how we could do it, short of spending billions of dollars trying out the system’s ideas and telling it which ones work.

This is a problem for the same reasons discussed in the preceding section. If our society is driven by systems superhumanly optimizing short-term proxies for what we care about (such as how much they impress humans, or how much money they make), then we are liable to head off in a direction which does not reflect our values or leave us in meaningful control of the situation.

If we lowered our ambitions and decided that superhuman performance is inherently unsafe, we would be leaving huge amounts of value on the table. Moreover, this would be an unstable situation: it could last only as long as everyone with access to AI coordinated to pull their punches and handicap their AI systems.

I’m aware of three approaches to this problem that seem like they could scale:

  • IRL [hard mode]. In principle we can use IRL to recover a representation of human preferences, and then apply superhuman intelligence to satisfy those preferences much better than a human could. However, this is a much more ambitious and challenging form of IRL than is usually discussed, and it remains quite challenging even when you set aside all of the usual algorithmic and statistical difficulties. (Jacob Steinhardt and Owain Evans discuss this issue in a recent post.)

  • Iterated amplification. A group of interacting humans can potentially be smarter than a single human, and a group of AI systems could be smarter than the original AI system. By using these groups as “experts” in place of individual humans, we could potentially train much smarter systems; a toy sketch of the amplification step follows this list. The key questions are how to perform this composition in a way that causes the group to implement the same preferences as its members, and whether the cognitive benefits for groups are large enough to overcome the overhead of coordination. (I discuss this approach here and in follow-up work.)

  • IRL for cognition. Rather than applying IRL to a human’s actions, we could apply it to the cognitive actions taken by a human while they deliberate about a subject. We can then use those values to execute a longer deliberation process, asking “what would the human do if they had more time to think / more powerful cognitive tools?” I think this approach ends up being similar to a blend of the previous two.
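Here is the toy sketch of the amplification step referenced in the iterated amplification bullet: composing calls to a weak agent answers a question the agent cannot answer on its own. The task, the decomposition, and the “weak agent” are illustrative assumptions; the hard open questions named above (preserving the members’ preferences, and whether the cognitive benefits outweigh coordination overhead) are exactly the parts this toy does not address.

```python
# The base agent: stands in for a learned model (or an individual human) that can
# only answer very simple questions directly.
def weak_agent(numbers):
    assert len(numbers) <= 2, "the weak agent can only handle trivial questions"
    return sum(numbers)

def amplify(agent, numbers):
    # The amplification step: decompose a hard question (summing a long list) into
    # subquestions, answer each with the agent, and use the agent again to combine
    # the sub-answers. In the actual proposal, a new model would then be trained to
    # imitate the amplified system ("distillation"), and the process repeated.
    if len(numbers) <= 2:
        return agent(numbers)
    mid = len(numbers) // 2
    return agent([amplify(agent, numbers[:mid]), amplify(agent, numbers[mid:])])

# The composed system answers a question far beyond the base agent's ability.
print(amplify(weak_agent, list(range(100))))   # prints 4950
```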

It’s completely unclear how hard this problem is or how far we are from a solution. It is a much less common research topic than either of the preceding points.

In the short term, I think it might be easier to study analogs of this problem in the context of human behavior than to attempt to directly study it in the context of AI systems.

Ought is a non-profit aimed at addressing (roughly) this problem; I think it is reasonably likely to make significant progress.

(Previous posts on this blog: capability amplification, reliability amplification, security amplification, meta-execution, the easy goal inference problem is still hard)

II. Desiderata

I’m most interested in algorithms that are secure, competitive, and scalable, and I think that most research programs are very unlikely to deliver these desiderata (this is why the lists above are so short).

Since these desiderata are doing a lot of work in narrowing down the space of possible research directions, it seems worthwhile to be thoughtful and clear about them. It would be easy to gloss over any of them as obviously unobjectionable, but I would be more interested in people pushing back on the strong forms than implicitly accepting a milder form.

1. Secure

Many pieces of software work “well enough” most of the time; we often learn this not by a deep analysis but by just trying it and seeing what happens. “Works well enough” often breaks down when an adversary enters the picture.

Whether or not that’s a good way to build AI, I think it’s a bad way to do alignment research right now.

Instead, we should try to come up with alignment solutions that work in the least convenient world, when nature itself is behaving adversarially. Accomplishing this requires argument and analysis, and cannot be based exclusively on empirical observation.

AI systems obviously won’t work well in the worst case (there is no such thing as a free lunch), but it’s reasonable to hope that our AI systems will never respond to a bad input by actively trying to hurt us, at least as long as we remain in physical control of the computing hardware, the training process, and so on.

Why does security seem important?

  • It’s really hard to anticipate what is going to happen in the future. I think it’s easy to peer into the mists and say “well, hard to know what’s going to happen, but this solution might work out OK,” and then to turn out to be too optimistic. It’s harder to make this error when we hold ourselves to a higher standard of actually giving an argument for why things work. I think that this is a general principle for doing useful research in advance of when it is needed: we should hold ourselves to standards that are unambiguous and clear even when the future is murky. This is a theme that will recur in the coming sections.

  • We are used to technological progress proceeding slowly compared to timescales of human judgment and planning. It seems quite likely that powerful AI will be developed during or after a period of acceleration, challenging those assumptions and undermining a traditional iterative approach to development.

  • The world really does contain adversaries. It’s one thing to build insecure software when machines have power over modest amounts of money with significant human oversight; it’s another thing altogether when they have primary responsibility for enforcing the law. I’m not even particularly worried about human attackers; I’m mostly worried about a future where all it takes to launch attacks is money (which can itself be earned by executing attacks). Moreover, if the underlying ML is insecure and ML plays a role in almost all software, we are going to have a hard time writing any secure software at all.

(Previous posts: security and AI alignment)

2. Competitive

It’s easy to avoid building an unsafe AI system (for example: build a spreadsheet instead). The only question is how much you have to sacrifice to do it.

Ideally we’ll be able to build benign AI systems that are just as efficient and capable as the best AI that we could build by any means. That means: we don’t have to do additional domain-specific engineering work to align our systems, benign AI doesn’t require too much more data or computation, and our alignment techniques don’t force us to use particular techniques or restrict our choices in other ways.

(More precisely, I would consider an alignment strategy a success if the additional costs are sublinear: if the fraction of resources that need to be spent on alignment research and run-time overhead decreases as the AI systems become more powerful, converging towards 0.)
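One hedged way to state this success criterion formally (the notation C(N) and A(N) is mine, introduced only for illustration): if C(N) is the total cost of building and running an AI system at capability level N, and A(N) is the additional cost of aligning it, then the alignment overhead should vanish as a fraction of the total:

```latex
% Illustrative notation, not from the original post:
%   C(N) = total resources spent building and running an AI system at capability level N
%   A(N) = additional resources (alignment research plus run-time overhead) at that level
\[
  \lim_{N \to \infty} \frac{A(N)}{C(N)} = 0
\]
```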

Why is competitiveness important?

A. It’s easy to tell when a solution is plausibly competitive, but very hard to tell exactly how uncompetitive an uncompetitive solution will be. For example, if a purported alignment strategy requires an AI not to use technique or development strategy X, it’s easy to tell that this proposal isn’t competitive in general, but very hard to know exactly how uncompetitive it is.

As in the security case, it seems very easy to look into the fog of the future and say “well, this seems like it will probably be OK” and then to turn out to be too optimistic. If we hold ourselves to the higher standard of competitiveness, it is much easier to stay honest.

Relatedly, we want alignment solutions that work across an extremely large range of techniques not just because we are uncertain about which techniques will be important, but because generalizing across all of the situations we can foresee is a good predictor of working for situations we can’t foresee.

B. You can’t unilaterally use uncompetitive alignment techniques; we would need global coordination to avoid trouble. If we don’t know how to build competitive benign AI, then users/designers of AI systems have to compromise efficiency in order to maintain reliable control over those systems. The most efficient systems will by default be built by whoever is willing to accept the largest risk of catastrophe (or perhaps by actors who consider unaligned AI a desirable outcome).

It may be possible to avert this kind of race to the bottom with effective coordination, e.g. by enforcing regulations which mandate adequate investments in alignment or restrict what kinds of AI are deployed. Enforcing such controls domestically is already a huge headache. But internationally things are even worse: a country that handicapped its AI industry in order to proceed cautiously would face the risk of being overtaken by a less prudent competitor, and avoiding that race would require effective international coordination.

Ultimately society will be able and willing to pay some efficiency cost to reliably align AI with human interests. But the higher that cost, the harder the coordination problem that we will need to solve. I think the research community should be trying to make that coordination problem as easy as possible.

(Previous posts: prosaic AI alignment, a possible stance for AI control, efficient and safely scalable)

3. Scalable

Over time, we are acquiring more data, more powerful computers, richer model classes, better optimization algorithms, better exploration strategies, and so on. If we extrapolate these trends, we end up with very powerful models and policies.

Many approaches to alignment break down at some point in this extrapolation. For example, if we train an RL agent with a reward function which imperfectly approximates what we want, it is likely to fail once the agent becomes sufficiently sophisticated, unless the reward function itself becomes more sophisticated in parallel.

In contrast, let’s say that a technique is “scalable” if it continues to work just as well even when the underlying learning becomes much more powerful. (See also: Eliezer’s more colorful “omnipotence test.”)

This is another extremely demanding requirement. It rules out many possible approaches to alignment. For example, it probably rules out any approach that involves hand-engineering reward functions. More subtly, I expect it will rule out any approach that requires hand-engineering an informative prior over human values (though some day we will hopefully find a scalable approach to IRL).

Why is scalability important?

  • As in the previous sections, it’s easy to be too optimistic about exactly when a non-scalable alignment scheme will break down. It’s much easier to keep ourselves honest if we actually hold ourselves to producing scalable systems.

  • If AI progresses rapidly, and especially if AI research is substantially automated, then we may literally confront a situation where the capabilities of our AI systems are changing rapidly. It would be desirable to have alignment schemes that continued to work in this case.

  • If we don’t have scalable solutions then we require a continuing investment in alignment research in order to “keep up” with improvements in the underlying learning. This risks compromising competitiveness, forcing AI developers to make a hard tradeoff between alignment and capabilities. That would be acceptable if the ongoing investments in alignment are modest compared to the investments in capabilities. But as with the last point, that’s a very murky question about which it seems easy to be overly optimistic in advance. If we think the problem will be easy in the future when we have more computing power, then we ought to be able to do it now. Or at the very least we ought to be able to explain how more computing power will make it easy. If we make such an explanation sufficiently precise then it will itself become a scalable alignment proposal (though perhaps one that involves ongoing human effort).

(Previous posts: scalable AI control)

Aside: feasibility

One might reject these desiderata because they seem too demanding: it would be great if we had a secure, competitive, and scalable approach to alignment, but that might not be possible.

I am interested in trying to satisfy these desiderata despite the fact that they are quite demanding, for two reasons:

  • I think that it is very hard to say in advance what is possible or impossible. I don’t yet see any fundamental obstructions to achieving these goals, and until I see hard obstructions I think there is a significant probability that the problem will prove to be feasible (or “almost possible,” in the sense that we may need to weaken these goals only slightly).

  • If there is some fundamental obstruction to achieving these goals, then it would be good to understand that obstruction in detail. Understanding it would help us understand the nature of the problem we face and would allow us to do better research on alignment (by focusing on the key aspects of the problem). And knowing that these problems are impossible, and understanding exactly how impossible they are, helps us prepare for the future, to build institutions and mechanisms that will be needed to cope with unavoidable limitations of our AI alignment strategies.

III. Conclusion

I think there is a lot of research to be done on AI alignment; we are limited by a lack of time and labor rather than by a lack of ideas about how to make progress.

Research relevant to alignment is already underway; researchers and funders interested in alignment can get a lot of mileage by supporting and fleshing out existing research programs in relevant directions. I don’t think it is correct to assume that if anyone is working on a problem then it is going to get solved: even amongst things that aren’t literally at the “no one else is doing it” level, there are varying degrees of neglect.

At the same time, the goals of alignment are sufficiently unusual that we shouldn’t be surprised or concerned to find ourselves doing unusual research. I think that area #3 on deliberation and amplification is almost completely empty, and will probably remain pretty empty until we have clearer statements of the problem or convincing demonstrations of work in that area.

I think the distinguishing feature of research motivated by AI alignment should be an emphasis on secure, competitive, and scalable solutions. I think these are very demanding requirements that significantly narrow down the space of possible approaches and which are rarely explicitly considered in the current AI community.

It may turn out that these requirements are infeasible; if so, one key output of alignment research will be a better understanding of the key obstacles. This understanding can help guide less ambitious alignment research, and can help us prepare for a future in which we won’t have a completely satisfying solution to AI alignment.

This post has mostly focused on research that would translate directly into concrete systems. I think there is also a need for theoretical research building better abstractions for reasoning about optimization, security, selection, consequentialism, and so on. It is plausible to me that we will produce acceptable systems with our current conceptual machinery, but if we want to convincingly analyze those systems then I think we will need significant conceptual progress (and better concepts may lead us to different approaches). I think that practical and theoretical research will be attractive to different researchers, and I don’t have strong views about their relative value.


This was originally posted here on 6th February 2017.

Tomorrow’s AI Alignment Forum sequences post will be ‘Human-AI Interaction’ by Rohin Shah in the sequence on Value Learning.

The next post in this sequence will be ‘The reward engineering problem’ by Paul Christiano, on Tuesday 15th Jan.