AI Alignment Research Overview (by Jacob Steinhardt)

I’m really excited to see someone outline all the work they think needs doing in AI alignment: to describe what the problem looks like, what a solution looks like, and what work has been done so far. Especially from Jacob, who is a coauthor of the Concrete Problems in AI Safety paper.

Below, I’ve included some excerpts from the doc: the introduction, the following section describing the categories of technical work, and some high-level information from the long sections on the ‘technical alignment problem’ and ‘detecting failures in advance’.


This doc­u­ment gives an overview of differ­ent ar­eas of tech­ni­cal work that seem nec­es­sary, or at least de­sir­able, for cre­at­ing safe and al­igned AI sys­tems. The fo­cus is on safety and al­ign­ment of pow­er­ful AI sys­tems, i.e. sys­tems that may ex­ceed hu­man ca­pa­bil­ities in a broad va­ri­ety of do­mains, and which likely act on a large scale. Cor­re­spond­ingly, there is an em­pha­sis on ap­proaches that seem scal­able to such sys­tems.

By “aligned”, I mean that the actions a system pursues move the world towards states that humans want, and away from states that humans don’t want. Some issues with this definition are that different humans might have different preferences (I will mostly ignore this issue), and that there are differences between stated preferences, “revealed” preferences as implied by actions, and preferences that one endorses upon reflection (I won’t ignore this issue).

I think it is quite plausible that some topics are missing, and I welcome comments to that effect. My goal is to outline a critical mass of topics in enough detail that someone with knowledge of ML and some limited familiarity with AI alignment as an area would have a collection of promising research directions, a mechanistic understanding of why they are promising, and some pointers for what work on them might look like.

To that end, be­low I out­line four broad cat­e­gories of tech­ni­cal work: tech­ni­cal al­ign­ment (the over­com­ing of con­cep­tual or en­g­ineer­ing is­sues needed to cre­ate al­igned AI), de­tect­ing failures (the de­vel­op­ment of tools for proac­tively as­sess­ing the safety/​al­ign­ment of a sys­tem or ap­proach), method­olog­i­cal un­der­stand­ing (best prac­tices backed up by ex­pe­rience), and sys­tem-build­ing (how to tie to­gether the three pre­ced­ing cat­e­gories in the con­text of many en­g­ineers work­ing on a large sys­tem). Th­ese are de­scribed in more de­tail in the next sec­tion.

In each sec­tion I give ex­am­ples of prob­lems we might want to solve. I imag­ine these in the con­text of fu­ture pow­er­ful AI sys­tems, which means that most of the con­crete sce­nar­ios are spec­u­la­tive, vague, and likely in­cor­rect if in­ter­preted as a pre­dic­tion about the fu­ture. If I were to give the strongest jus­tifi­ca­tion for the re­search top­ics be­low, I would in­stead fo­cus on near-fu­ture and ex­ist­ing sys­tems, which already ex­hibit many of the is­sues I dis­cuss. Nev­er­the­less, I think this imag­i­na­tive ex­er­cise can be helpful both for stim­u­lat­ing re­search and for keep­ing the fo­cus on scal­able solu­tions.

Caveats. I found it difficult to write a re­search overview of a field as nascent as AI al­ign­ment, as any­thing I could write sounded ei­ther too au­thor­i­ta­tive rel­a­tive to my con­fi­dence, or so full of caveats and qual­ifi­ca­tions as to be un­read­able. I set­tled for elid­ing many of the qual­ifi­ca­tions and pro­vid­ing this sin­gle caveat up front: that this doc­u­ment re­flects an im­perfect snap­shot of my cur­rent think­ing, that it ex­presses many ideas more slop­pily than I would usu­ally feel com­fortable putting into writ­ing, and that I hope read­ers will for­give this slop­piness in the ser­vice of say­ing some­thing about a topic that I feel is im­por­tant.

This doc­u­ment is not meant to be a de­scrip­tion of my per­sonal in­ter­ests, but rather of po­ten­tially promis­ing top­ics within a field I care about. My own in­ter­ests are nei­ther a sub­set nor su­per­set of the top­ics in this doc­u­ment, al­though there is high over­lap. Even con­fined to AI al­ign­ment, this doc­u­ment is out-of-date and omits some of my re­cent think­ing on eco­nomic as­pects of ML.

Fi­nally, I make a num­ber of claims be­low about what re­search di­rec­tions I think are promis­ing or un-promis­ing. Some of these claims are likely wrong, and I could even imag­ine chang­ing my mind af­ter 1 hour of con­ver­sa­tion with the right per­son. I de­cided that this doc­u­ment would be more in­for­ma­tive and read­able if I gave my un­filtered take (rather than only opinions I thought I would likely defend upon con­sid­er­a­tion), but the flip side is that if you think I’m wrong about some­thing, you should let me know!

Cat­e­gories of tech­ni­cal work

In this doc­u­ment, I will dis­cuss four broad cat­e­gories of tech­ni­cal work:

Tech­ni­cal al­ign­ment prob­lem. Re­search on the “tech­ni­cal al­ign­ment prob­lem” ei­ther ad­dresses con­cep­tual ob­sta­cles to mak­ing AI al­igned with hu­mans (e.g. ro­bust­ness, re­ward mis-speci­fi­ca­tion), or cre­ates tools and frame­works that aid in mak­ing AI al­igned (e.g. scal­able re­ward gen­er­a­tion).

De­tect­ing failures in ad­vance. In­de­pen­dently of hav­ing solved var­i­ous al­ign­ment prob­lems, we would like to have ways of prob­ing sys­tems /​ blueprints of sys­tems to know whether they are likely to be safe. Ex­am­ple top­ics in­clude in­ter­pretabil­ity, red-team­ing, or ac­cu­mu­lat­ing check­lists of failure modes to watch out for.

Method­olog­i­cal un­der­stand­ing. There is rel­a­tively lit­tle agree­ment or first-hand knowl­edge of how to make sys­tems al­igned or safe, and even less about which meth­ods for do­ing so will scale to very pow­er­ful AI sys­tems. I am per­son­ally skep­ti­cal of our abil­ity to get al­ign­ment right based on purely ab­stract ar­gu­ments with­out also hav­ing a lot of method­olog­i­cal ex­pe­rience, which is why I think work in this cat­e­gory is im­por­tant. An ex­am­ple of a method­ol­ogy-fo­cused doc­u­ment is Martin Zinke­vich’s Rules of Reli­able ML, which ad­dresses re­li­a­bil­ity of ex­ist­ing large sys­tems.

Sys­tem-build­ing. It is pos­si­ble that build­ing pow­er­ful AI will in­volve a large en­g­ineer­ing effort (say, 100+ en­g­ineers, 300k+ lines of code). In this case we need a frame­work for putting many com­po­nents to­gether in a safe way.

Tech­ni­cal al­ign­ment problem

We would ideally like to build AI that acts ac­cord­ing to some speci­fi­ca­tion of hu­man val­ues, and that is ro­bust both to er­rors in the speci­fi­ca­tion and to events in the world. To achieve this ro­bust­ness, the sys­tem likely needs to rep­re­sent un­cer­tainty about both its un­der­stand­ing of hu­man val­ues and its be­liefs about the world, and to act ap­pro­pri­ately in the face of this un­cer­tainty to avoid any catas­trophic events.

I split the tech­ni­cal al­ign­ment prob­lem cor­re­spond­ingly into four sub-cat­e­gories:

Scal­able re­ward gen­er­a­tion. Pow­er­ful AI sys­tems will po­ten­tially have to make de­ci­sions in situ­a­tions that are for­eign to hu­mans or oth­er­wise difficult to eval­u­ate—for in­stance, on scales far out­side hu­man ex­pe­rience, or in­volv­ing sub­tle but im­por­tant down­stream con­se­quences. Since mod­ern ML sys­tems are pri­mar­ily trained through hu­man-la­beled train­ing data (or more gen­er­ally, hu­man-gen­er­ated re­ward func­tions), this pre­sents an ob­sta­cle to spec­i­fy­ing which de­ci­sions are good in these situ­a­tions. Scal­able re­ward gen­er­a­tion seeks to build pro­cesses for gen­er­at­ing a good re­ward func­tion.

Re­ward learn­ing. Many au­tonomous agents seek to max­i­mize the ex­pected value of some re­ward func­tion (or more broadly, to move to­wards some speci­fied goal state /​ set of states). Op­ti­miz­ing against the re­ward func­tion in this way can cause even slight er­rors in the re­ward to lead to large er­rors in be­hav­ior—typ­i­cally, in­creased re­ward will be well-cor­re­lated with hu­man-de­sir­a­bil­ity for a while, but will be­come anti-cor­re­lated af­ter a point. Re­ward learn­ing seeks to rea­son about differ­ences be­tween the ob­served (proxy) re­ward and the true re­ward, and to con­verge to the true re­ward over time.

Out-of-dis­tri­bu­tion ro­bust­ness is the prob­lem of get­ting sys­tems to be­have well on in­puts that are very differ­ent from their train­ing data. This might be done by a com­bi­na­tion of trans­fer learn­ing (so the sys­tem works well in a broader va­ri­ety of situ­a­tions) and hav­ing more un­cer­tainty in the face of un­fa­mil­iar/​atyp­i­cal in­puts (so the sys­tem can at least no­tice where it is likely to not do well).

Act­ing con­ser­va­tively. Safe out­comes are more likely if sys­tems can no­tice situ­a­tions where it is un­clear how to act, and ei­ther avoid en­coun­ter­ing them, take ac­tions that re­duce the un­cer­tainty, or take ac­tions that are ro­bustly good. This would, for in­stance, al­low us to spec­ify an am­bigu­ous re­ward func­tion that the sys­tem could clar­ify as needed, rather than hav­ing to think about ev­ery pos­si­ble case up-front.
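The reward-learning item above says that increased proxy reward tends to track what humans want for a while and then becomes anti-correlated with it. A minimal sketch of that shape follows. The two reward functions, the scalar "state" `x`, and the greedy optimizer are all made up for illustration and are not taken from the original document.

```python
from typing import List, Tuple


def true_reward(x: float) -> float:
    """Hypothetical 'what humans want': more x helps up to x = 5, then hurts."""
    return x - 0.1 * x ** 2


def proxy_reward(x: float) -> float:
    """Hypothetical observed reward: always prefers larger x."""
    return x


def optimize_proxy(steps: int, step_size: float = 1.0) -> List[Tuple[float, float, float]]:
    """Greedy hill-climbing on the proxy only.

    Returns a list of (x, proxy_reward, true_reward), one entry per visited x.
    """
    x = 0.0
    history = []
    for _ in range(steps + 1):
        history.append((x, proxy_reward(x), true_reward(x)))
        # The optimizer never looks at true_reward; it only asks whether
        # the proxy would improve.
        if proxy_reward(x + step_size) > proxy_reward(x):
            x += step_size
    return history


if __name__ == "__main__":
    for x, proxy, true in optimize_proxy(12):
        print(f"x={x:4.1f}  proxy={proxy:5.1f}  true={true:5.2f}")
```

In this toy setup the proxy keeps rising at every step, while the true reward peaks partway through and ends up below where it started. Reward learning, as described above, would aim to notice and correct that gap rather than keep climbing.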

Act­ing con­ser­va­tively in­ter­faces with re­ward learn­ing and out-of-dis­tri­bu­tion ro­bust­ness, as the lat­ter two fo­cus on notic­ing un­cer­tainty while the former fo­cuses on what to do given the un­cer­tainty. Un­for­tu­nately, cur­rent meth­ods for con­struct­ing un­cer­tainty es­ti­mates seem in­ad­e­quate to drive such de­ci­sions, and even given a good un­cer­tainty es­ti­mate lit­tle work has been done on how the sys­tem should use it to shape its ac­tions.
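One simple way to picture "what to do given the uncertainty" is a worst-case rule over several candidate reward functions. The sketch below is only an illustration under assumptions: the maximin rule, the `max_spread` threshold, the action names, and the two reward hypotheses are all hypothetical, and the original document does not prescribe this method.

```python
from typing import Callable, List, Optional

RewardModel = Callable[[str], float]


def choose_conservatively(
    actions: List[str],
    reward_models: List[RewardModel],
    max_spread: float,
) -> Optional[str]:
    """Pick the action with the best worst-case reward across the models.

    Actions on which the models disagree by more than max_spread are skipped.
    Returns None if every action is skipped, which a caller could treat as
    'stop and ask a human for clarification'.
    """
    best_action, best_worst = None, float("-inf")
    for action in actions:
        estimates = [model(action) for model in reward_models]
        spread = max(estimates) - min(estimates)
        if spread > max_spread:
            continue  # too uncertain: avoid this action
        worst = min(estimates)
        if worst > best_worst:
            best_action, best_worst = action, worst
    return best_action


# Two hypothetical readings of an ambiguous reward specification.
ACTIONS = ["wait", "small_step", "drastic"]
HYPOTHESIS_A = {"wait": 0.0, "small_step": 1.0, "drastic": 10.0}
HYPOTHESIS_B = {"wait": 0.0, "small_step": 0.8, "drastic": -10.0}
MODELS = [HYPOTHESIS_A.__getitem__, HYPOTHESIS_B.__getitem__]

if __name__ == "__main__":
    print(choose_conservatively(ACTIONS, MODELS, max_spread=1.0))
    print(choose_conservatively(["drastic"], MODELS, max_spread=1.0))
```

Here the two hypotheses agree closely about `small_step` and sharply disagree about `drastic`, so the rule avoids the drastic action even though one hypothesis rates it highest. As the paragraph above notes, the hard part in practice is getting disagreement estimates that are good enough to drive this kind of decision.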

A toy framework. Conceptually, it may be useful to think in terms of the standard rational agent model, where an agent has a value function or utility function $\tilde{U}$, and beliefs $\tilde{P}$, and then takes actions $a$ that maximize the expected value of $\tilde{U}$ under $\tilde{P}$ (conditioned on the action $a$). Failures of alignment could come from incorrect beliefs $\tilde{P}$, or a value function $\tilde{U}$ that does not lead to what humans want. Out-of-distribution robustness seeks to avoid or notice problems with $\tilde{P}$, while scalable reward generation seeks to produce accurate information about some value function $U^*$ that is aligned with humans. Reward learning seeks to correct for inaccuracies in the reward generation process, as well as the likely limited amount of total data about rewards. Finally, acting conservatively takes into account the additional uncertainty due to acting out-of-distribution and having a learned reward function, and seeks to choose actions in a correspondingly conservative manner.

In an RL setting where we take actions via a learned policy, we can tell the same story but with a slightly modified diagram. Instead of an action $a$ we have a learned policy $\pi$, and instead of $P^*$ and $\tilde{P}$ denoting beliefs, they denote distributions over environments ($P^*$ is the true on-policy environment at deployment time, while $\tilde{P}$ is the distribution of training environments).
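The rational agent model in the toy framework can be made concrete with a very small example: an agent picks the action with the highest expected utility under its own beliefs, and things go wrong when those beliefs differ from the true distribution. This is a sketch under assumptions. The outcome names, probabilities, and utility values are invented for illustration, and the same utility function is deliberately used in both cases so that only the beliefs differ.

```python
from typing import Callable, Dict, List, Tuple

# Maps each action to a list of (probability, outcome) pairs.
Beliefs = Dict[str, List[Tuple[float, str]]]


def expected_utility(action: str, beliefs: Beliefs, utility: Callable[[str], float]) -> float:
    """Expected utility of an action under the given distribution over outcomes."""
    return sum(prob * utility(outcome) for prob, outcome in beliefs[action])


def best_action(beliefs: Beliefs, utility: Callable[[str], float]) -> str:
    """Action that maximizes expected utility under the given beliefs."""
    return max(beliefs, key=lambda a: expected_utility(a, beliefs, utility))


# Hypothetical utility over outcomes, shared by both scenarios below.
UTILITY = {"good": 1.0, "ok": 0.5, "bad": -5.0}
utility = UTILITY.__getitem__

# Hypothetical true distribution over outcomes for each action.
TRUE_WORLD: Beliefs = {
    "risky": [(0.5, "good"), (0.5, "bad")],
    "safe": [(1.0, "ok")],
}

# Hypothetical agent beliefs that underestimate the chance of a bad outcome.
AGENT_BELIEFS: Beliefs = {
    "risky": [(0.99, "good"), (0.01, "bad")],
    "safe": [(1.0, "ok")],
}

if __name__ == "__main__":
    chosen = best_action(AGENT_BELIEFS, utility)
    print("agent chooses:", chosen)
    print("true expected utility of that choice:",
          expected_utility(chosen, TRUE_WORLD, utility))
    print("best action under the true distribution:",
          best_action(TRUE_WORLD, utility))
```

With these numbers the agent is optimizing correctly given what it believes, yet it picks an action that is bad under the true distribution. That is the kind of failure the framework attributes to incorrect beliefs; an error in the utility function could be illustrated the same way by holding the beliefs fixed and changing `UTILITY`.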

Other top­ics. Beyond the top­ics above, the prob­lem of coun­ter­fac­tual rea­son­ing cuts across mul­ti­ple cat­e­gories and seems worth study­ing on its own. There may be other im­por­tant cat­e­gories of tech­ni­cal work as well.

De­tect­ing failures in advance

The previous section lays out a list of obstacles to AI alignment and technical directions for working on them. This list may not be exhaustive, so we should also develop tools for discovering new potential alignment issues. Even for the existing issues, we would like ways of being more confident that we have solved them and of identifying which sub-problems remain.

While ma­chine learn­ing of­ten prefers to hew close to em­piri­cal data, much of the roadmap for AI al­ign­ment has in­stead fol­lowed from more ab­stract con­sid­er­a­tions and thought ex­per­i­ments, such as ask­ing “What would hap­pen if this re­ward func­tion were op­ti­mized as far as pos­si­ble? Would the out­come be good?” I ac­tu­ally think that ML un­der­val­ues this ab­stract ap­proach and ex­pect it to con­tinue to be fruit­ful, both for point­ing to use­ful high-level re­search ques­tions and for an­a­lyz­ing con­crete sys­tems and ap­proaches.

At the same time, I am un­com­fortable rely­ing solely on ab­stract ar­gu­ments for de­tect­ing po­ten­tial failures. Ri­gor­ous em­piri­cal test­ing can make us more con­fi­dent that a prob­lem is ac­tu­ally solved and ex­pose is­sues we might have missed. Find­ing con­crete in­stan­ti­a­tions of a prob­lem can both more fruit­fully di­rect work and con­vince a larger set of peo­ple to care about it (as in the case of ad­ver­sar­ial ex­am­ples for images). More broadly, em­piri­cal in­ves­ti­ga­tions have the po­ten­tial to re­veal new is­sues that were missed un­der purely ab­stract con­sid­er­a­tions.

Two more em­piri­cally-fo­cused ways of de­tect­ing failures are model prob­ing/​vi­su­al­iza­tion and red-team­ing, dis­cussed be­low. Also valuable is ex­am­in­ing trends in ML. For in­stance, it looks to me like re­ward hack­ing in real de­ployed sys­tems is be­com­ing a big­ger is­sue over time; this pro­vides con­crete in­stances of the prob­lem to ex­am­ine for in­sight, gives us a way to mea­sure how well we’re do­ing at the prob­lem, and helps rally a com­mu­nity around the prob­lem. Ex­am­in­ing trends is also a good way to take an ab­stract con­sid­er­a­tion and make it more con­crete.
