Alignment Newsletter #43

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.


AlphaStar: Mastering the Real-Time Strategy Game StarCraft II (The AlphaStar team): The AlphaStar system from DeepMind has beaten top human pros at StarCraft. You can read about the particular details of the matches in many sources, such as the blog post itself, this Vox article, or Import AI. The quick summary is that while there are some reasons you might not think it is conclusively superhuman yet (notably, it only won when it didn’t have to manipulate the camera, and even then it may have had short bursts of very high actions per minute that humans can’t do), it is clearly extremely good at StarCraft, both at the technically precise micro level and at the strategic macro level.

I want to focus instead on the technical details of how AlphaStar works. The key ideas seem to be a) using imitation learning to get policies that do something reasonable to start with and b) training a population of agents in order to explore the full space of strategies and how to play against all of them, without any catastrophic forgetting. Specifically, they take a dataset of human games and train various agents to mimic humans. This allows them to avoid the particularly hard exploration problems that happen when you start with a random agent. Once they have these agents to start with, they begin to do population-based training, where they play agents against each other and update their weights using an RL algorithm. The population of agents evolves over time, with well-performing agents splitting into two new agents that diversify a bit more. Some agents also have auxiliary rewards that encourage them to explore different parts of the strategy space—for example, an agent might get reward for building a specific type of unit. Once training is done, we have a final population of agents. Using their empirical win probabilities, we can construct a Nash equilibrium of these agents, which forms the final AlphaStar agent. (Note: I’m not sure if at the beginning of the game, one of the agents is chosen according to the Nash probabilities, or if at each timestep an action is chosen according to the Nash probabilities. I would expect the former, since the latter would result in one agent making a long-term plan that is then ruined by a different agent taking some other action, but the blog post seems to indicate the latter—with the former, it’s not clear why the compute ability of a GPU restricts the number of agents in the Nash equilibrium, which the blog post mentions.)
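As a toy illustration of that last step (the payoff matrix, its scale, and the algorithm below are invented stand-ins, not AlphaStar’s actual league), one can approximate the Nash mixture over a small population of agents from their empirical win probabilities with a simple no-regret method such as regret matching:

```python
import numpy as np

# Hypothetical meta-game over three trained agents: entry [i, j] is
# agent i's payoff against agent j (win probability rescaled to [-1, 1]).
# This cyclic matrix is a made-up stand-in for AlphaStar's real league.
payoffs = np.array([
    [ 0.0,  0.6, -0.6],
    [-0.6,  0.0,  0.6],
    [ 0.6, -0.6,  0.0],
])

def approximate_nash(A, iters=20000):
    """Approximate the Nash mixture of a symmetric zero-sum meta-game
    via regret matching in self-play; returns the time-averaged strategy."""
    n = A.shape[0]
    regrets = np.zeros(n)
    strategy = np.zeros(n)
    strategy[0] = 1.0                      # deliberately bad starting point
    strategy_sum = np.zeros(n)
    for _ in range(iters):
        strategy_sum += strategy
        action_payoffs = A @ strategy      # payoff of each pure agent choice
        value = strategy @ action_payoffs  # value of the current mixture
        regrets += action_payoffs - value  # regret for not playing each agent
        positive = np.maximum(regrets, 0.0)
        if positive.sum() > 0:
            strategy = positive / positive.sum()
        else:
            strategy = np.ones(n) / n
    return strategy_sum / strategy_sum.sum()

print(approximate_nash(payoffs))  # close to [1/3, 1/3, 1/3] for this cyclic game
```

For this perfectly cyclic toy matrix the Nash mixture is uniform; with real league statistics the mixture would concentrate on the hardest-to-exploit agents.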

There are also a bunch of interesting technical details on how they get this to actually work, which you can get some information about in this Reddit AMA. For example, “we included a policy distillation cost to ensure that the agent continues to try human-like behaviours with some probability throughout training, and this makes it much easier to discover unlikely strategies than when starting from self-play”, and “there are elements of our research (for example temporally abstract actions that choose how many ticks to delay, or the adaptive selection of incentives for agents) that might be considered ‘hierarchical’”. But it’s probably best to wait for the journal publication (which is currently in preparation) for the full details.

I’m particularly interested by this Balduzzi et al paper that gives some more theoretical justification for the population-based training. In particular, the paper introduces the concept of “gamescapes”, which can be thought of as a geometric visualization of which strategies beat which other strategies. In some games, like “say a number between 1 and 10; you get reward equal to your number minus your opponent’s number”, the gamescape is a 1-D line—there is a scalar value of “how good a strategy is”, and a better strategy will beat a weaker strategy. On the other hand, rock-paper-scissors is a cyclic game, and the gamescape looks like a triangle—there’s no strategy that strictly dominates all other strategies. Even the Nash strategy of randomizing between all three actions is not the “best”, in that it fails to exploit suboptimal strategies, eg. the strategy of always playing rock. With games that are even somewhat cyclic (such as StarCraft), rather than trying to find the Nash equilibrium, we should try to explore and map out the entire strategy space. The paper also has some theoretical results supporting this that I haven’t read through in detail.
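The rock-paper-scissors claim is easy to check numerically. In this sketch (standard RPS payoffs, nothing from the paper itself), the uniform Nash strategy only breaks even against an exploitable opponent that a best response would beat outright:

```python
import numpy as np

# Rock-paper-scissors payoffs for the row player; entry [i, j] is the
# payoff of action i against action j. Order: rock, paper, scissors.
A = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
])

nash = np.array([1/3, 1/3, 1/3])    # the unique Nash strategy
always_rock = np.array([1, 0, 0])   # a very suboptimal opponent

# The Nash strategy merely breaks even against always-rock...
print(nash @ A @ always_rock)       # 0.0
# ...while the best response (always paper) exploits it fully.
print(A[1] @ always_rock)           # 1
```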

Rohin’s opinion: I don’t care very much about whether AlphaStar is superhuman or not—it clearly is very good at StarCraft at both the micro and macro levels. Whether it hits the rather arbitrary level of “top human performance” is not as interesting as the fact that it is anywhere in the ballpark of “top human performance”.

It’s interesting to compare this to OpenAI Five (AN #13). While OpenAI solved the exploration problem using a combination of reward shaping and domain randomization, DeepMind solved it by using imitation learning on human games. While OpenAI relied primarily on self-play, DeepMind used population-based training in order to deal with catastrophic forgetting and in order to be robust to many different strategies. It’s possible that this is because of the games they were playing—it’s plausible to me that StarCraft has more rock-paper-scissors-like cyclic mechanics than Dota, and so it’s more important to be robust to many strategies in StarCraft. But I don’t know either game very well, so this is pure speculation.

Exploring the full strategy space rather than finding the Nash equilibrium seems like the right thing to do, though I haven’t kept up with the multiagent RL literature so take that with a grain of salt. That said, it doesn’t seem like the full solution—you also want some way of identifying what strategy your opponent is playing, so that you can choose the optimal strategy to play against them.

I often think about how you can build AI systems that cooperate with humans. This can be significantly harder: in competitive games, if your opponent is more suboptimal than you were expecting, you just crush them even harder. However, in a cooperative game, if you make a bad assumption about what your partner will do, you can get significantly worse performance. (If you’ve played Hanabi, you’ve probably experienced this.) Self-play does not seem like it would handle this situation, but this kind of population-based training could potentially handle it, if you also had a method to identify how your partner is playing. (Without such a method, you would play some generic strategy that would hopefully be quite robust to playstyles, but would still not be nearly as good as being able to predict what your partner does.)

Read more: Open-ended Learning in Symmetric Zero-sum Games, AMA with AlphaStar creators and pro players, and Vox: StarCraft is a deep, complicated war strategy game. Google’s AlphaStar AI crushed it.

Disentangling arguments for the importance of AI safety (Richard Ngo): This post lays out six distinct arguments for the importance of AI safety. First, the classic argument that expected utility maximizers (or, as I prefer to call them, goal-directed agents) are dangerous because of Goodhart’s Law, fragility of value and convergent instrumental subgoals. Second, we don’t know how to robustly “put a goal” inside an AI system, such that its behavior will then look like the pursuit of that goal. (As an analogy, evolution might seem like a good way to get agents that pursue reproductive fitness, but it ended up creating humans who decidedly do not pursue reproductive fitness single-mindedly.) Third, as we create many AI systems that gradually become the main actors in our economy, these AI systems will control most of the resources of the future. There will likely be some divergence between what the AI “values” and what we value, and for sufficiently powerful AI systems we will no longer be able to correct these divergences, simply because we won’t be able to understand their decisions. Fourth, it seems that a good future requires us to solve hard philosophy problems that humans cannot yet solve (so that even if the future was controlled by a human it would probably not turn out well), and so we would need to either solve these problems or figure out an algorithm to solve them. Fifth, powerful AI capabilities could be misused by malicious actors, or they could inadvertently lead to doom through coordination failures, eg. by developing ever more destructive weapons. Finally, the broadest argument is simply that AI is going to have a large impact on the world, and so of course we want to ensure that the impact is positive.

Richard then speculates on what inferences to make from the fact that different people have different arguments for working on AI safety. His primary takeaway is that we are still confused about what problem we are solving, and so we should spend more time clarifying fundamental ideas and describing particular deployment scenarios and corresponding threat models.

Rohin’s opinion: I think the overarching problem is the last one, that AI will have large impacts and we don’t have a strong story for why they will necessarily be good. Since it is very hard to predict the future, especially with new technologies, I would expect that different people trying to concretize this very broad worry into a more concrete one would end up with different scenarios, and this mostly explains the proliferation of arguments. Richard does note a similar effect by considering the example of what arguments the original nuclear risk people could have made, and finding a similar proliferation of arguments.

Setting aside the overarching argument #6, I find all of the arguments fairly compelling, but I’m probably most worried about #1 (suitably reformulated in terms of goal-directedness) and #2. It’s plausible that I would also find some of the multiagent worries more compelling once more research has been done on them; so far I don’t have much clarity about them.

Technical AI alignment

Iterated amplification sequence

Learning with catastrophes (Paul Christiano): In iterated amplification, we need to train a fast agent from a slow one produced by amplification (AN #42). We need this training to be such that the resulting agent never does anything catastrophic at test time. In iterated amplification, we do have the benefit of having a strong overseer who can give good feedback. This suggests a formalization for catastrophes. Suppose there is some oracle that can take any sequence of observations and actions and label it as catastrophic or not. How do we use this oracle to train an agent that will never produce catastrophic behavior at test time?

Given unlimited compute and unlimited access to the oracle, this problem is easy: simply search over all possible environments and ask the oracle if the agent behaves catastrophically on them. If any such behavior is found, train the agent to not perform that behavior any more. Repeat until all catastrophic behavior is eliminated. This is basically a very strong form of adversarial training.
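Here is a minimal toy version of that loop; the enumerable “environments”, the tabular agent, and the oracle are all invented for illustration, since the real setting is of course not exhaustively searchable:

```python
# A toy version of the exhaustive search described above; the
# "environments", tabular agent, and oracle are all invented, and the
# real setting is of course not exhaustively searchable.

ENVIRONMENTS = ["calm", "storm", "trap"]   # every possible situation
ACTIONS = ["safe", "risky"]

def oracle_is_catastrophic(env, action):
    """The oracle labels behavior: 'risky' in the trap is catastrophic."""
    return env == "trap" and action == "risky"

# A tabular agent, initialized to behave badly everywhere.
policy = {env: "risky" for env in ENVIRONMENTS}

# Search all environments, ask the oracle about the agent's behavior,
# and train away any catastrophe found; repeat until none remain.
found = True
while found:
    found = False
    for env in ENVIRONMENTS:
        if oracle_is_catastrophic(env, policy[env]):
            policy[env] = "safe"           # "train it not to do that"
            found = True

print(policy)  # only the trap environment's action has changed
```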

Rohin’s opinion: I’m not sure how necessary it is to explicitly aim to avoid catastrophic behavior—it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice. However, based on Techniques for optimizing worst-case performance, summarized below, it seems like the motivation is actually to avoid catastrophic failures of corrigibility, as opposed to all catastrophes.

In fact, we can see that we can’t avoid all catastrophes without some assumption on either the environment or the oracle. Suppose the environment can do anything computable, and the oracle evaluates behavior only based on outcomes (observations). In this case, for any observation that the oracle would label as catastrophic, there is an environment that outputs that observation regardless of the agent’s action, and so there is no agent that can always avoid catastrophe. So for this problem to be solvable, we need to either have a limit on what the environment “could do”, or an oracle that judges “catastrophe” based on the agent’s actions in addition to outcomes. The latter option can cash out to “are the actions in this transcript knowably going to cause something bad to happen”, which sounds very much like corrigibility.

Thoughts on reward engineering (Paul Christiano): This post digs into some of the “easy” issues with reward engineering (where we must design a good reward function for an agent, given access to a stronger overseer).

First, in order to handle outcomes over long time horizons, we need to have the reward function capture the overseer’s evaluation of the long-term consequences of an action, since it isn’t feasible to wait until the outcomes actually happen.

Second, since human judgments are inconsistent and unreliable, we could have the agent choose an action such that there is no other action which the overseer would evaluate as better in a comparison between the two. (This is not exactly right—the human’s comparisons could be such that this is an impossible standard. The post uses a two-player game formulation that avoids the issue, and gives the guarantee that the agent won’t choose something that is unambiguously worse than another option.)
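As a toy sketch of that comparison-based standard (the actions and overseer judgments below are hypothetical), an agent can search for actions that no alternative beats head-to-head, and a preference cycle shows why the naive version can fail:

```python
# A toy sketch of "choose an action that no alternative beats in a
# pairwise comparison"; the actions and overseer judgments are made up.
# prefers(a, b) == True means the overseer, comparing a and b, judges a
# to be better; the relation is allowed to be inconsistent.

def undominated_actions(actions, prefers):
    """Return the actions that no other action beats head-to-head."""
    return [a for a in actions
            if not any(prefers(b, a) for b in actions if b != a)]

# An inconsistent overseer: A beats B and B beats C, but C and A tie.
judgments = {("A", "B"), ("B", "C")}
prefers = lambda x, y: (x, y) in judgments
print(undominated_actions(["A", "B", "C"], prefers))  # ['A']

# With a full preference cycle there is no undominated action at all,
# which is why the post needs the two-player game formulation instead.
cycle = {("A", "B"), ("B", "C"), ("C", "A")}
print(undominated_actions(["A", "B", "C"], lambda x, y: (x, y) in cycle))  # []
```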

Third, since the agent will be uncertain about the overseer’s reward, it will have the equivalent of normative uncertainty—how should it trade off between different possible reward functions the overseer could have? One option is to choose a particular yardstick, eg. how much the overseer values a minute of their time, some small amount of money, etc. and normalize all rewards to that yardstick.
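A minimal sketch of the yardstick idea, with made-up hypotheses, posteriors and rewards: rescale each candidate reward function so the yardstick outcome is worth exactly one unit, then average under the agent’s posterior:

```python
# A toy sketch of the yardstick idea; the hypotheses, posteriors, and
# rewards are all made up. Each candidate reward function is rescaled so
# that the yardstick outcome ("minute_saved") is worth exactly 1 unit.

hypotheses = [
    {"posterior": 0.5, "rewards": {"minute_saved": 10.0, "task_done": 50.0}},
    {"posterior": 0.5, "rewards": {"minute_saved": 0.1, "task_done": 0.2}},
]

def normalized_value(outcome, hypotheses, yardstick="minute_saved"):
    """Expected reward of an outcome after normalizing each hypothesis
    so that the yardstick outcome has reward 1."""
    return sum(h["posterior"] * h["rewards"][outcome] / h["rewards"][yardstick]
               for h in hypotheses)

# Without normalization the first hypothesis would dominate; with it,
# task_done is worth 5 yardstick units under one hypothesis and 2 under
# the other, for an expected value of 3.5.
print(normalized_value("task_done", hypotheses))  # 3.5
```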

Fourth, when there are decisions with very widely-varying scales of rewards, traditional algorithms don’t work well. Normally we could focus on the high-stakes decisions and ignore the others, but if the high-stakes decisions occur infrequently then all decisions are about equally important. In this case, we could oversample high-stakes decisions and reduce their rewards (i.e. importance sampling) to use traditional algorithms to learn effectively without changing the overall “meaning” of the reward function. However, very rare, high-stakes decisions will probably require additional techniques.
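The importance-sampling trick is easy to sanity-check numerically: oversampling high-stakes decisions while scaling their rewards down by the inverse of the oversampling ratio leaves the expected reward unchanged. All probabilities and rewards below are invented:

```python
import random

# A toy check (all numbers invented) that oversampling high-stakes
# decisions, while scaling their rewards down by the inverse of the
# oversampling ratio, leaves the expected reward unchanged.

random.seed(0)

TRUE_PROB = {"high_stakes": 0.01, "mundane": 0.99}   # how often each occurs
REWARD = {"high_stakes": 1000.0, "mundane": 1.0}
SAMPLE_PROB = {"high_stakes": 0.5, "mundane": 0.5}   # oversampled 50/50

n = 200_000
total = 0.0
for _ in range(n):
    kind = "high_stakes" if random.random() < SAMPLE_PROB["high_stakes"] else "mundane"
    weight = TRUE_PROB[kind] / SAMPLE_PROB[kind]     # importance weight
    total += weight * REWARD[kind]

true_mean = sum(TRUE_PROB[k] * REWARD[k] for k in TRUE_PROB)
print(true_mean, total / n)   # the two estimates agree closely
```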

Fifth, for sparse reward functions where most behavior is equally bad, we need to provide “hints” about what good behavior looks like. Reward shaping is the main current approach, but we do need to make sure that by the end of training we are using the true reward, not the shaped one. Lots of other information such as demonstrations can also be taken as hints that allow you to get higher reward.

Finally, the reward will likely be sufficiently complex that we cannot write it down, and so we’ll need to rely on an expensive evaluation by the overseer. We will probably need semi-supervised RL in order to make this sufficiently computationally efficient.

Rohin’s opinion: As the post notes, these problems are only “easy” in the conceptual sense—the resulting RL problems could be quite hard. I feel most confused about the third and fourth problems. Choosing a yardstick could work to aggregate reward functions, but I still worry about the issue that this tends to overweight reward functions that assign a low value to the yardstick but high value to other outcomes. With widely-varying rewards, it seems hard to importance sample high-stakes decisions without knowing what those decisions might be. Maybe if we notice a very large reward, we could lower it but oversample it in the future? Something like this could potentially work, but I don’t see how yet.

For complex, expensive-to-evaluate rewards, Paul suggests using semi-supervised learning; this would be fine if semi-supervised learning was sufficient, but I worry that there actually isn’t enough information in just a few evaluations of the reward function to narrow down on the true reward sufficiently, which means that even conceptually we will need something else.

Techniques for optimizing worst-case performance (Paul Christiano): There are “benign” failures of worst-case performance, where the AI system encounters a novel situation and behaves weirdly, but not in a way that systematically disfavors human values. As I noted above, we can’t get rid of all of these, but that’s probably fine. We instead would like to focus on “malign” failures, where the AI system applies its intelligence in pursuit of the wrong goal. There are a few techniques that could be applied to this problem.

With adversarial training, we can have one system find inputs on which our agent fails catastrophically, and then train the agent to avoid those behaviors. The main issue here is that there will likely be some failures that aren’t found.

On the other hand, verification has strong guarantees, but suffers from the problem that it is hard to know what to specify, and it is computationally expensive to actually perform verification. If we have a strong trusted overseer, eg. produced by amplification, we could use it as a very expensive specification. Alternatively, we could use a catastrophe-checker in lieu of a specification. (Note: While I can see some ways of using catastrophe-checkers, the post seems to have a specific method in mind that I don’t understand.)

Any malign failure must be using the intelligence of the agent somehow, and the agent is only intelligent on the training data, so if we can use strong transparency techniques on the training data, we could find such failures. However, if you use transparency for this, by default you weed out the comprehensible failures and leave in the incomprehensible ones.

Rohin’s opinion: While I agree that it would be great to optimize worst-case performance, all of these techniques feel quite difficult to do scalably and with guarantees. With adversarial training, you need to find all of the ways that an agent could fail, while catastrophe could happen if the agent stumbles across any of these methods. It seems plausible to me that with sufficient additional information given to the adversary we can meet this standard, but it seems very hard to knowably meet this standard, i.e. to have a strong argument that we will find all of the potential issues.

With verification, the specification problem seems like a deal-breaker, unless combined with other methods: a major point with AI safety is that we can’t write down a good specification for what we want. If we instead only use verification to propagate worst-case guarantees from one model to another (as the post suggests), then it seems possible in theory, but very expensive in practice: most verification techniques assume unlimited fast access to the specification, whereas our specification is very expensive to query. Of course, not much research has focused on this setting, so we can expect large gains; nonetheless you do need to somehow extrapolate the specification to all possible inputs, which seems hard to do with limited access to the specification.

Transparency also seems like it provides additional safety, rather than making any guarantees, since we probably can’t get a guarantee that our transparency mechanisms can show us all possible failure modes in a way that we understand them. The argument that we can only focus on the training data makes the assumption that the AI system is not going to generalize well outside of the training dataset. While I’m sympathetic to this assumption (we really don’t have good methods for generalization, and there are strong reasons to expect generalization to be near-impossible), it isn’t one that I’m confident about, especially when we’re talking about general intelligence.

Of course, I’m still excited for more research to be done on these topics, since they do seem to cut out some additional failure modes. But if we’re looking for a semi-formal strong argument that we will have good worst-case performance, I don’t yet see reasons for optimism.

Value learning sequence

The human side of interaction (Rohin Shah): The lens of human-AI interaction (AN #41) also suggests that we should focus on what the human should do in AI alignment.

Any feedback that the AI system gets must be interpreted using some assumption. For example, when a human provides an AI system a reward function, it shouldn’t be interpreted as a description of optimal behavior in every possible situation (which is what we currently do implicitly). Inverse Reward Design (IRD) suggests an alternative, more realistic assumption: the reward function is likely to the extent that it leads to high true utility in the training environment. Similarly, in inverse reinforcement learning (IRL) human demonstrations are often interpreted under the assumption of Boltzmann rationality.
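A minimal sketch of the Boltzmann-rationality assumption (made-up rewards, with beta as the rationality parameter): the demonstrator is modeled as choosing each option with probability proportional to exp(beta * reward):

```python
import math

# A toy sketch of the Boltzmann-rationality assumption used to interpret
# human demonstrations in IRL: the human is modeled as picking option i
# with probability proportional to exp(beta * reward_i). The rewards
# here are invented for illustration.

def boltzmann_likelihood(rewards, beta=1.0):
    """P(demonstrator chooses each option), Boltzmann-rational with
    rationality parameter beta."""
    weights = [math.exp(beta * r) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]

# Three candidate trajectories with true rewards 0, 1 and 2.
print(boltzmann_likelihood([0.0, 1.0, 2.0], beta=1.0))
# As beta grows the demonstrator approaches a perfect optimizer...
print(boltzmann_likelihood([0.0, 1.0, 2.0], beta=10.0))
# ...and as beta goes to 0 the demonstrations carry no information.
print(boltzmann_likelihood([0.0, 1.0, 2.0], beta=0.0))  # uniform
```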

Analogously, we may also want to train humans to give feedback to AI systems in the manner that they are expecting. With IRD, the reward designer should make sure to test the reward function extensively in the training environment. If we want our AI system to help us with long-term goals, we may want the overseers to be much more cautious and uncertain in their feedback (depending on how such feedback is interpreted). Techniques that learn to reason like humans, such as iterated amplification and debate, would by default learn to interpret feedback the way humans do. Nevertheless it will probably be useful to train humans to provide useful feedback: for example, in debate, we want humans to judge which side provided more true and useful information.

Future directions for narrow value learning (Rohin Shah): This post summarizes some future directions for narrow value learning that I’m particularly interested in from a long-term perspective.


Disentangling arguments for the importance of AI safety (Richard Ngo): Summarized in the highlights!

Agent foundations

Clarifying Logical Counterfactuals (Chris Leong)

Learning human intent

ReNeg and Backseat Driver: Learning from Demonstration with Continuous Human Feedback (Jacob Beck et al)

Handling groups of agents

Theory of Minds: Understanding Behavior in Groups Through Inverse Planning (Michael Shum, Max Kleiman-Weiner et al) (summarized by Richard): This paper introduces Composable Team Hierarchies (CTH), a representation designed for reasoning about how agents reason about each other in collaborative and competitive environments. CTH uses two “planning operators”: the Best Response operator returns the best policy in a single-agent game, and the Joint Planning operator returns the best team policy when all agents are cooperating. Competitive policies can then be derived via recursive application of those operations to subsets of agents (while holding the policies of other agents fixed). CTH draws from ideas in level-K planning (in which each agent assumes all other agents are at level K-1) and cooperative planning, but is more powerful than either approach.

The authors experiment with using CTH to probabilistically infer policies and future actions of agents participating in the stag-hunt task; they find that these judgements correlate well with human data.
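The level-K idea that CTH draws on can be sketched in a few lines; the game here is ordinary rock-paper-scissors rather than the paper’s stag-hunt task, purely for illustration:

```python
import numpy as np

# A toy sketch of level-K reasoning (hypothetical, not the paper's code):
# a level-K player best-responds to a level-(K-1) model of the other
# player, and level 0 plays uniformly at random.

NAMES = ["rock", "paper", "scissors"]
A = np.array([          # row player's payoffs; since A.T == -A, the
    [ 0, -1,  1],       # column player faces the same matrix
    [ 1,  0, -1],
    [-1,  1,  0],
], dtype=float)

def level_k_policy(k):
    """Mixed strategy of a level-k player (ties broken toward index 0)."""
    if k == 0:
        return np.ones(3) / 3              # level 0: no model of others
    opponent = level_k_policy(k - 1)       # recurse one level down
    policy = np.zeros(3)
    policy[np.argmax(A @ opponent)] = 1.0  # best respond
    return policy

for k in range(1, 5):
    print(k, NAMES[int(np.argmax(level_k_policy(k)))])
# Best responses chase each other in a cycle (rock, paper, scissors,
# rock, ...): the same cyclic structure discussed in the highlights.
```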

Richard’s opinion: This is a cool theoretical framework. Its relevance depends on how likely you think it is that social cognition will be a core component of AGI, as opposed to just another task to be solved using general-purpose reasoning. I imagine that most AI safety researchers lean towards the latter, but there are some reasons to give credence to the former.


Forecasting Transformative AI: An Expert Survey (Ross Gruetzemacher et al)

Near-term concerns

Fairness and bias

Identifying and Correcting Label Bias in Machine Learning (Heinrich Jiang and Ofir Nachum)

AI strategy and policy

FLI Podcast - Artificial Intelligence: American Attitudes and Trends (Ariel Conn and Baobao Zhang): This is a podcast about The American Public’s Attitudes Concerning Artificial Intelligence (AN #41); see that issue for my very brief summary.

Other progress in AI


Amplifying the Imitation Effect for Reinforcement Learning of UCAV’s Mission Execution (Gyeong Taek Lee et al)

Reinforcement learning

AlphaStar: Mastering the Real-Time Strategy Game StarCraft II (The AlphaStar team): Summarized in the highlights!

Deep learning

Attentive Neural Processes (Hyunjik Kim et al)


SafeML ICLR 2019 Call for Papers (Victoria Krakovna et al): The SafeML workshop has a paper submission deadline of Feb 22, and is looking for papers on specification, robustness and assurance (based on Building safe artificial intelligence: specification, robustness, and assurance (AN #26)).