Prosaic AI alignment

(Related: a possible stance for AI control.)

It’s conceivable that we will build “prosaic” AGI, which doesn’t reveal any fundamentally new ideas about the nature of intelligence or turn up any “unknown unknowns.” I think we wouldn’t know how to align such an AGI; moreover, in the process of building it, we wouldn’t necessarily learn anything that would make the alignment problem more approachable. So I think that understanding this case is a natural priority for research on AI alignment.

In particular, I don’t think it is reasonable to say “we’ll know how to cross that bridge when we come to it,” or “it’s impossible to do meaningful work without knowing more about what powerful AI will look like.” If you think that prosaic AGI is plausible, then we may already know what the bridge will look like when we get to it: if we can’t do meaningful work now, then we have a problem.

1. Prosaic AGI

It now seems possible that we could build “prosaic” AGI, which can replicate human behavior but doesn’t involve qualitatively new ideas about “how intelligence works:”

  • It’s plausible that a large neural network can replicate “fast” human cognition, and that by coupling it to simple computational mechanisms — short- and long-term memory, attention, etc. — we could obtain a human-level computational architecture.

  • It’s plausible that a variant of RL can train this architecture to actually implement human-level cognition. This would likely involve some combination of ingredients like model-based RL, imitation learning, or hierarchical RL. There are a whole bunch of ideas currently on the table and being explored; if you can’t imagine any of these ideas working out, then I feel that’s a failure of imagination (unless you see something I don’t).
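The training idea in the second bullet can be caricatured in a few lines. This is purely an illustrative sketch, not any proposed method: exact policy gradient ascent on a toy two-armed bandit with a softmax policy stands in for “a variant of RL training a parameterized architecture”; real proposals would add the model-based, imitation, or hierarchical ingredients mentioned above.

```python
# Illustrative sketch only: policy gradient on a toy two-armed bandit.
# The payoffs, learning rate, and step count are all made up.
import numpy as np

expected_reward = np.array([0.2, 0.8])  # arm 1 pays more on average
theta = np.zeros(2)                     # policy parameters, one per arm

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    value = probs @ expected_reward            # J(theta) = E[reward]
    # Exact policy gradient: dJ/dtheta_a = pi(a) * (r(a) - J(theta))
    theta += 0.1 * probs * (expected_reward - value)

print(softmax(theta))  # the policy concentrates on the better arm
```

The point of the sketch is only that the recipe is writable-down: a differentiable policy plus a scalar reward signal plus gradient ascent, with everything else being “details.”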

We will certainly learn something by developing prosaic AGI. The very fact that there were no qualitatively new ideas is itself surprising. And beyond that, we’ll get a few more bits of information about which particular approach works, fill in a whole bunch of extra details about how to design and train powerful models, and actually get some experimental data.

But none of these developments seem to fundamentally change the alignment problem, and existing approaches to AI alignment are not bottlenecked on this kind of information. Actually having the AI in front of us may let us work several times more efficiently, but it’s not going to move us from “we have no idea how to proceed” to “now we get it.”

2. Our current state

2a. The concern

If we build prosaic superhuman AGI, it seems most likely that it will be trained by reinforcement learning (extending other frameworks to superhuman performance would require new ideas). It’s easy to imagine a prosaic RL system learning to play games with superhuman levels of competence and flexibility. But we don’t have any shovel-ready approach to training an RL system to autonomously pursue our values.

To illustrate how this can go wrong, imagine using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit. If we had very powerful RL systems, such a DAO might be able to outcompete human organizations at a wide range of tasks — producing and selling cheaper widgets, but also influencing government policy, extorting/manipulating other actors, and so on.

The shareholders of such a DAO may be able to capture the value it creates as long as they are able to retain effective control over its computing hardware / reward signal. Similarly, as long as such DAOs are weak enough to be effectively governed by existing laws and institutions, they are likely to benefit humanity even if they reinvest all of their profits.

But as AI improves, these DAOs would become much more powerful than their human owners or law enforcement. And we have no ready way to use a prosaic AGI to actually represent the shareholders’ interests, or to govern a world dominated by superhuman DAOs. In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal.

I feel pessimistic about human prospects in such a world.

2b. Behaving cautiously

We could respond by not letting powerful RL systems act autonomously, or by handicapping them enough that we can maintain effective control.

This leads us to a potentially precarious situation: everyone agrees to deploy handicapped systems over which they can maintain meaningful control. But any actor can gain an economic advantage by skimping on such an agreement, and some people would prefer a world dominated by RL agents to one dominated by humans. So there are incentives for defection; if RL systems are very powerful, then these incentives may be large, and even a small number of defectors may be able to rapidly overtake the honest majority which uses handicapped AI systems.

This makes AI a “destructive technology” with similar characteristics to e.g. nuclear weapons, a situation I described in my last post. Over the long run I think we will need to reliably cope with this kind of situation, but I don’t think we are there yet. I think we could probably handle this situation, but there would definitely be a significant risk of trouble.

The situation is especially risky if AI progress is surprisingly rapid, if the alignment problem proves to be surprisingly difficult, if the political situation is tense or dysfunctional, if other things are going wrong at the same time, if AI development is fragmented, if there is a large “hardware overhang,” and so on.

I think that there are relatively few plausible ways that humanity could permanently and irreversibly disfigure its legacy. So I am extremely unhappy with “a significant risk of trouble.”

2c. The current state of AI alignment

We know many approaches to alignment; it’s just that none of them is at the stage of something you could actually implement (“shovel-ready”) — instead they are at the stage of research projects with an unpredictable and potentially long timetable.

For concreteness, consider two intuitively appealing approaches to AI alignment:

  • Inverse reinforcement learning (IRL): AI systems could infer human preferences from human behavior, and then try to satisfy those preferences.

  • Natural language: AI systems could have an understanding of natural language, and then execute instructions described in natural language.

Neither of these approaches is shovel-ready, in the sense that we have no idea how to actually write code that implements either of them — you would need to have some good ideas before you even knew what experiments to run.
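To make the gap concrete, here is roughly how far one can get today on the IRL idea: a heavily simplified sketch in which the unknown reward is assumed to be linear in state features, and weights are fit perceptron-style so that the expert’s observed behavior outscores random behavior (in the spirit of apprenticeship learning). Everything in it (the toy states, the trajectories, the update rule) is made up for illustration, and the distance between this and code that infers real human preferences is exactly the “not shovel-ready” gap.

```python
# Toy, hypothetical IRL sketch: recover a linear reward from expert behavior
# by matching feature expectations. Not a proposed method, just the shape of
# the idea in runnable form.
import numpy as np

rng = np.random.default_rng(0)
n_states = 4
features = np.eye(n_states)             # one-hot state features
# (The "human" secretly values state 3; the learner never sees this.)

def feature_expectations(trajectories):
    """Average feature vector over all visited states."""
    visits = np.zeros(n_states)
    for traj in trajectories:
        for s in traj:
            visits += features[s]
    return visits / sum(len(t) for t in trajectories)

# Expert demonstrations mostly visit state 3; random behavior is uniform.
expert = [[3, 3, 2, 3, 3], [3, 3, 3, 1, 3]]
random_policy = [list(rng.integers(0, n_states, 5)) for _ in range(20)]

mu_expert = feature_expectations(expert)
mu_random = feature_expectations(random_policy)

# Perceptron-style updates: push w until expert behavior scores higher.
w = np.zeros(n_states)
for _ in range(10):
    if w @ mu_expert <= w @ mu_random:
        w += mu_expert - mu_random

print(w @ mu_expert > w @ mu_random)  # learned reward prefers expert behavior
```

Even granting the toy setup, note everything assumed away: known features, a linear reward, clean demonstrations, and a tractable environment.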

We might hope that this situation will change automatically as we build more sophisticated AI systems. But I don’t think that’s necessarily the case. “Prosaic AGI” is at the point where we can actually write down some code and say “maybe this would do superhuman RL, if you ran it with enough computing power and you fiddled with the knobs a whole bunch.” But these alignment proposals are nowhere near that point, and I don’t see any “known unknowns” that would let us quickly close the gap. (By construction, prosaic AGI doesn’t involve unknown unknowns.)

So if we found ourselves with prosaic AGI tomorrow, we’d be in the situation described in the last section, for as long as it took us to complete one of these research agendas (or to develop and then execute a new one). Like I said, I think this would probably be OK, but it opens up an unreasonably high chance of really bad outcomes.

3. Priorities

I think that prosaic AGI should probably be the largest focus of current research on alignment. In this section I’ll argue for that claim.

3a. Easy to start now

Prosaic AI alignment is especially interesting because the problem is nearly as tractable today as it would be if prosaic AGI were actually available.

Existing alignment proposals have only weak dependencies on most of the details we would learn while building prosaic AGI (e.g. model architectures, optimization strategies, variance reduction tricks, auxiliary objectives…). As a result, ignorance about those details isn’t a huge problem for alignment work. We may eventually reach the point where those details are critically important, but we aren’t there yet.

For now, finding any plausible approach to alignment that works for any setting of the unknown details would be a big accomplishment. With such an approach in hand we could start to ask how sensitive it is to the unknown details, but it seems premature to be pessimistic before even taking that first step.

Note that even in the extreme case where our approach to AI alignment would be completely different for different values of some unknown details, the speedup from knowing them in advance is at most 1/(probability of the most likely possibility). The most plausibly critical details are large-scale architectural decisions, for which there is a much smaller space of possibilities.
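The bound in the paragraph above can be sanity-checked with made-up numbers: even if alignment work only transfers when we guessed the right architecture, betting all effort on the most likely option preserves an expected fraction max(p) of that effort, so foreknowledge helps by at most a factor 1/max(p).

```python
# Hypothetical probabilities over which architecture turns out to be right.
p = [0.5, 0.3, 0.2]

# Work only on the most likely option: that work is useful with probability
# max(p), so knowing the answer in advance speeds us up by at most 1/max(p).
expected_useful_fraction = max(p)
speedup_bound = 1 / expected_useful_fraction
print(speedup_bound)  # -> 2.0
```

With a small space of plausible large-scale architectures, max(p) is not tiny, so the bound stays modest.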

3b. Importance

If we do develop prosaic AGI without learning a lot more about AI alignment, then I think it would be bad news (see section 2). Addressing alignment earlier, or having a clear understanding of why it is intractable, would make the situation a lot better.

I think the main way that an understanding of alignment could fail to be valuable is if it turns out that alignment is very easy. But in that case, we should also be able to solve it quickly now (or at least produce some candidate solution), and then we can move on to other things. So I don’t think “alignment is very easy” is a possibility that should keep us up at night.

Alignment for prosaic AGI in particular will be less important if we don’t actually develop prosaic AGI, but I think that this is a very big problem:

First, I think there is a reasonable chance (>10%) that we will build prosaic AGI. At this point there don’t seem to be convincing arguments against the possibility, and one of the lessons of the last 30 years is that learning algorithms and lots of computation/data can do surprisingly well compared to approaches that require understanding “how to think.”

Indeed, I think that if you had forced someone in 1990 to write down a concrete way that an AGI might work, they could easily have put 10–20% of their probability mass on the same cluster of possibilities that I’m currently calling “prosaic AGI.” And if you had asked them to guess what prosaic AGI would look like, I think that they could have given more like 20–40%.

Second, even if we don’t develop prosaic AGI, I think it is very likely that there will be important similarities between alignment for prosaic AGI and alignment for whatever kind of AGI we actually build. For example, whatever AGI we actually build is likely to exploit many of the same techniques that a prosaic AGI would, and to the extent that those techniques pose challenges for alignment we will probably have to deal with them one way or another.

I think that working with a concrete model that we have available now is one of the best ways to make progress on alignment, even in cases where we are sure that there will be at least one qualitative change in how we think about AI.

Third, I think that research on alignment is significantly more important in cases where powerful AI is developed relatively soon. And in these cases, the probability of prosaic AGI seems much higher. If prosaic AGI is possible, then I think there is a significant chance of building broadly human-level AGI over the next 10–20 years. I’d guess that hours of work on alignment are perhaps 10x more important if AI is developed in the next 15 years than if it is developed later, based on simple heuristics about diminishing marginal returns.

3c. Feasibility

Some researchers (especially at MIRI) believe that aligning prosaic AGI is probably infeasible — that the most likely approach to building an aligned AI is to understand intelligence in a much deeper way than we currently do, and that if we manage to build AGI before achieving such an understanding then we are in deep trouble.

I think that this shouldn’t make us much less enthusiastic about prosaic AI alignment:

First, I don’t think it’s reasonable to have a confident position on this question. Claims of the form “problem X can’t be solved” are really hard to get right, because you are fighting against the universal quantifier of all possible ways that someone could solve the problem. (This is very similar to the difficulty of saying “system X can’t be compromised.”) To the extent that there is any argument that aligning prosaic AGI is infeasible, that argument is nowhere near the level of rigor which would be compelling.

This implies, on the one hand, that it would be unwise to assign a high probability to the infeasibility of this problem. It implies, on the other hand, that even if the problem is infeasible, we should expect to be able to develop a substantially more complete understanding of why exactly it is so difficult.

Second, if this problem is actually infeasible, that is an extremely important fact with direct consequences for what we ought to do. It implies we will be unable to quickly play “catch up” on alignment after developing prosaic AGI, and so we would need to rely on coordination to prevent catastrophe. As a result:

  • We should start preparing for such coordination immediately.

  • It would be worthwhile for the AI community to substantially change its research direction in order to avoid catastrophe, even though this would involve large social costs.

I think we don’t yet have very strong evidence for the intractability of this problem.

If we could get very strong evidence, I expect it would have a significant effect on changing researchers’ priorities and on the research community’s attitude towards AI development. Realistically, it’s probably also a precondition for getting AI researchers to make a serious move towards an alternative approach to AI development, or to start talking seriously about the kind of coordination that would be needed to cope with hard-to-align AI.


I’ve claimed that prosaic AGI is conceivable, that it is a very appealing target for research on AI alignment, and that this gives us more reason to be enthusiastic about the overall tractability of alignment. For now, these arguments motivate me to focus on prosaic AGI.

This post was originally published here on 19th Nov 2016.

The next post in this sequence will be “Approval-directed agents: overview” by Paul Christiano, and will release on Thursday 22nd November.
