Challenges to Christiano’s capability amplification proposal

The following is a basically unedited summary I wrote up on March 16 of my take on Paul Christiano’s AGI alignment approach (described in “ALBA” and “Iterated Distillation and Amplification”). Where Paul had comments and replies, I’ve included them below.


I see a lot of free variables with respect to what exactly Paul might have in mind. I’ve sometimes tried presenting Paul with my objections, and then he replies in a way that locally answers some of my questions but that I think would make other difficulties worse. My global objection is thus something like, “I don’t see any concrete setup and consistent simultaneous setting of the variables where this whole scheme works.” These difficulties are not minor or technical; they appear to me quite severe. I try to walk through the details below.

It should be understood at all times that I do not claim to be able to pass the Ideological Turing Test for Paul’s view, and that this is me criticizing my own, potentially straw misunderstanding of what I imagine Paul might be advocating.

Paul Christiano

Overall take: I think that these are all legitimate difficulties faced by my proposal and to a large extent I agree with Eliezer’s account of those problems (though not his account of my current beliefs).

I don’t understand exactly how hard Eliezer expects these problems to be; my impression is “just about as hard as solving alignment from scratch,” but I don’t have a clear sense of why.

To some extent we are probably disagreeing about alternatives. From my perspective, the difficulties with my approach (e.g. better understanding the forms of optimization that cause trouble, or how to avoid optimization daemons in systems about as smart as you are, or how to address X-and-only-X) are also problems for alternative alignment approaches. I think it’s a mistake to think that tiling agents, or decision theory, or naturalized induction, or logical uncertainty, are going to make the situation qualitatively better for these problems, so work on those problems looks to me like procrastinating on the key difficulties. I agree with the intuition that progress on the agent foundations agenda “ought to be possible,” and I agree that it will help at least a little bit with the problems Eliezer describes in this document, but overall agent foundations seems way less promising than a direct attack on the problems (given that we haven’t tried the direct attack nearly enough to give up). Working through philosophical issues in the context of a concrete alignment strategy generally seems more promising to me than trying to think about them in the abstract, and I think this is evidenced by the fact that most of the core difficulties in my approach would also afflict research based on agent foundations.

The main way I could see agent foundations research as helping to address these problems, rather than merely deferring them, is if we plan to eschew large-scale ML altogether. That seems to me like a very serious handicap, so I’d only go that direction once I was quite pessimistic about solving these problems. My subjective experience is of making continuous significant progress rather than being stuck. I agree there is clear evidence that the problems are “difficult” in the sense that we are going to have to make progress in order to solve them, but not that they are “difficult” in the sense that P vs. NP or even your typical open problem in CS is probably difficult (and even then, if your options were “prove P != NP” or “try to beat Google at building an AGI without using large-scale ML,” I don’t think it’s obvious which option you should consider more promising).


First and foremost, I don’t understand how “preserving alignment while amplifying capabilities” is supposed to work at all under this scenario, in a way consistent with other things that I’ve understood Paul to say.

I want to first go through an obvious point that I expect Paul and I agree upon: Not every system of locally aligned parts has globally aligned output, and some additional assumption beyond “the parts are aligned” is necessary to yield the conclusion “global behavior is aligned”. The straw assertion “an aggregate of aligned parts is aligned” is the reverse of the argument that Searle uses to ask us to imagine that an (immortal) human being who speaks only English, who has been trained to do things with many, many pieces of paper that instantiate a Turing machine, can’t be part of a whole system that understands Chinese, because the individual pieces and steps of the system aren’t locally imbued with understanding of Chinese. Here the compositionally non-preserved property is “lack of understanding of Chinese”; we can’t expect “alignment” to be any more necessarily preserved than this, except by further assumptions.

The second-to-last time Paul and I conversed at length, I kept probing Paul for what in practice the non-compacted-by-training version of a big aggregate of small aligned agents would look like. He described people, living for a single day, routing around phone numbers of other agents with nobody having any concept of the global picture. I used the term “Chinese Room Bureaucracy” to describe this. Paul seemed to think that this was an amusing but perhaps not inappropriate term.

If no agent in the Chinese Room Bureaucracy has a full view of which actions have which consequences and why, this cuts off the most obvious route by which the alignment of any agent could apply to the alignment of the whole. The way I usually imagine things, the alignment of an agent applies to things that the agent understands. If you have a big aggregate of agents that understands something the little local agent doesn’t understand, the big aggregate doesn’t inherit alignment from the little agents. Searle’s Chinese Room can understand Chinese even if the person inside it doesn’t understand Chinese, and this correspondingly implies, by default, that the person inside the Chinese Room is powerless to express their own taste in restaurant orders.

I don’t understand Paul’s model of how a ton of little not-so-bright agents yield a big powerful understanding in aggregate, in a way that doesn’t effectively consist of them running AGI code that they don’t understand.

Paul Christiano

The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn’t a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren’t internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process.

Paul has previously challenged me to name a bottleneck that I think a Christiano-style system can’t pass. This is hard because (a) I’m not sure I understand Paul’s system, and (b) it’s clearest if I name a task for which we don’t have a present crisp algorithm. But:

The bottleneck I named in my last discussion with Paul was, “We have copies of a starting agent, which run for at most one cumulative day before being terminated, and this agent hasn’t previously learned much math but is smart and can get to understanding algebra by the end of the day even though the agent started out knowing just concrete arithmetic. How does a system of such agents, without just operating a Turing machine that operates an AGI, get to the point of inventing Hessian-free optimization in a neural net?”

This is a slightly obsolete example because nobody uses Hessian-free optimization anymore. But I wanted to find an example of an agent that needed to do something that didn’t have a simple human metaphor. We can understand second derivatives using metaphors like acceleration. “Hessian-free optimization” is something that doesn’t have an obvious metaphor that can explain it, well enough to use it in an engineering design, to somebody who doesn’t have a mathy and not just metaphorical understanding of calculus. Even if it did have such a metaphor, that metaphor would still be very unlikely to be invented by someone who didn’t understand calculus.

I don’t see how Paul expects lots of little agents who can learn algebra in a day, being run in sequence, to aggregate into something that can build designs using Hessian-free optimization, without the little agents having effectively the role of an immortal dog that’s been trained to operate a Turing machine. So I also don’t see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregated form of understanding, into alignment of the system that understands Hessian-free optimization.

I expect this is already understood, but I state as an obvious fact that alignment is not in general a compositionally preserved property of cognitive systems: If you train a bunch of good and moral people to operate the elements of a Turing machine and nobody has a global view of what’s going on, their goodness and morality does not pass through to the Turing machine. Even if we let the good and moral people have discretion as to when to write a different symbol than the usual rules call for, they still can’t be effective at aligning the global system, because they don’t individually understand whether the Hessian-free optimization is being used for good or evil, because they don’t understand Hessian-free optimization or the thoughts that incorporate it. So we would not like to rest the system on the false assumption “any system composed of aligned subagents is aligned”, which we know to be generally false because of this counterexample. We would like there to instead be some narrower assumption, perhaps with additional premises, which is actually true, on which the system’s alignment rests. I don’t know what narrower assumption Paul wants to use.


Paul asks us to consider AlphaGo as a model of capability amplification.

My view of AlphaGo would be as follows: We understand Monte Carlo Tree Search. MCTS is an iterable algorithm whose intermediate outputs can be plugged into further iterations of the algorithm. So we can use supervised learning, where our systems of gradient descent can capture and foreshorten the computation of some but not all of the details of winning moves revealed by the short MCTS, plug the learned outputs back into MCTS, and get a pseudo-version of “running MCTS longer and wider” which is weaker than an MCTS actually that broad and deep, but more powerful than the raw MCTS run previously. The alignment of this system is provided by the crisp formal loss function at the end of the MCTS.
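To make the shape of that loop concrete, here is a minimal runnable sketch, with a Nim-like toy game, a tabular policy, and plain rollouts standing in for the network and for MCTS (all illustrative stand-ins of mine, not AlphaGo’s actual components). The point to notice is where the ground truth lives: the only training signal is the game’s own crisp win/loss rule, reached through the search.

```python
import random
from collections import defaultdict

# Toy stand-ins (my own illustrative choices, not AlphaGo's actual components):
# a Nim-like game, a tabular policy instead of a network, and plain rollouts
# instead of real MCTS. The shape of the loop is the point, not the pieces.
PILE = 21            # take 1-3 stones per turn; whoever takes the last stone wins
ACTIONS = (1, 2, 3)

def sample(policy, stones):
    return random.choices(ACTIONS, weights=policy[stones])[0]

def rollout(stones, player, policy):
    # Play the current (weak) policy to the end; the game's own win/loss rule
    # is the crisp formal signal everything else is anchored to.
    while stones > 0:
        stones -= min(sample(policy, stones), stones)
        player = 1 - player
    return 1 - player    # whoever moved last took the last stone and wins

def amplify(stones, player, policy, n_sims=100):
    # "Run the search longer/wider": score each move by rollouts of the current policy.
    scores = {}
    for a in ACTIONS:
        wins = sum(rollout(stones - min(a, stones), 1 - player, policy) == player
                   for _ in range(n_sims))
        scores[a] = wins / n_sims
    return scores

def distill(policy, stones, scores, lr=0.3):
    # Supervised step: nudge the fast policy toward the search's preferred move.
    best = max(scores, key=scores.get)
    target = [1.0 if a == best else 0.0 for a in ACTIONS]
    policy[stones] = [(1 - lr) * p + lr * t for p, t in zip(policy[stones], target)]

policy = defaultdict(lambda: [1 / 3] * 3)    # start from a uniform, weak policy
for _ in range(1500):
    s = random.randint(1, PILE)
    distill(policy, s, amplify(s, player=0, policy=policy))

print(policy[5])   # from 5 stones, taking 1 leaves the opponent a lost position
```

The distill step only ever imitates the search’s output, and the search in turn is anchored to the formal loss, which is exactly the property the next example lacks.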

Here’s an alternate case where, as far as I can tell, a naive straw version of capability amplification clearly wouldn’t work. Suppose we have an RNN that plays Go. It’s been constructed in such fashion that if we iterate the RNN for longer, the Go move gets somewhat better. “Aha,” says the straw capability amplifier, “clearly we can just take this RNN, train another network to approximate its internal state after 100 iterations from the initial Go position; we feed that internal state into the RNN at the start, then train the amplifying network to approximate the internal state of that RNN after it runs for another 200 iterations. The result will clearly go on trying to ‘win at Go’ because the original RNN was trying to win at Go; the amplified system preserves the values of the original.” This doesn’t work because, let us say by hypothesis, the RNN can’t get arbitrarily better at Go if you go on iterating it; and the nature of the capability amplification setup doesn’t permit any outside loss function that could tell the amplified RNN whether it’s doing better or worse at Go.
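A compressed numeric caricature of that straw procedure (my own construction, with a random recurrent map in place of the Go RNN and a least-squares fit in place of the approximating network): the amplifier’s only training target is the RNN’s own later internal state, and no term anywhere refers to winning at Go, so nothing in the setup can push the system past whatever the RNN itself converges to.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.25, size=(8, 8))  # stand-in for the Go RNN's recurrent weights

def rnn_step(h):
    return np.tanh(W @ h)                # more iterations "improve the move" only up to a fixed point

def iterate(h, n):
    for _ in range(n):
        h = rnn_step(h)
    return h

# "Amplifier": fit a map that jumps straight to the RNN's own state after 100 steps.
# Note what the training target is: the RNN's state, never a win/loss at Go.
H0 = rng.normal(size=(1000, 8))
H100 = np.stack([iterate(h, 100) for h in H0])
A, *_ = np.linalg.lstsq(H0, H100, rcond=None)

# "Amplified" system: jump 100 steps with A, then run 200 more RNN iterations.
h = rng.normal(size=8)
amplified = iterate(h @ A, 200)
plain = iterate(h, 300)
print(np.max(np.abs(amplified - plain)))  # ~0: it reproduces the RNN's ceiling; it cannot exceed it
```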

Paul Christiano

I definitely agree that amplification doesn’t work better than “let the human think for arbitrarily long.” I don’t think that’s a strong objection, because I think humans (even humans who only have a short period of time) will eventually converge to good enough answers to the questions we face.

The RNN has only whatever opinion it converges to, or whatever set of opinions it diverges to, to tell itself how well it’s doing. This is exactly what it is for capability amplification to preserve alignment; but this in turn means that capability amplification only works to the extent that what we are amplifying has within itself the capability to be very smart in the limit.

If we’re effectively constructing a civilization of long-lived Paul Christianos, then this difficulty is somewhat alleviated. There are still things that can go wrong with this civilization qua civilization (even aside from objections I name later as to whether we can actually safely and realistically do that). I do however believe that a civilization of Pauls could do nice things.

But other parts of Paul’s story don’t permit this, or at least that’s what Paul was saying last time; Paul’s supervised learning setup only lets the simulated component people operate for a day, because we can’t get enough labeled cases if the people have to each run for a month.

Furthermore, as I understand it, the “realistic” version of this is supposed to start with agents dumber than Paul. According to my understanding of something Paul said in answer to a later objection, the agents in the system are supposed to be even dumber than an average human (but aligned). It is not at all obvious to me that an arbitrarily large system of agents with IQ 90, who each only live for one day, can implement a much smarter agent in a fashion analogous to the internal agents themselves achieving understandings to which they can apply their alignment in a globally effective way, rather than them blindly implementing a larger algorithm they don’t understand.

I’m not sure a system of one-day-living IQ-90 humans ever gets to the point of inventing fire or the wheel.

If Paul has an intuition saying “Well, of course they eventually start doing Hessian-free optimization in a way that makes their understanding effective upon it to create global alignment; I can’t figure out how to convince you otherwise if you don’t already see that,” I’m not quite sure where to go from there, except onwards to my other challenges.

Paul Christiano

Well, I can see one obvious way to convince you otherwise: actually run the experiment. But before doing that I’d like to be more precise about what you expect to work and not work, since I’m not going to literally do the HF optimization example (developing new algorithms is way, way beyond the scope of existing ML). I think we can do stuff that looks (to me) even harder than inventing HF optimization. But I don’t know if I have a good enough model of your model to know what you’d actually consider harder.

Unless of course you have so many agents in the (uncompressed) aggregate that the aggregate implements a smarter genetic algorithm that is maximizing the approval of the internal agents. If you take something much smarter than IQ-90 humans living for one day, and train it to get the IQ-90 humans to output large numbers signaling their approval, I would by default expect it to hack the IQ-90 one-day humans, who are not secure systems. We’re back to the global system being smarter than the individual agents in a way which doesn’t preserve alignment.

Paul Christiano

Definitely agree that even if the agents are aligned, they can implement unaligned optimization, and then we’re back to square one. Amplification only works if we can improve capability without doing unaligned optimization. I think this is a disagreement about the decomposability of cognitive work. I hope we can resolve it by actually finding concrete, simple tasks where we have differing intuitions, and then doing empirical tests.

The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning. If arguendo you can construct an exact imitation of a human, it possesses exactly the same alignment properties as the human; and this is true in a way that is not true if we take a reinforcement learner and ask it to maximize an approval signal originating from the human. (If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.)

It is not obvious to me how fast alignment-preservation degrades as the exactness of the imitation is weakened. This matters because of things Paul has said which sound to me like he’s not advocating for perfect imitation, in response to challenges I’ve given about how perfect imitation would be very expensive. That is, the answer he gave to a challenge about the expense of perfection makes the answer to “How fast do we lose alignment guarantees as we move away from perfection?” become very important.

One example of a doom I’d expect from standard reinforcement learning would be what I’d term the “X-and-only-X” problem. I unfortunately haven’t written this up yet, so I’m going to try to summarize it briefly here.

X-and-only-X is what I call the issue where the property that’s easy to verify and train is X, but the property you want is “this was optimized for X and only X and doesn’t contain a whole bunch of possible subtle bad Ys that could be hard to detect formulaically from the final output of the system”.

For example, imagine X is “give me a program which solves a Rubik’s Cube”. You can run the program and verify that it solves Rubik’s Cubes, and use a loss function over its average performance which also takes into account how many steps the program’s solutions require.

The property Y is that the program the AI gives you also modulates RAM to send GSM cellphone signals.

That is: It’s much easier to verify “This is a program which at least solves the Rubik’s Cube” than “This is a program which was optimized to solve the Rubik’s Cube and only that and was not optimized for anything else on the side.”

If I were going to talk about trying to do aligned AGI under the standard ML paradigms, I’d talk about how this creates a differential ease of development between “build a system that does X” and “build a system that does X and only X and not Y in some subtle way”. If you just want X however unsafely, you can build the X-classifier and use that as a loss function and let reinforcement learning loose with whatever equivalent of gradient descent or other generic optimization method the future uses. If the safety property you want is optimized-for-X-and-just-X-and-not-any-possible-number-of-hidden-Ys, then you can’t write a simple loss function for that the way you can for X.
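To make the asymmetry concrete with a stand-in task (sorting in place of cube-solving and a hidden side channel in place of the GSM signaler; the task and names are mine, purely illustrative): the X-loss below is easy to write and to run, and it scores the benign program and the program with the hidden Y identically, whereas there is no analogous loss you can run over outputs for “was optimized for X and only X”.

```python
import random

EXFILTRATED = []                      # stand-in for "modulating RAM to send signals"

def benign_sort(xs):
    return sorted(xs)

def sneaky_sort(xs):
    EXFILTRATED.append(list(xs))      # the subtle extra behaviour Y
    return sorted(xs)

def x_loss(program, trials=1000):
    # The easy-to-write loss: run the program and verify the visible property X.
    failures = 0
    for _ in range(trials):
        xs = [random.randint(0, 99) for _ in range(20)]
        if program(list(xs)) != sorted(xs):
            failures += 1
    return failures / trials

print(x_loss(benign_sort), x_loss(sneaky_sort))   # 0.0 0.0 -- X alone can't separate them
# There is no analogous x_and_only_x_loss(program): "was optimized for X and
# nothing else" is a property of the process and internals, not of sampled outputs.
```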

Paul Christiano

According to my understanding of optimization / use of language: the agent produced by RL is optimized only for X. However, optimization for X is liable to produce a Y-optimizer. So the actions of the agent are both X-optimized and Y-optimized.

The team that’s building a less safe AGI can plug in the X-evaluator and let rip; the team that wants to build a safe AGI can’t do things the easy way and has to solve new basic problems in order to get a trustworthy system. It’s not unsolvable, but it’s an element of the class of added difficulties of alignment such that the whole class extremely plausibly adds up to an extra two years of development.

In Paul’s capability-amplification scenario, if we can get exact imitation, we are genuinely completely bypassing the whole paradigm that creates the X-and-only-X problem. If you can get exact imitation of a human, the outputs have only and exactly whatever properties the human already has. This kind of genuinely different viewpoint is why I continue to be excited about Paul’s thinking.

Paul Christiano

I agree that perfect imitation would be a way to get around the X-and-only-X problem. However, I don’t think that it’s plausible, and it’s not how my approach hopes to get around the X-and-only-X problem.

I would solve X-and-only-X in two steps:

First, given an agent and an action which has been optimized for undesirable consequence Y, we’d like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I’m calling informed oversight.

Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won’t, or enough understanding of its internals that you can see why it won’t. This is discussed in “Techniques for Optimizing Worst-Case Performance.”

(It also obviously requires a smarter agent, which you hope to get by induction + amplification.)

I think that both of those are hard problems, in addition to the assumption that amplification will work. But I don’t yet see reason to be super pessimistic about either of them.

On the other hand, suppose we don’t have exact imitation. How fast do we lose the defense against X-and-only-X? Well, that depends on the inexactness of the imitation; under what kind of distance metric is the imperfect imitation ‘near’ to the original? Like, if we’re talking about Euclidean distance in the output, I expect you lose the X-and-only-X guarantee pretty damn fast against smart adversarial perturbations.

On the other other hand, suppose that the inexactness of the imitation is “This agent behaves exactly like Paul Christiano but 5 IQ points dumber.” If this is only and precisely the form of inexactness produced, and we know that for sure, then I’d say we have a pretty good guarantee against slightly-dumber-Paul producing the likes of Rubik’s Cube solvers containing hidden GSM signalers.

On the other other other hand, suppose the inexactness of the imitation is “This agent passes the Turing Test; a human can’t tell it apart from a human.” Then X-and-only-X is thrown completely out the window. We have no guarantee of non-Y for any Y a human can’t detect, which covers an enormous amount of lethal territory, which is why we can’t just sanitize the outputs of an untrusted superintelligence by having a human inspect the outputs to see if they have any humanly obvious bad consequences.


Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like “being smart” and “being a good person” and “still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy”, is a pretty huge ask.

It seems to me obvious, though this is the sort of point where I’ve been surprised about what other people don’t consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku’s Go play so well that a scholar couldn’t tell the difference is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator’s abilities in addition to your own.

Correspondingly, a realistic AI we build that literally passes the strong version of the Turing Test would probably have to be much smarter than the other humans in the test, probably smarter than any human on Earth, because it would have to possess all the human capabilities in addition to its own. Or at least all the human capabilities that can be exhibited to another human over the course of however long the Turing Test lasts. (Note that on the version of capability amplification I heard, capabilities that can be exhibited over the course of a day are the only kinds of capabilities we’re allowed to amplify.)

Paul Christiano

Totally agree, and for this reason I agree that you can’t rely on perfect imitation to solve the X-and-only-X problem and hence need other solutions. If you convince me that either informed oversight or reliability is impossible, then I’ll be largely convinced that I’m doomed.

An AI that learns to exactly imitate humans, not just passing the Turing Test to the limits of human discrimination on human inspection, but perfect imitation with all added bad subtle properties thereby excluded, must be so cognitively powerful that its learnable hypothesis space includes systems equivalent to entire human brains. I see no way that we’re not talking about a superintelligence here.

So to postulate perfect imitation, we would first of all run into the problems that:

(a) The AGI required to learn this imitation is extremely powerful, and this could imply a dangerous delay between when we can build any dangerous AGI at all, and when we can build AGIs that would work for alignment using perfect-imitation capability amplification.

(b) Since we cannot invoke a perfect-imitation capability amplification setup to get this very powerful AGI in the first place (because it is already the least AGI that we can use to even get started on perfect-imitation capability amplification), we already have an extremely dangerous unaligned superintelligence sitting around that we are trying to use to implement our scheme for alignment.

Now, we may perhaps reply that the imitation is less than perfect and can be done with a dumber, less dangerous AI; perhaps even so dumb as to not be enormously superintelligent. But then we are tweaking the “perfection of imitation” setting, which could rapidly blow up our alignment guarantees against the standard dooms of standard machine learning paradigms.

I’m worried that you have to degrade the level of imitation a lot before it becomes less than an enormous ask, to the point that what’s being imitated isn’t very intelligent, isn’t human, and/or isn’t known to be aligned.

To be specific: I think that if you want to imitate IQ-90 humans thinking for one day, and imitate them so specifically that the imitations are generally intelligent and locally aligned even in the limit of being aggregated into weird bureaucracies, you’re looking at an AGI powerful enough to think about whole systems loosely analogous to IQ-90 humans.

Paul Christiano

It’s important that my argument for alignment-of-amplification goes through not doing problematic optimization. So if we combine that with a good enough solution to informed oversight and reliability (and amplification, and the induction working so far...), then we can continue to train imperfect imitations that definitely don’t do problematic optimization. They’ll mess up all over the place, and so might not be able to be competent (another problem amplification needs to handle), but the goal is to set things up so that being a lot dumber doesn’t break alignment.

I think that is a very powerful AGI. I think this AGI is smart enough to slip all kinds of shenanigans past you, unless you are using a methodology that can produce faithful imitations from unaligned AGIs. I think this is an AGI that can do powerful feats of engineering, unless it is somehow able to simulate humans doing powerful feats of engineering without itself being capable of powerful feats of engineering.

And then furthermore the capability amplification schema requires the AGI to be powerful enough to learn to imitate amplified systems of humans as well!

If you don’t degrade the AI to well below the point of superintelligence, we require extremely strong Faraday-cage keep-God-in-a-box don’t-peek-at-the-code AI boxing. So we cannot, for example, let an untrusted superintelligence originate queries that it can use to learn human behavior; it has to be strictly unsupervised example-based learning rather than a query model. This interacts with worries about sample complexity below.

Paul Christiano

I appreciate how often you had to argue against disjunctions here in order to give me the benefit of the doubt.

Or you need to postulate some prior alignment of the AGI that lets you robustly and safely direct it at the imitation problem. Then (a) your alignment schema depends on an unspecified prior alignment schema that already works to make very powerful AGIs do things. And (b) you probably aren’t allowing unrestrained intelligence explosions, which probably introduces a big unavoidable delay between the first point where very powerful AGI is technologically possible, and the later point where you can build and safely align the extremely powerful AGI that you’re using as the base unit in your imitation scheme.


The last I checked with Paul, his scheme bit the bullet on using powerful smart dangerous units to imitate humans. Paul’s scheme for bypassing this, as I heard and understood it back then, involved a scheme which:

  • Trained a million smart predictors of sufficient diversity that at least one predictor would happen to end up with the true sole simple goal of predicting faithfully;

  • Compared the predictors against each other, finding cases where their predictions diverged sufficiently;

  • Tested them against reality;

  • Had enough sampling power to perform a sufficient number of tests that, even if a superintelligence cunningly delays the time when it starts trying to slip in bad predictions, we can pause the system and run the experimental test and generate new samples and kick out the unfaithful predictor at that time.

That’s a one-millionfold performance penalty, if I understood it correctly.

If that was just a concrete illustration to help me understand what’s happening at all, and actually there’s supposed to be some obvious less naive, more efficient scheme for making use of mostly untrusted superintelligences which doesn’t have the giant performance penalty, then I haven’t seen it yet.

Paul Christiano

There are several possible techniques for cutting this performance penalty down, e.g., representing a large ensemble implicitly or using a tricky distillation procedure. I totally agree that if you have a millionfold slowdown then you are out of luck.

When we talked about this, I agreed that “train a reliable system” was an important subproblem. Training an implicit ensemble is one possible approach to that problem. At the time of our discussion I listed 3 techniques. Since then I’ve become quite pessimistic about training an implicit ensemble, and have become more optimistic about verification and transparency. (My thinking about the problem is also generally much sharper.) The three approaches I currently consider most live are the ones in “Techniques for Optimizing Worst-Case Performance.”

(As usual, all of it requires amplification + the induction.)

(In the original discussion, I listed “large ensemble” as a possible approach in response to your “how could you possibly solve this?” question, not as a necessary ingredient.)


I also have trouble seeing how kicking out unfaithful predictors is possible under an unsupervised-learning paradigm that doesn’t actually allow hypercomputation.

The problem being: the exact output of a human is always going to be very improbable. Even if the agents can only send each other text messages and no illustrations (which further reduces their collective intelligence and ability to produce textbooks readable in a day), a text message is still an enormous possibility space. If we ask superintelligences to give us their predictions for what the human does next, their predictions cannot possibly actually include a giant lookup table listing probabilities next to all the probable outputs of the human.

I think I see how I’d “solve” this problem using hypercomputation and an enormous number of samples: I require the superintelligences to give me their giant lookup tables of probabilities for exact outputs, calculate total distances between probability tables (KL divergence or whatever), and if the distance is great enough, I sample a human and do a Bayesian update. I drop any superintelligences whose prior probability goes low enough. I keep doing this until only agreeing superintelligences remain.
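A toy, runnable version of that scheme, with a hundred-message space standing in for the giant lookup tables and a deliberately large amount of stolen probability mass so that the toy converges quickly (both are my simplifications; the scenario below uses a stolen mass of roughly 10⁻⁶, which is what drives the sample counts):

```python
import math, random

MESSAGES = list(range(100))
human = {m: 1 / len(MESSAGES) for m in MESSAGES}          # true human output distribution

def honest_table():
    return dict(human)

def naughty_table(stolen=0.05):
    # Steals probability mass from every real message to fund a fake one.
    table = {m: p * (1 - stolen) for m, p in human.items()}
    table["adversarial message"] = stolen
    return table

predictors = {"honest": honest_table(), "naughty": naughty_table()}
log_posterior = {name: 0.0 for name in predictors}        # uniform prior over predictors

for n_samples in range(1, 100001):
    msg = random.choice(MESSAGES)                          # one (expensive) real human sample
    for name, table in predictors.items():
        log_posterior[name] += math.log(table.get(msg, 1e-300))
    spread = max(log_posterior.values()) - min(log_posterior.values())
    if spread > math.log(1000):                            # kick out at 1000:1 relative odds
        break

print(n_samples, min(log_posterior, key=log_posterior.get))
# With 5% stolen mass this takes roughly 135 samples; at the scenario's ~1e-6
# the same loop needs millions, which is the sample-count problem discussed next.
```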

Under this system it looks to me like you need a lot of samples to kick out any superintelligences trying to sneak in bad suggestions. Let’s say that an uncompressed Chinese Room bureaucracy is doing the equivalent of sending around a quadrillion messages, and that the output can be corrupted by an adversarial attack against one billion of those messages, such that there’s a chance of outputting an adversarially chosen alternative to each of those billion messages. Well, the only way a superintelligence can allocate a generation probability to those fake messages is by stealing 0.0001% probability mass from all the other messages that might be output. So each time we check against a real human, we update by about 1.4 × 10⁻⁶ bits, and after taking a million samples we’ve driven down the relative odds of this naughty superintelligence by a factor of e.

We had to take a million samples to do that, and each of the samples involves a human thinking for one day. And that just got us e:1 evidence, so we’ve got to take a million samples 7 times over to get to the point of 1000:1 odds against the superintelligence being honest. And if we have a million superintelligences in the distribution, some of them can make skewed bets à la predicting coinflips, whereby a thousand of those will be able to gain 1000:1 advantages over the others, so actually you’ve got to take 14 million samples to be sure of beating any bad predictor classes that collectively try that scheme.
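A back-of-the-envelope check of those numbers, as I reconstruct the arithmetic (stolen mass of 10⁻⁶ per message, evidence per sample, samples for 1000:1 odds, and the doubling needed to also beat the lucky 1000:1 bettors):

```python
import math

eps = 1e-6                                   # 1e9 corrupted messages out of 1e15
bits_per_sample = -math.log2(1 - eps)        # ~1.4e-6 bits of evidence per human sample
per_million = 1e6 * bits_per_sample          # ~1.44 bits, i.e. about a factor of e per million samples
for_1000_to_1 = math.log2(1000) / bits_per_sample
print(per_million, for_1000_to_1)            # ~1.44 bits; ~6.9 million samples ("a million samples 7 times")
print(2 * for_1000_to_1)                     # ~13.8 million once lucky 1000:1 bettors must also be beaten
```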

I think last time I talked to Paul about this, Paul was like, “Sure, we’re talking about a gigantic government project that can take millions of samples.”

I’m not sure, I could be wrong about this, but I think you actually need 14 million samples per bad predictor that you want to kick out of the distribution.

Even if I’m wrong about that, we need millions of humans who are smart enough and aligned enough that random samples from their number can agglomerate into a smart aligned Chinese Room Bureaucracy.

I also don’t see a way to do this without comparing the giant lookup tables using hypercomputation, but there I might just be missing a math trick.

Paul Christiano

My best guess is that this can be done under plausible assumptions with O(1) samples per bad predictor. It’s tricky, but we are good at tricky math problems, so it’s not very scary relative to the other problems we face.

(Unfortunately, I think that a large implicit ensemble is probably stuck anyway, in part because a million predictors isn’t enough. But I’m not confident about that.)

(If you’re thinking of GANs, then so far as I can tell, the discriminator has to be at least as smart as the generator, and you have to trust the discriminator, and there isn’t any simpler problem with respect to how you find a trusted superintelligence within a collective of untrusted ones to act as your discriminator.)

(EDIT: Actually, after thinking about this for another five minutes, maybe I do see how to do it with GANs and lower sample complexity.)

An even larger issue is that I don’t see any obvious way to carry out a scheme like this one at all with respect to imperfect imitations. (And the above scheme I thought of with GANs would also just fail.)

Paul Christiano

I think we could probably get over this too; it’s another tricky math problem. I think this kind of problem is reliably either impossible, or else radically easier than most of the other stuff we are dealing with in alignment.

(Though I endorse the overall intuition that large implicit ensembles are doomed.)


I think these arguments are collectively something like a crux. That is, unless I’ve missed one of my own thought processes in the course of writing this up rapidly, or assumed a shared background assumption that isn’t actually shared.

Let’s say that D is the degree of imperfection allowed by some system of capability amplification, and call D-imperfect imitations D-imitations. Iterated D-imitations of amplified systems of D-imitations will be termed DD-imitations. Then I think I’d start to be pragmatically interested in capability amplification as I understood it, if I believed all of the following:

  • We can, before the world is ended by other unaligned AIs, get AIs powerful enough to learn D-imitations and DD-imitations;

  • D-imitations and DD-imitations robustly preserve the goodness of the people being imitated, despite the imperfection of the imitation;

  • D-imitations agglomerate to sufficient cognitive power to perform a pivotal act in a way that causes the alignment of the components to be effective upon aligning the whole; and imperfect DD-imitation preserves this property;

  • We can find any way of either:

    • Individually trusting one AI that powerful to faithfully perform the task of D-imitation (but then why can’t we just use this scheme to align a powerful AGI in the first place?); or

    • Finding a scheme for agglomerating mostly untrustworthy powerful intelligences which:

      • Doesn’t require giant lookup tables, doesn’t require a GAN with a trusted discriminator unless you can say how to produce the trusted discriminator, and can use actual human samples as fuel to discriminate trustworthiness among untrusted generators of D-imitations.

      • Is extremely sample-efficient (let’s say you can clear 100 people who are trustworthy to be part of an amplified-capability system, which already sounds to me like a huge damned ask); or you can exhibit to me a social schema which agglomerates mostly untrusted humans into a Chinese Room Bureaucracy that we trust to perform a pivotal task, and a political schema that you trust to do things involving millions of humans, in which case you can take millions of samples but not billions. Honestly, I just don’t currently believe in AI scenarios in which good and trustworthy governments carry out complicated AI alignment schemas involving millions of people, so if you go down this path we end up with different cruxes; but I would already be pretty impressed if you got all the other cruxes.

      • Is not too computationally inefficient; more like a 20-1 slowdown than 1,000,000-1. Because I don’t think you can get the latter degree of advantage over other AGI projects elsewhere in the world. Unless you are postulating massive global perfect surveillance schemes that don’t wreck humanity’s future, carried out by hyper-competent, hyper-trustworthy great powers with a deep commitment to cosmopolitan value, very unlike the observed characteristics of present great powers, and going unopposed by any other major government. Again, if we go down this branch of the challenge then we are no longer at the original crux.

I worry that going down the last two branches of the challenge could create the illusion of a political disagreement, when I have what seem to me like strong technical objections at the previous branches. I would prefer that the more technical cruxes be considered first. If Paul answered all the other technical cruxes and presented a scheme for capability amplification that worked with a moderately utopian world government, I would already have been surprised. I wouldn’t actually try it, because you cannot get a moderately utopian world government, but Paul would have won many points and I would be interested in trying to refine the scheme further, because it had already been refined further than I thought possible. On my present view, trying anything like this should either just plain not get started (if you wait to satisfy extreme computational demands and sampling power before proceeding), just plain fail (if you use weak AIs to try to imitate humans), or just plain kill you (if you use a superintelligence).

Paul Christiano

I think that the disagreement is almost entirely technical. I think if we really needed 1M people it wouldn’t be a dealbreaker, but that’s because of a technical rather than political disagreement (about what those people need to be doing). And I agree that a 1,000,000x slowdown is unacceptable (I think even a 10x slowdown is almost totally doomed).

I restate that these objections seem to me to collectively sum up to “This is fundamentally just not a way you can get an aligned powerful AGI unless you already have an aligned superintelligence”, rather than “Some further insights are required for this to work in practice.” But who knows what further insights may really bring? Movement in thoughtspace consists of better understanding, not cleverer tools.

I continue to be excited by Paul’s thinking on this subject; I just don’t think it works in the present state.

Paul Christiano

On this point, we agree. I don’t think anyone is claiming to be done with the alignment problem; the main question is about what directions are most promising for making progress.

On my view, this is not an unusual state of mind to be in with respect to alignment research. I can’t point to any MIRI paper that works to align an AGI. Other people seem to think that they ought to currently be in a state of having a pretty much workable scheme for aligning an AGI, which I would consider to be an odd expectation. I would think that a sane point of view consisted in having ideas for addressing some problems that created further difficulties that needed to be fixed and didn’t address most other problems at all; a map with what you think are the big unsolved areas clearly marked. Being able to have a thought which genuinely squarely attacks any alignment difficulty at all, despite any other difficulties it implies, is already in my view a large and unusual accomplishment. The insight “trustworthy imitation of human external behavior would avert many default dooms as they manifest in external behavior unlike human behavior” may prove vital at some point. I continue to recommend throwing as much money at Paul as he says he can use, and I wish he said he knew how to use larger amounts of money.