Paul’s research agenda FAQ

I think Paul Christiano’s research agenda for the alignment of superintelligent AGIs presents one of the most exciting and promising approaches to AI safety. After being very confused about Paul’s agenda, chatting with others about similar confusions, and clarifying with Paul many times over, I’ve decided to write a FAQ addressing common confusions around his agenda.

This FAQ is not intended to provide an introduction to Paul’s agenda, nor is it intended to provide an airtight defense. This FAQ only aims to clarify commonly misunderstood aspects of the agenda. Unless otherwise stated, all views are my own views of Paul’s views. (ETA: Paul does not have major disagreements with anything expressed in this FAQ. There are many small points he might have expressed differently, but he endorses this as a reasonable representation of his views. This is in contrast with previous drafts of this FAQ, which did contain serious errors he asked to have corrected.)

For an introduction to Paul’s agenda, I’d recommend Ajeya Cotra’s summary. For good prior discussion of his agenda, I’d recommend Eliezer’s thoughts, Jessica Taylor’s thoughts (here and here), some posts and discussions on LessWrong, and Wei Dai’s comments on Paul’s blog. For most of Paul’s writings about his agenda, visit ai-alignment.com.

0. Goals and non-goals

0.1: What is this agenda trying to accomplish?

Enable humans to build arbitrarily powerful AGI assistants that are competitive with unaligned AGI alternatives, and only try to help their operators (and in particular, never attempt to kill or manipulate them).

People often conceive of safe AGIs as silver bullets that will robustly solve every problem that humans care about. This agenda is not about building a silver bullet; it’s about building a tool that will safely and substantially assist its operators. For example, this agenda does not aim to create assistants that can do any of the following:

  • Prevent nuclear wars

  • Prevent evil dictatorships

  • Make centuries’ worth of philosophical progress

  • Effectively negotiate with distant superintelligences

  • Solve the value specification problem

On the other hand, to the extent that humans care about these things and could make them happen, this agenda lets us build AGI assistants that can substantially assist humans in achieving them. For example, a team of 1,000 competent humans working together for 10 years could make substantial progress on preventing nuclear wars or solving metaphilosophy. Unfortunately, it’s slow and expensive to assemble a team like this, but an AGI assistant might enable us to reap similar benefits in far less time and at much lower cost.

(See Clarifying “AI Alignment” and Directions and desiderata for AI alignment.)

0.2: What are examples of ways in which you imagine these AGI assistants getting used?

Two countries end up in an AGI arms race. Both countries are aware of the existential threats that AGIs pose, but also don’t want to limit the power of their AIs. They build AGIs according to this agenda, which stay under the operators’ control. These AGIs then help the operators broker an international treaty, which ushers in an era of peace and stability. During this era, foundational AI safety problems (e.g. those in MIRI’s research agenda) are solved in earnest, and a provably safe recursively self-improving AI is built.

A more pessimistic scenario is that the countries wage war, and the side with the more powerful AGI achieves a decisive victory and establishes a world government. This scenario isn’t as good, but it at least leaves humans in control (instead of extinct).

The most pressing problem in AI strategy is how to stop an AGI race to the bottom from killing us all. Paul’s agenda aims to solve this specific aspect of the problem. That isn’t an existential win, but it does represent a substantial improvement over the status quo.

(See section “2. Competitive” in Directions and desiderata for AI alignment.)

0.3: But this might lead to a world dictatorship! Or a world run by philosophically incompetent humans who fail to capture most of the possible value in our universe! Or some other dystopia!

Sure, maybe. But that’s still better than a paperclip maximizer killing us all.

There is a social/political/philosophical question about how to get humans in a post-AGI world to claim a majority of our cosmic endowment (which includes, among other things, not establishing a tyrannical dictatorship under which intellectual progress halts). While technical AI safety does make progress on this question, it’s a broader question overall that invites fairly different angles of attack (e.g. policy interventions and social influence). And while this question is extremely important, it is a separate question from how you can build arbitrarily powerful AGIs that stay under their operators’ control, which is the only question this agenda is trying to answer.

1. Alignment

1.1 How do we get alignment at all?

(“Alignment” is an imprecise term meaning “nice” / “not subversive” / “trying to actually help its operator”. See Clarifying “AI alignment” for Paul’s description.)

1.1.1: Isn’t it really hard to give an AI our values? Value learning is really hard, and the default is for the AI to encounter instrumental incentives to manipulate you or prevent itself from getting shut down.

The AI isn’t learning our values; it’s learning to optimize for our short-term approval. In other words, for each action it takes, it optimizes for something like the rating we’d give it on a scale from 1 to 5 if we just saw it act.
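
To make this concrete, here’s a toy sketch of approval-directed action selection (my own illustration, not code from Paul’s agenda; the ApprovalModel class, the 1-to-5 scale, and the neutral prior are all assumptions made for the example):

```python
# A minimal, illustrative sketch of approval-directed action selection
# (a toy construction for this FAQ, not a specification from Paul's agenda).

class ApprovalModel:
    """Toy model: remembers average operator ratings (1-5) per (state, action)."""
    def __init__(self):
        self.ratings = {}  # (state, action) -> list of observed ratings

    def update(self, state, action, rating):
        self.ratings.setdefault((state, action), []).append(rating)

    def predict(self, state, action):
        seen = self.ratings.get((state, action))
        return sum(seen) / len(seen) if seen else 3.0  # assumed neutral prior

def choose_action(state, candidate_actions, model):
    # Act-based: pick the action the operator would rate highest right now,
    # rather than the action that best serves some learned long-term goal.
    return max(candidate_actions, key=lambda a: model.predict(state, a))
```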

It’s hard to learn the nuances of human values. But from a machine learning perspective, it’s very easy to learn that humans would strongly disapprove of attempts to kill or manipulate us. Paul expresses this well on his blog:

You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.

So in order to get bad outcomes here you have to really mess up your model of what humans want (or more likely mess up the underlying framework in an important way). If we imagine a landscape of possible interpretations of human preferences, there is a “right” interpretation that we are shooting for. But if you start with a wrong answer that is anywhere in the neighborhood, you will do things like “ask the user what to do, and don’t manipulate them.” And these behaviors will eventually get you where you want to go.

(See: Approval-directed agents and Act-based agents.)

1.1.2: OK, but doesn’t this only incentivize it to appear like it’s doing what the operator wants? Couldn’t it optimize for hijacking its reward signal, while seeming to act in ways that humans are happy with?

We’re not just training the agent to take good actions. We’re also training it to comprehensibly answer questions about why it took the actions it took, to arbitrary levels of detail. (Imagine a meticulous boss grilling an employee about a report he put together, or a tax auditor grilling a corporation about the minutiae of its expenses.) We ensure alignment by randomly performing thorough evaluations of its justifications for its actions, and punishing it severely if any of those justifications seem subversive. To the extent that we trust these justifications to accurately reflect the agent’s cognition, we can trust the agent not to act subversively (and thus to be aligned).
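
As a toy illustration of what this training signal might look like (my own sketch; the audit probability, the penalty size, and the audit_for_subversion function are made-up parameters for the example, not numbers from Paul):

```python
import random

# Toy sketch of oversight with random, thorough audits (illustrative only).
AUDIT_PROBABILITY = 0.01      # thorough evaluations are expensive, so they're rare
SUBVERSION_PENALTY = -1_000   # but getting caught is catastrophic for the agent

def reward(action_quality, justification, audit_for_subversion):
    """Combine the ordinary approval-based reward with a rare but severe audit."""
    r = action_quality
    if random.random() < AUDIT_PROBABILITY:
        # Thoroughly grill the agent's justification for this action.
        if audit_for_subversion(justification):
            r += SUBVERSION_PENALTY
    return r
```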

(See: The informed oversight problem.)

1.1.3: How do we train it to answer questions comprehensively?

Reward it for doing so, and punish it for failing to do so.

Imagine being a meticulous boss who asks his employee to put together a report. Imagine grilling him about the report, and punishing him every time he fails to answer your questions clearly enough or at a satisfactory level of detail, in addition to punishing him for low-quality reports. If your employee is capable enough, he’ll eventually learn to produce high-quality reports and answer questions satisfactorily when you grill him.

Note that the ability to generate comprehensible descriptions of arbitrary cognition is a major open problem in Paul’s agenda, and also a major problem in AI safety broadly. It’s the part of his agenda that Paul is currently focusing most on.

(See The informed oversight problem, Learning representations, and Approval-maximising representations.)

1.1.4: Why should we expect the agent’s answers to correspond to its cognition at all?

We don’t actually have any guarantees that they do, but giving honest answers is probably the easiest way for the agent to maximize its reward. The only feasible alternative is giving thorough but dishonest explanations for its actions. But if an employee lies about how he arrived at his conclusions, his explanations won’t check out, and he might get punished severely. And if a corporation lies about how it spent its money, the numbers won’t add up, and its auditor might punish it severely.

It’s critically important for this scheme that the overseer is capable of evaluating the assistant’s justifications extremely thoroughly, and does so often enough. Corporations cook their books in proportion to how thorough their auditors are, how common audits are, and how bad it is to get caught by their auditors. If we audit thoroughly enough and frequently enough, and punish our assistants severely enough for subversion, we should expect them to answer honestly.

1.1.5: Wouldn’t this incentivize the assistant to produce justifications that seem fine and innocent to you, but may actually be harmful?

We should expect the AI not to adversarially manipulate us—by assumption, we’re evaluating its justifications thoroughly enough that we should be able to catch any subversive cognition.

It’s possible that there are free variables in what sorts of cognition the operator deems acceptable, and that a non-adversarial optimization process might be able to persuade the operator of essentially arbitrary conclusions by manipulating these free variables in ways the operator approves of. For example, an AGI assistant might accidentally persuade you to become an ISIS suicide bomber, while only thinking in ways that you approve of.

I do think this is a potentially severe problem. But I don’t consider it a dealbreaker, for a number of reasons:

  • An AGI assistant “accidentally” manipulating you is no different from a very smart and capable human assistant who, in the process of assisting you, causes you to believe drastic and surprising conclusions. Even if this might lead to bad outcomes, Paul isn’t aiming for his agenda to prevent this class of bad outcomes.

  • The more rational you are, the smaller the space of conclusions you can be non-adversarially led into believing. (For example, it’s very hard for me to imagine myself getting persuaded into becoming an ISIS suicide bomber by a process whose cognition I approve of.) It might be that some humans have passed a rationality threshold, such that they only end up believing correct conclusions after thinking for a long time without adversarial pressures.

1.2 Amplifying and distilling alignment

1.2.1: OK, you propose that to amplify some aligned agent, you just run it for a lot longer, or run many more copies of it and have them work together. I can buy that our initial agent is aligned; why should I trust the aggregate to be aligned?

When aligned agents work together, there’s often emergent behavior that can be described as non-aligned. For example, if the operator is pursuing a goal (like increasing YouTube’s revenue), one group of agents proposes a subgoal (like increasing YouTube views), and another group competently pursues that subgoal without understanding how it relates to the top-level goal (e.g. by triple-counting all the views), you end up with misaligned optimization. As another example, there might be some input (e.g. some weirdly compelling argument) that causes a group of aligned agents to “go insane” and behave unpredictably, or optimize for something against the operator’s wishes.

Two approaches that Paul considers important for preserving alignment:

  • Reliability amplification—aggregating agents that can each answer a question correctly some of the time (say, 80% of the time) into a system that answers questions correctly with arbitrarily high probability (see the sketch after this list).

  • Security amplification—winnowing down the set of queries that, when fed to the aggregate, cause the aggregate to “go insane”.
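
To give a feel for the reliability-amplification idea referenced above, here’s a toy calculation (my own illustration, which unrealistically assumes the copies’ errors are independent): majority voting among copies that are each right 80% of the time drives the error rate down quickly.

```python
from math import comb

# Toy illustration of reliability amplification via majority vote,
# assuming (unrealistically) that the copies' errors are independent.
def majority_correct_probability(p_correct, n_copies):
    """P(a majority of n independent copies answers correctly)."""
    majority = n_copies // 2 + 1
    return sum(
        comb(n_copies, k) * p_correct**k * (1 - p_correct)**(n_copies - k)
        for k in range(majority, n_copies + 1)
    )

print(majority_correct_probability(0.8, 1))    # 0.8
print(majority_correct_probability(0.8, 17))   # ~0.997
print(majority_correct_probability(0.8, 101))  # very close to 1
```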

It remains an open question in Paul’s agenda how alignment can be robustly preserved through capability amplification—in other words, how to increase the capabilities of aligned agents without introducing misaligned behavior.

(See: Capability amplification, Reliability amplification, Security amplification, Universality and security amplification, and Two guarantees.)

1.2.2: OK, so given this amplified aligned agent, how do you get the distilled agent?

Train a new agent via some combination of imitation learning (predicting the actions of the amplified aligned agent), semi-supervised reinforcement learning (where the amplified aligned agent helps specify the reward), and techniques for optimizing robustness (e.g. creating red teams that generate scenarios that incentivize subversion).
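
Here’s a rough schematic of how those three pieces might fit together in one training loop (my own sketch; the object interfaces and the equal weighting of the three steps are assumptions for illustration, not Paul’s specification):

```python
# Schematic distillation loop (illustrative only): the amplified agent supplies
# demonstrations and reward signals, and a red team supplies adversarial scenarios.

def distill(new_agent, amplified_agent, task_sampler, red_team, n_steps):
    for _ in range(n_steps):
        task = task_sampler()

        # 1. Imitation learning: predict what the amplified agent would do.
        demo = amplified_agent.act(task)
        new_agent.update_imitation(task, demo)

        # 2. Semi-supervised RL: the amplified agent scores the new agent's action.
        action = new_agent.act(task)
        new_agent.update_rl(task, action, amplified_agent.evaluate(task, action))

        # 3. Robustness: train against scenarios designed to elicit subversion.
        hard_task = red_team.generate(new_agent)
        hard_action = new_agent.act(hard_task)
        new_agent.update_rl(hard_task, hard_action,
                            amplified_agent.evaluate(hard_task, hard_action))
    return new_agent
```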

(See: RL+Imitation, Benign model-free RL, Semi-supervised reinforcement learning, and Techniques for optimizing worst-case performance.)

1.2.3: It seems like imitation learning might cause a lot of minutiae to get lost, and would create something that’s “mostly aligned” but actually not aligned in a bunch of subtle ways. Maybe this is tolerable for one round of iteration, but after 100 rounds, I wouldn’t feel very good about the alignment of the resulting agent...

Indeed, which is why this new agent is also trained with semi-supervised reinforcement learning, and optimized for robustness. The imitation learning is more about getting this new agent off the ground than about ensuring alignment. The bulk of the alignment guarantee comes from the semi-supervised reinforcement learning, where we train it to work on a wide range of tasks and answer questions about its cognition. We trust it to be aligned because we train it the same way we trained our original agent to be aligned—namely, by rewarding it for maximizing our approval, and punishing it for thinking subversive thoughts.

(There’s a common misconception that Paul’s agenda proceeds entirely by taking a single aligned assistant, running it for a lot longer, using imitation learning to distill that into a new assistant, running that new assistant a lot longer, using imitation learning to distill that, etc., much like how AlphaGo Zero improves its policy network. While this does capture a key idea behind how the assistant’s capabilities get amplified, it omits critical components of his agenda that ensure alignment, like inspecting cognition and amplifying robustness.)

1.2.4: This distilled agent’s cognition might be much more complex than that of our previous agent. How can we reasonably determine whether we’d approve of its cognition?

This is where our assistants come in. While the human operator does inspect some aspects of the distilled agent’s cognition, most of the inspection is done by our assistants, whose judgments we trust because we trust that they’re aligned.

(See The reward engineering problem.)

1.2.5: Given that this distilled agent is way more powerful than you and your aligned agents, couldn’t it manipulate you and your team of aligned agents?

If we’re amplifying and distilling agent A, we’re actually training our newly distilled agent to be stronger than A, but much weaker than the amplified version of A. The amplification/distillation procedure can be thought of as repeatedly taking two steps forward (in amplifying A) and one step back (in distilling a new agent).

Any subversive plan the distilled agent makes could get recognized as subversive by the amplified agent, because the amplified agent has enough power to thoroughly evaluate every aspect of the cognition that went into the plan. The amplified agent rarely performs thorough evaluations of the distilled agent’s cognition, because doing so is extremely expensive, but it does so randomly and punishes the distilled agent extremely harshly if it detects subversion. The distilled agent, knowing that it may have to thoroughly justify its actions to a competent supervisor, is thus strongly incentivized against acting subversively.

In other words, the amplified agent randomly “audits” the distilled agent, and punishes it very harshly if it fails the audit. Though the distilled agent knows that it might be able to deceive its supervisor when it isn’t audited, it’s so scared of the outcome where it tries to do that and gets audited that it doesn’t even want to try. (Even if you were 99% confident that you could get away with tax evasion, you wouldn’t want to try if you knew the government tortures and murders the families of the tax evaders it catches.)
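
The deterrence logic can be summarized with a toy expected-value calculation (made-up numbers, purely illustrative): even a small audit probability deters subversion if the penalty is large enough relative to the gain.

```python
# Toy deterrence arithmetic (made-up numbers, for illustration only).
gain_if_uncaught = 1.0        # reward from successful, undetected subversion
audit_probability = 0.01      # chance the amplified agent audits this action
penalty_if_caught = -1_000.0  # punishment if subversion is detected in an audit

expected_value_of_subverting = (
    (1 - audit_probability) * gain_if_uncaught
    + audit_probability * penalty_if_caught
)
print(expected_value_of_subverting)  # 0.99 - 10.0 = -9.01: subverting doesn't pay
```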

(See Benign model-free RL, the comments in The informed oversight problem, and William Saunders’ post Improbable Oversight, An Attempt at Informed Oversight.)

1.3 Robust alignment / corrigibility

1.3.1: If you delegate oversight to a well-meaning assistant (“hey, can you make sure things are going as planned?”), who delegates oversight to a well-meaning assistant, who delegates oversight to a well-meaning assistant, etc., then the default is for the delegatee 100 layers down to end up with some extremely distorted version of what you care about. Shouldn’t we expect this distortion to happen with the agents we produce?

Yes, which is why alignment isn’t the property we’re ultimately aiming to inductively guarantee. The property we’re trying to guarantee is something closer to “alignment + extreme caution about whether it’s aligned and cautious enough”. Paul usually refers to this as corrigibility.

This way, every time we train a distilled agent, we train it to want to clarify with its overseer (i.e., us, assisted by a team of corrigible assistants) whenever it’s uncertain about what we would approve of. Our corrigible assistants either answer the question confidently, or clarify with us if they’re uncertain about their answers.
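
A minimal sketch of this “clarify when uncertain” behavior (my own illustration; the confidence threshold and the agent/overseer interfaces are assumptions for the example):

```python
# Illustrative sketch of a corrigible "ask when uncertain" policy.
CONFIDENCE_THRESHOLD = 0.95  # made-up threshold for illustration

def answer_or_escalate(question, agent, overseer):
    """Answer directly only when confident; otherwise escalate up the chain,
    possibly all the way to the human operator."""
    answer, confidence = agent.answer_with_confidence(question)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer
    return overseer.clarify(question)
```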

1.3.2: OK, but doesn’t this just pass the buck onto corrigibility? In the same way that alignment can get distorted, couldn’t corrigibility also get distorted?

Paul has two strong intuitions about corrigibility:

  1. Corrigibility has a simple core that should be easy for humans to impart into powerful AI systems.

  2. Corrigibility is robust—agents that are somewhat corrigible want to become more corrigible.

On this view, even if human approval is high-complexity and hard to get right, corrigibility is straightforward to learn and more likely to improve over distillation than to worsen.

You also need robustness amplification to ensure that corrigibility is maintained. For example, a corrigible supervisor needs to be able to thoroughly evaluate a distilled agent’s cognition, but if its evaluations aren’t reliable, it can’t be trusted to ensure that the distilled agent is corrigible (and not a daemon). And a corrigible agent that “goes insane” doesn’t stay corrigible.

(See: Corrigibility.)

1.3.3: I don’t share those intuitions around corrigibility. Do you have any intuition pumps?

One intuition pump: corrigibility can be thought of as extreme caution about whether you’re actually being helpful, and extreme caution is robust—if you’re extremely cautious about how things can go wrong, you want to know more ways things can go wrong, and you want to improve your ability to spot how things are going wrong, which will lead you to become even more cautious.

Another intuition pump: I have some intuitive concept of “epistemically corrigible humans”. Some things that gesture at this concept:

  • They care deeply about finding the truth, and about improving their skill at finding the truth.

  • They’re aware that they’re flawed reasoners, with biases and blind spots, and they actively seek out ways to notice and remove these flaws. They try to take ideas seriously, no matter how weird they seem.

  • Their beliefs tend to become more true over time.

  • Their skill at arriving at true beliefs improves over time.

  • They tend to reach similar conclusions in the limit (namely, the correct ones), even if those conclusions are extremely weird and not broadly accepted.

I think of corrigible assistants as being corrigible in the above way, except optimizing for helping their operators instead of finding the truth. Importantly, so long as an agent crosses some threshold of corrigibility, it will want to become more and more cautious about whether it’s being helpful, which is where robustness comes from.

Given that corrigibility seems like a property that any reasoner could have (and not just humans), it’s probably not too complicated a concept for a powerful AI system to learn, especially given that many humans seem able to learn some version of it.

1.3.4: This corrigibility thing still seems really fishy. It feels like you just gave some clever arguments about something very fuzzy and handwavy, and I don’t feel comfortable trusting that.

While Paul thinks there’s a good intuitive case for something like corrigibility, he also considers getting a deeper conceptual understanding of corrigibility one of the most important research directions for his agenda. He agrees it’s possible that corrigibility may not be safely learnable, or not actually robust, in which case he’d feel way more pessimistic about his entire agenda.

2. Usefulness

2.1. Can the system be both safe and useful?

2.1.1: A lot of my values and knowledge are implicit. Why should I trust my assistant to be able to learn my values well enough to assist me?

Imagine a question-answering system, trained on all the data on Wikipedia, that ends up with comprehensive, gears-level world-models, which it can use to synthesize existing information to answer novel questions about social interactions or what our physical world is like. (Think Wolfram|Alpha, but much better.)

This system is something like a proto-AGI. We can easily restrict it (for example, by limiting how long it gets to reflect when answering questions) so that we can train it to be corrigible while trusting that it’s too limited to do anything dangerous that the overseer couldn’t recognize as dangerous. We use such a restricted system to start off the iterated distillation and amplification process, and bootstrap it to get systems of arbitrarily high capability.

(See: Automated assistants)

2.1.2: OK, sure, but it’ll essentially still be an alien and get lots of minutiae about our values wrong.

How bad is it really if it gets minutiae wrong, as long as it doesn’t cause major catastrophes? Major catastrophes (like nuclear wars) are pretty obvious, and we would obviously disapprove of actions that lead us to catastrophe. So long as it learns to avoid those (which it will, if we give it the right training data), we’re fine.

Also keep in mind that we’re training it to be corrigible, which means it’ll be very cautious about what sorts of things we’d consider catastrophic, and try very hard to avoid them.

2.1.3: But it might make lots of subtle mistakes that add up to something catastrophic!

And so might we. Maybe there are some classes of subtle mistakes the AI will be more prone to than we are, but there are probably also classes of subtle mistakes we’ll be more prone to than the AI. We’re only shooting for our assistant to avoid trying to lead us to a catastrophic outcome.

(See: Techniques for optimizing worst-case performance.)

2.1.4: I’m really not sold that training it to avoid catastrophes and training it to be corrigible will be good enough.

This is actually more a capabilities question (is our system good enough at trying very hard to avoid catastrophes to actually avoid a catastrophe?) than an alignment question. A major open question in Paul’s agenda is how we can formalize performance guarantees well enough to state actual worst-case guarantees.

(See: Two guarantees and Techniques for optimizing worst-case performance.)

2.2. Universality

2.2.1. What sorts of cognition will our assistants be able to perform?

We should roughly expect it to think in ways that would be approved by an HCH (short for “Humans Consulting HCH”). To describe HCH, let me start by describing weak HCH:

Consider a human Hugh who has access to a question-answering machine. Suppose the machine answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering machine.

That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh…
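
A minimal sketch of the weak-HCH recursion (my own toy code; the human_answer and decompose functions, and the budget cap that makes the toy version terminate, are assumptions for illustration):

```python
# Toy sketch of weak HCH: a human answers a question, optionally by breaking it
# into subquestions and consulting (simulated) copies of themselves on each one.

def weak_hch(question, human_answer, decompose, budget):
    """Answer `question` the way the human would, given recursive access to
    copies of that human. `budget` caps the recursion depth."""
    if budget <= 0:
        return human_answer(question, subanswers=[])
    subquestions = decompose(question)
    subanswers = [
        weak_hch(sub, human_answer, decompose, budget - 1)
        for sub in subquestions
    ]
    return human_answer(question, subanswers)
```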

I sometimes picture this as an infinite tree of humans-in-boxes, who can break down questions and pass them to other humans-in-boxes (who can break down those questions and pass them along to other humans-in-boxes, etc.) and get back answers instantaneously. A few remarks:

  • This formalism tries to capture some notion of “what would H think about some topic if H thought about it for arbitrarily long amounts of time?” For example, H might make partial progress on some question, then share this progress with another H and ask them to make more progress, and that H might do the same.

  • A weak HCH could simulate the cognitive labor of an economy the size of the US economy. After all, a weak HCH can emulate a single human thinking for a long time, so it can emulate teams of humans thinking for a long time, and thus teams of teams of humans thinking for a long time, etc. If you imagine a corporation as teams of teams of teams of humans performing cognitive labor, you get that a weak HCH can emulate the output of an arbitrary corporation, and thus of collections of arbitrary corporations communicating with one another.

  • Many tasks that don’t intuitively seem like they can be broken down can in fact be substantially broken down. For example, making progress on difficult math problems seems hard to decompose. But you could break progress on a math problem down into something like (think for a while about possible angles of attack) + (try each angle of attack, and recurse on the new math problem). And (think for a while about possible angles of attack) can be reduced to (look at features of this problem and see if you’ve solved anything similar), which can be reduced to focusing on specific features, and so on.

Strong HCH, or just HCH, is a variant of weak HCH in which the agents-in-boxes are able to communicate with each other directly, and to read and write to some shared external memory, in addition to being able to ask, answer, and break down questions. Note that they would be able to implement arbitrary Turing machines this way, and thus avoid any limits on cognition imposed by the structure of weak HCH.

(Note: most people think “HCH” refers to weak HCH, but whenever Paul mentions HCH, he now means strong HCH.)

The exact relationship between HCH and the agents produced through iterated amplification and distillation is confusing and very commonly misunderstood:

  • HCH should not be visualized as having humans in the box. It should be thought of as having some corrigible assistant inside the box, much like the question-answering system described in 2.1.1.

  • Throughout the iterated amplification and distillation process, there is never any agent whose cognition resembles an HCH of the corrigible assistant. In particular, agents produced via distillation are general RL agents with no HCH-like constraints on their cognition. The closest resemblance to HCH appears during amplification, when a superagent (formed out of copies of the agent being amplified) performs tasks by breaking them down and distributing them among the agent copies.

(As of this writing, I am still confused about the sense in which the agent’s cognition is approved by an HCH, and what that means about the agent’s capabilities.)

(See: Humans consulting HCH and Strong HCH.)

2.2.2. Why should I think the HCH of some simple question-answering AI assistant can perform arbitrarily complex cognition?

All difficult and creative insights stem from chains of smaller and easier insights. So long as our first AI assistant is a universal reasoner (i.e., it can implement arbitrary Turing machines via reflection), it should be able to realize arbitrarily complex things if it reflects for long enough. For illustration, Paul thinks that chimps aren’t universal reasoners, and that most humans past some intelligence threshold are universal.

If this seems counterintuitive, I’d claim it’s because we have poor intuitions about what’s achievable with 2,000,000,000 years of reflection. For example, it might seem that an IQ 120 person who knows no math beyond arithmetic would simply be unable to prove Fermat’s last theorem given arbitrary amounts of time. But if you buy that:

  • An IQ 180 person could, in 2,000 years, prove Fermat’s last theorem knowing nothing but arithmetic (which seems feasible, given that most mathematical progress was made by people with IQs under 180),

  • An IQ 160 person could, in 100 years, make the intellectual progress an IQ 180 person could make in 1 year,

  • An IQ 140 person could, in 100 years, make the intellectual progress an IQ 160 person could make in 1 year, and

  • An IQ 120 person could, in 100 years, make the intellectual progress an IQ 140 person could make in 1 year,

then it follows that an IQ 120 person could prove Fermat’s last theorem given 2,000 × 100 × 100 × 100 = 2,000,000,000 years’ worth of reflection.

(See: Of humans and universality thresholds.)

2.2.3. Different reasoners can reason in very different ways and reach very different conclusions. Why should I expect my amplified assistant to reason anything like me, or reach conclusions that I’d have reached?

You shouldn’t expect it to reason anything like you, you shouldn’t expect it to reach the conclusions you’d reach, and you shouldn’t expect it to realize everything you’d consider obvious (just like you wouldn’t realize everything it would consider obvious). You should expect it to reason in ways you approve of, which should constrain its reasoning to be sensible and competent, as far as you can tell.

The goal isn’t to have an assistant that can think like you or realize everything you’d realize. The goal is to have an assistant who can think in ways that you consider safe and substantially helpful.

2.2.4. HCH seems to depend critically on being able to break down arbitrary tasks into subtasks. I don’t understand how you can break down tasks that are largely intuitive or perceptual, like playing Go very well, or recognizing images.

Go is actually fairly straightforward: an HCH can just perform an exponential tree search. Iterated amplification and distillation applied to Go is not actually that different from how AlphaZero trains to play Go.
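
As an illustration of what “exponential tree search by decomposition” could look like (my own toy sketch, not a description of AlphaZero; the game interface is assumed):

```python
# Toy sketch: evaluating a game position by decomposition, the way an HCH might
# hand each "what happens after this move?" subquestion to a copy of itself.

def evaluate_position(game, position, depth):
    """Value of `position` for the player to move (+1 = win, -1 = loss)."""
    if depth == 0 or game.is_over(position):
        return game.score(position)  # score from the player-to-move's perspective
    return max(
        -evaluate_position(game, game.play(position, move), depth - 1)
        for move in game.legal_moves(position)
    )
```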

Image recognition is harder, but to the extent that humans have clear concepts of visual features they can reference within images, the HCH should be able to focus on those features. The cat vs. dog debate in Geoffrey Irving’s approach to AI safety via debate gives some illustration of this.

Things get particularly tricky when humans are faced with a task they have little explicit knowledge about, like translating sentences between languages. Paul did mention something like “at some point, you’ll probably just have to stick with relying on some brute statistical regularity, and just use the heuristic that X commonly leads to Y, without being able to break it down further”.

(See: Wei Dai’s comment on Can Corrigibility be Learned Safely, and Paul’s response to a different comment by Wei Dai on the topic.)

2.2.5: What about tasks that require significant accumulation of knowledge? For example, how would the HCH of a human who doesn’t know calculus figure out how to build a rocket?

This sounds difficult for weak HCH on its own to overcome, but possible for strong HCH to overcome. The accumulated knowledge would be represented in the strong HCH’s shared external memory, and the humans would essentially act as “workers” implementing a higher-level cognitive system, much like ants in an ant colony. (I’m still somewhat confused about what the details of this would entail, and am interested in seeing a more fleshed-out implementation.)

2.2.6: It seems like this capacity to break tasks into subtasks is pretty subtle. How does the AI learn to do this? And how do we find human operators (besides Paul) who are capable of doing this?

Ought is gathering empirical data about task decomposition. If that proves successful, Ought will have numerous publicly available examples of humans breaking down tasks.

3. State of the agenda

3.1: What are the current major open problems in Paul’s agenda?

The most important open problems in Paul’s agenda, according to Paul:

  • Worst-case guarantees: How can we make worst-case guarantees about the reliability and security of our assistants? For example, how can we ensure our oversight is reliable enough to prevent the creation of subversive subagents (a.k.a. daemons) in the distillation process that would cause our overall agent to be subversive?

  • Transparent cognition: How can we extract useful information from ML systems’ cognition? (E.g., what concepts are represented in them, what logical facts are embedded in them, and what statistical regularities about the data they capture.)

  • Formalizing corrigibility: Can we formalize corrigibility to the point that we can create agents that are knowably and robustly corrigible? For example, could we formalize corrigibility, use that formalization to prove the existence of a broad basin of corrigibility, and then prove that ML systems past some low threshold will land and stay in this basin?

  • Aligned capability amplification: Can we perform amplification in a way that doesn’t introduce alignment failures? In particular, can we safely decompose every task we care about without effectively implementing an aligned AGI built out of human transistors?

(See: Two guarantees, The informed oversight problem, Corrigibility, and the “Low Bandwidth Overseer” section of William Saunders’ post Understanding Iterated Distillation and Amplification: Claims and Oversight.)

3.2: How close to completion is Paul’s research agenda?

Not very close. For all we know, these problems might be extraordinarily difficult. For example, a subproblem of “transparent cognition” is “how can humans understand what goes on inside neural nets?”, which is a broad open question in ML. Subproblems of “worst-case guarantees” include ensuring that ML systems are robust to distributional shift and adversarial inputs, which are also broad open questions in ML, and which might require substantial progress on MIRI-style research to articulate and prove formal bounds. And getting a formalization of corrigibility might require formalizing aspects of good reasoning (like calibration about uncertainty), which might in turn require substantial progress on MIRI-style research.

I think people commonly conflate “Paul has a safety agenda he feels optimistic about” with “Paul thinks he has a solution to AI alignment”. Paul does in fact feel optimistic that these problems will get solved well enough for his agenda to work, but he does not consider his research agenda anything close to complete.

(See: Universality and security amplification; search for “MIRI”.)


Thanks to Paul Christiano, Ryan Carey, David Krueger, Rohin Shah, Eric Rogstad, and Eli Tyre for helpful suggestions and feedback.