Understanding Iterated Distillation and Amplification: Claims and Oversight

[Background: Intended for an audience that has some familiarity with Paul Christiano’s approach to AI Alignment. Understanding Iterated Distillation and Amplification should provide sufficient background.]

[Disclaimer: When I talk about “what Paul claims”, I am only summarizing what I think he means through reading his blog and participating in discussions on his posts. I could be mistaken/misleading in these claims.]

I’ve recently updated my mental model of how Paul Christiano’s approach to AI alignment works, based on recent blog posts and discussions around them (in which I found Wei Dai’s comments particularly useful). I think that the update I made might be easy to miss if you haven’t read the right posts/comments, so I think it’s useful to lay it out here. I cover two parts: understanding the limits on what Paul’s approach claims to accomplish, and understanding the role of the overseer in Paul’s approach. These considerations are important to understand if you’re trying to evaluate how likely this approach is to work, or trying to make technical progress on it.

What does Paul’s approach claim to accomplish?

First, it’s important to understand what “Paul’s approach to AI alignment” claims to accomplish if it were carried out. The term “approach to AI alignment” can sound like it means “recipe for building a superintelligence that safely solves all of your problems”, but this is not how Paul intends to use this term. Paul goes into this in more detail in Clarifying “AI alignment”.

A rough summary is that his approach will only build an agent that is as capable as some known unaligned machine learning algorithm.

He does not claim that the end result of his approach is an agent that:

  • Can directly solve all problems which can be solved by a human

  • Will never take an unsafe catastrophic action

  • Will never take an action based on a misunderstanding of your commands or your values

  • Could safely design successor agents or self-improve

  • Will have higher capability than an unaligned competitor

It’s important to understand the limits of what Paul’s approach claims in order to understand what it would accomplish, and the strategic situation that would result.

What is the Overseer?

Iterated Distillation and Amplification (IDA) describes a procedure that tries to take an overseer and produce an agent that does what the overseer would want it to do, with a reasonable amount of training overhead. Here, “what the overseer would want it to do” is defined by repeating the amplification procedure. The post refers to amplification as the overseer using a number of machine-learned assistants to solve problems. We can bound what IDA could accomplish by thinking about what the overseer could do if it could delegate to a number of copies of itself to solve problems (for a human overseer, this corresponds to HCH). To understand what this approach can accomplish, it’s important to understand what the overseer is doing. I think there are two different models of the overseer that could be inferred from different parts of the discussion around Paul’s work, which I label high bandwidth oversight and low bandwidth oversight.
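To keep the structure in mind, here is a minimal sketch of the amplify-then-distill loop. All of the names are illustrative placeholders under my reading of IDA, not Paul’s actual algorithm or any particular library’s API:

```python
def amplify(overseer, assistant, question):
    # The overseer answers a question by decomposing it, delegating the
    # subquestions to the current ML assistant, and combining the answers.
    subquestions = overseer.decompose(question)
    subanswers = [assistant.answer(q) for q in subquestions]
    return overseer.combine(question, subanswers)

def distill(amplified_answerer, training_questions):
    # Train a fast ML model to imitate the slow amplified overseer.
    dataset = [(q, amplified_answerer(q)) for q in training_questions]
    return train_imitation_model(dataset)  # hypothetical training routine

def ida(overseer, assistant, training_questions, n_rounds):
    for _ in range(n_rounds):
        amplified = lambda q, a=assistant: amplify(overseer, a, q)
        assistant = distill(amplified, training_questions)
    return assistant
```

Repeatedly delegating to copies of itself (rather than to a trained assistant) gives the HCH-style upper bound mentioned above.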

High Bandwidth Overseer

The impression that I get from many of Paul’s posts is that the overseer works as follows:

A high bandwidth overseer is a human that takes in an input and has some amount of time (between 15 minutes and a day) to process it. The human can look at as much of the input as they can within the allowed time, and when they delegate a sub-task they can look at the results. The main requirement for a human acting as an overseer is that they are trying to be helpful. The overseer can take in arbitrary natural-language requests and attempt to fulfill them.

The main question that comes to mind when considering this model of the overseer is the problem of competent task decomposition:

Can we break down tasks into reasonably sized pieces (i.e., can you solve a math problem from a textbook in 15 minutes by delegating to a number of people who don’t understand the material and only have 15 minutes each to interact with the textbook)?

This high bandwidth oversight model can be easier to use when trying to get a rough idea of how oversight works, in order to understand how IDA works as a whole.

Low Bandwidth Overseer

However, looking at Universality and Security Amplification, I found a different picture of what the overseer looks like:

A low bandwidth overseer is only allowed to process inputs from a finite input set. This includes both the original input query and the results of subqueries. The overseer can pass information about the input or from a subquery to other copies of itself without looking at it.

To give a sense for the size of the input set: in Universality and Security Amplification, Paul speculates that input sets of size 10⁸ would probably be safe and input sets of size 10³⁰ would probably be unsafe.

To give a sense of what this implies, the range of input set sizes from 10⁸ to 10³⁰ roughly corresponds to the following (a quick arithmetic check is sketched after the list):

  • 6 to 24 pixels of arbitrary 4-bit greyscale images

  • 10 to 38 words from typical English text

  • arbitrary phrases of 2 to 7 words from a vocabulary of 10,000 words

  • arbitrary strings of 5 to 20 lowercase alphabetic characters
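A quick back-of-the-envelope check of three of these correspondences (my own arithmetic, not numbers from the post; the English-text line depends on an estimate of entropy per word that I don’t reproduce here):

```python
# Count the distinct inputs for each format and compare against 1e8 and 1e30.
print(16 ** 6, 16 ** 24)         # 4-bit greyscale pixels: ~1.7e7 and ~7.9e28
print(10_000 ** 2, 10_000 ** 7)  # 2- and 7-word phrases over a 10,000-word vocabulary: 1e8 and 1e28
print(26 ** 5, 26 ** 20)         # lowercase alphabetic strings: ~1.2e7 and ~2.0e28
```

These land within an order of magnitude or two of 10⁸ and 10³⁰, which is about as precise as the ranges above are meant to be.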

A source of potential confusion is that Paul still talks about a human fulfilling the role of the low bandwidth overseer (there are reasons for this that I won’t cover here). But when the space of information the human overseer can consider is reduced to a finite set, we could simply evaluate what the human does on every element in that set and produce a lookup table that replaces the human. In other words, if you don’t think that some task could be accomplished by an amplified lookup table as overseer, then you shouldn’t think it could be accomplished with a low bandwidth overseer.
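A minimal sketch of that replacement, assuming a hypothetical `human_policy` function (enumerating all ~10⁸ allowed inputs is of course not something you would literally do by hand):

```python
def build_lookup_overseer(human_policy, finite_input_set):
    # Record what the human would do on every allowed input,
    # then answer future queries purely from the table.
    table = {x: human_policy(x) for x in finite_input_set}
    return lambda x: table[x]
```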

The reason this system is introduced is to avoid security issues as the system scales. The fear is that there could be an “attack” on the system: an input that could be shown to an overseer that would cause the overseer to become corrupted and try to sabotage the system. This could be some kind of misleading philosophical argument, some form of blackmail, a human adversarial example, etc. If an input like this exists, then as soon as the first agent is corrupted, it can try to spread the attack to other agents. The first agent could be corrupted either by chance, or through an attack being included in the input. Paul’s proposal in Universality and Security Amplification is to avoid this by limiting the set of inputs that the overseer can process, and then using various techniques to verify that the overseer is not attackable on that set of inputs. I read this as implying that he thinks a high bandwidth overseer would be sufficiently vulnerable to attack that after some number of rounds of amplification, the overseer would be likely to become misaligned.

How does a low bandwidth overseer impact the difficulty of IDA?

Previously, I was mostly starting with the model of a high bandwidth overseer when thinking about IDA. I now think that Paul’s approach is intended to use low bandwidth oversight, and that reasoning about the difficulty of the approach requires using this model.

Using a low bandwidth overseer more clearly requires solving a set of hard problems prior to using IDA at all:

  • Figuring out how to decompose tasks into low-bandwidth pieces. The task decomposition needs to be both corrigible and capable.

    • Corrigibility requires understanding of AI safety concepts. For example, breaking down the task “What action does the user want me to take?” into the two subtasks “What are the user’s values?” and “What action is best according to these values?” is not corrigible. It produces an action optimized for some approximate model of the user’s values, which could be misaligned. (A toy contrast is sketched after this list.)

    • Capability requires having the ability to complete “enough” tasks. Paul acknowledges that the overseer will no longer be able to directly perform some tasks as well as a human (i.e., they won’t be able to translate sentences as well if they are only able to look at small pieces of the original text). But he claims that in this case it would be okay for the overseer to instead understand how to safely use information provided by external resources (i.e., human translators).

  • Clarifying the currently imprecise definition of “corrigibility” in order to be able to reason about whether a given task decomposition is corrigible.

  • Sufficiently understanding meta-philosophy to implement it explicitly using collaborations of “lookup table overseers”. Even if we have an overseer that is corrigible and capable of performing well-understood and well-defined tasks, it’s not clear that if we ask it to start evaluating tricky philosophical questions it will come up with aligned answers. This roughly looks like figuring out an intractably large algorithm that would output the products of MIRI’s Highly Reliable Agent Design research (i.e., produce a set of criteria for what makes a good decision theory, a search space of possible decision theories, and a prior over that space).
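To make the corrigibility point above more concrete, here is a toy contrast in my own framing (not a quote from Paul); the “decompositions” are just the subquestions that would be handed to copies of the overseer, and the second one is only meant to be more corrigible, not a solution:

```python
def non_corrigible_decomposition(user_request):
    # Optimizes an action against an approximate model of the user's values;
    # errors in that model get optimized over rather than surfaced.
    return ["What are the user's values?",
            "What action is best according to these values?"]

def more_corrigible_decomposition(user_request):
    # Stays anchored to what the user would actually instruct or endorse, and
    # surfaces uncertainty instead of optimizing through it. (Hypothetical
    # subquestions, for illustration only.)
    return ["What action does the user most plausibly intend with this request?",
            "Is there anything about that action the user would likely object to?",
            "If it is unclear, how should we ask the user for clarification?"]
```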

When considering the high bandwidth oversight regime, these problems seem more like they could be (at least partially) delegated to the high bandwidth overseer at runtime. For example, one could think of preparing a summary of the relevant factors of a decision and giving it to a highly competent human overseer. The human overseer might be able to make good enough decisions without having an explicit model of decision theory (especially if odd edge cases don’t come up in practice).

Evaluating IDA

There are a number of lines of reasoning that one could use to try to get a sense of whether IDA could work, or how it should be prioritized relative to other approaches. Here, I summarize some of these considerations and indicate which versions of oversight they apply to, as I think this is important to be clear about. I don’t make claims about how these should be weighed, or claim to be exhaustive (there are additional considerations that would weigh on both forms of oversight, or other parts of IDA).

Evaluating High Bandwidth Oversight

Positive Considerations

  • Reasons for general optimism about “learning to reason from humans”, such as those that Daniel Dewey discusses in his thoughts on MIRI’s “highly reliable agent design” work. For example, the individual overseers are acting like humans. This would lead the system to consider problems in the way that humans would: incorporating common sense about human motivations, coming up with solutions that are easier for humans to understand and less likely to contain weird surprises, and operating on inputs that are understandable.

  • If it seems like high bandwidth task decomposition could solve difficult problems (this is the question that Ought is running experiments to investigate).

  • It seems like some version of IDA with high bandwidth oversight could be implemented today.

  • High bandwidth oversight seems like it could be implemented without solving many of the hard philosophical/mathematical problems that are faced by HRAD (decision theory, induction, etc.), instead trading them for difficult machine learning problems.

Negative Consideration: Not Scalable and Secure

In high bandwidth oversight, if one copy of the overseer is corrupted, this corruption could easily spread. As more copies of the overseer are used to solve a problem, there is a higher chance of one copy becoming corrupted. Or an adversary could introduce a corrupting input into the system.
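To illustrate the scaling worry with made-up numbers (standard probability, not figures from Paul): if each overseer invocation is independently corrupted with probability p, the chance that at least one of N invocations is corrupted is 1 - (1 - p)^N, which climbs quickly as N grows:

```python
p = 1e-6  # hypothetical per-invocation corruption probability
for n in (10**3, 10**6, 10**9):
    print(n, 1 - (1 - p) ** n)
# ~0.001 at a thousand invocations, ~0.63 at a million, ~1.0 at a billion.
```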

Evaluating Low Bandwidth Oversight

Positive Considerations

  • Paul’s intuitions suggest that it is possible to work with low bandwidth oversight, and he has discussed some aspects of how various problems might be handled.

  • If it seems plausible that meta-philosophy might be easier than object-level philosophy: that is, defining an algorithm that we can’t run directly but that outputs the correct decision theory might be easier than specifying the correct decision theory directly.

  • Paul claims that this approach will scale with increasing capability for any distillation algorithm.

Negative Consideration: No longer “learning to reason from humans”

I think that IDA with a low bandwidth overseer is not accurately described as “AI learns to reason from humans”, but rather as “Humans figure out how to reason explicitly, then the AI learns from the explicit reasoning”. As Wei Dai has pointed out, amplified low bandwidth oversight will not actually end up reasoning like a human. Humans have implicit knowledge that helps them perform tasks when they see the whole task, but not all of this knowledge can be understood and broken into smaller pieces. Low bandwidth oversight requires that the overseer not use any of this knowledge.

Now, it’s quite possible that performance could still be recovered by doing things like searching over a solution space, or by reasoning about when it is safe to use training data from insecure humans. But these solutions could look quite different from human reasoning. In discussion on Universality and Security Amplification, Paul describes why he thinks that a low bandwidth overseer could still perform image classification, but the process looks very different from a human using their visual system to interpret the image:

“I’ve now played three rounds of the following game (inspired by Geoffrey Irving who has been thinking about debate): two debaters try to convince a judge about the contents of an image, e.g. by saying “It’s a cat because it has pointy ears.” To justify these claims, they make still simpler claims, like “The left ear is approximately separated from the background by two lines that meet at a 60 degree angle.” And so on. Ultimately if the debaters disagree about the contents of a single pixel then the judge is allowed to look at that pixel. This seems to give you a tree to reduce high-level claims about the image to low-level claims (which can be followed in reverse by amplification to classify the image). I believe the honest debater can quite easily win this game, and that this pretty strongly suggests that amplification will be able to classify the image.”
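The tree structure in this game can be made concrete with a toy sketch (my rendering of the idea, not code from Paul; `claim` and `debaters` are hypothetical objects):

```python
def judge(claim, image, debaters):
    # Recursively settle a claim about an image, only ever inspecting single pixels.
    if claim.is_about_single_pixel():
        return image[claim.pixel] == claim.value   # the judge may look at one disputed pixel
    # e.g. "the left ear is separated from the background by two lines meeting at ~60 degrees"
    subclaims = debaters.disputed_subclaims(claim)
    return all(judge(c, image, debaters) for c in subclaims)
```

Read top-down this is the debate game; read bottom-up it suggests how amplification could rebuild the high-level classification from pixel-level answers.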

Conclusion: Weighing Evidence for IDA

The important takeaway is that considering IDA requires clarifying whether you are considering IDA with high or low bandwidth oversight, and then only counting considerations that actually apply to that approach. I think there’s a way to misunderstand the approach where you mostly think about high bandwidth oversight and count in its favor that it feels somewhat understandable, that it seems plausible to you that it could work, and that it avoids some hard problems. But if you then also count Paul’s opinion that it could work, you may end up overconfident: the approach that Paul claims is most likely to work is the low bandwidth oversight approach.

Additionally, I think it’s useful to consider both models as alternative tools for understanding oversight: for example, the problems in low bandwidth oversight might be less obvious but still important to consider in the high bandwidth oversight regime.

After understanding this, I am more nervous about whether Paul’s approach would work if implemented, due to the additional complications of working with low bandwidth oversight. I am somewhat optimistic that further work (such as fleshing out how particular problems could be addressed through low bandwidth oversight) will shed light on this issue, and either make it seem more likely to succeed or yield more understanding of why it won’t succeed. I’m also still optimistic about Paul’s approach yielding ideas or insights that could be useful for designing aligned AIs in different ways.

Caveat: high bandwidth oversight could still be useful to work on

High bandwidth oversight could still be useful to work on for the following reasons:

  • If you think that other solutions could be found to the security problem in high bandwidth oversight. Paul claims that low bandwidth oversight is the most likely solution to security issues within the overseer, but he thinks it may be possible to make IDA with high bandwidth oversight secure using various techniques for optimizing worst-case performance of the final distilled agent, even if the overseer is insecure (see https://ai-alignment.com/two-guarantees-c4c03a6b434f).

  • It could help make progress on low bandwidth oversight. If high bandwidth oversight fails, then so will low bandwidth oversight. If high bandwidth oversight succeeds, then we might be able to break down each of the subtasks into low bandwidth tasks, directly yielding a low bandwidth overseer. I think the factored cognition experiments planned by Ought plausibly fall into this category.

  • If you think it could be used as a medium-term alignment solution or a fallback plan if no other alignment approach is ready in time. This seems like it would only work if it is used for limited tasks and a limited amount of time, in order to extend the time window for preparing a truly scalable approach. In this scenario, it would be very useful to have techniques that could help us understand how far the approach could be scaled before failure.