
# Understanding Iterated Distillation and Amplification: Claims and Oversight

17 Apr 2018 22:36 UTC
70 points

# Machine Learning Projects on IDA

24 Jun 2019 18:38 UTC
50 points

# Amplification Discussion Notes

1 Jun 2018 19:03 UTC
41 points

# HCH is not just Mechanical Turk

9 Feb 2019 0:46 UTC
37 points

# Reinforcement Learning in the Iterated Amplification Framework

9 Feb 2019 0:56 UTC
24 points
• Christmas.

• Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.

Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of the complexity (e.g. IDA based off of humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but that can seem like plausible things to talk about in terms of “solving the AI alignment problem” if one hasn’t understood the more subtle problems that would occur. It’s then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.

(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted to accomplishing specific tasks that stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, e.g. the experiments that Ought is running.)

• Paul, to what degree do you think your approach will scale indefinitely while maintaining corrigibility vs. just thinking that it will scale while maintaining corrigibility to the point where we “get our house in order”? (I feel like this would help me in understanding the importance of particular objections, though objections relevant to both scenarios are probably still relevant.)

• So I also don’t see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.

My model of Paul’s approach sees the alignment of the subagents as just telling you that no subagent is trying to actively sabotage your system (i.e. by optimizing to find the worst possible answer to give you), and that the alignment comes from having thought carefully about how the subagents are supposed to act in advance (in a way that could potentially be run just by using a lookup table).

• From my current understanding of Paul’s IDA approach, I think there are two different senses in which corrigibility can be thought about in regards to IDA, each with a different level of guarantee.

1. On average, the reward function incentivizes behaviour which competes effectively and gives the user effective control.
2. There do not exist inputs on which the policy chooses an action because it is bad, or the value function outputs a high reward because the prior behaviour was bad. (Or else the policy on its own will generate bad consequences.)
3. The reward function never gives a behaviour a higher reward because it is bad. (Or else the test-time optimisation by MCTS can generate bad behaviour.) For example, if the AI deludes the human operator so that the operator can’t interfere with the AI’s behaviour, that behaviour can’t receive a higher reward even if it ultimately allows the AI to make more money.

Property 1 is dealing with “consequence corrigibility” (competence at producing actions whose outcomes in the world we would describe as corrigible).

Properties 2 & 3 are dealing with corrigibility in terms of “intent corrigibility” (guaranteeing that the system does not optimise for bad outcomes). This does not cover the agent incompetently causing bad outcomes in the world, only the agent deliberately trying to produce bad outcomes.

I think IDA doesn’t require or claim worst-case guarantees on the task of “consequence corrigibility” (and that this is an impossible goal for bounded reasoners).

I think that average-case good performance on “consequence corrigibility” is claimed by IDA, but only as a subset of general competence.

I think that providing worst-case guarantees on “intent corrigibility” is required and claimed by IDA.

Roughly, I think that:

• Versions of IDA that allow the overseer nodes more information could be generally competent (including predicting what behaviour could be corrigible), but could fail to be “intent corrigible”

• Versions of IDA that allow the overseer nodes only a highly restricted set of queries could be “intent corrigible” but fail to be generally competent, and hence not be “consequence corrigible”

• Standard ML approaches will, at some level of optimisation power, fail to behave “intent corrigibly” (even if you train them to be “consequence corrigible”)

The question I’m uncertain about is whether there’s a middle point in tradeoff space where both properties are sufficiently satisfied to produce good outcomes.

Do you agree or disagree with how I’ve broken down corrigibility claims for IDA, and which claims do you think your argument bears on?

• I would see the benefits of humans vs. algorithms being that giving a human a bunch of natural-language instructions would be much easier (but harder to verify) than writing down a formal algorithm. Also, the training could just cover how to avoid taking incorrigible actions, and the Overseer could still use their judgement about how to perform competently within the space of corrigible outputs.

• Paul, it might be helpful to clar­ify the sort of things you think your ap­proach re­lies upon in re­gards to bounds on the amount of over­head (train­ing time, hu­man sam­ple com­plex­ity), or the amount of over­head that would doom your ap­proach. If I re­call cor­rectly, I think you’ve wanted the ap­proach to have some rea­son­able con­stant over­head rel­a­tive to an un­al­igned sys­tem, though I can’t find the post at the mo­ment? It might also be helpful to have bounds, or at least your guesses on the mag­ni­tude of num­bers re­lated to in­di­vi­d­ual com­po­nents (ie. the rough num­bers in the Univer­sal­ity and Se­cu­rity am­plifi­ca­tion post).

• Open Question: Working with concepts that the human can’t understand

Question: when we need to assemble complex concepts by learning/interacting with the environment, rather than using H’s concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?

Paul: I don’t have any general answer to this; it seems like we should probably choose some example cases. I’m probably going to be advocating something like “Search over a bunch of possible concepts and find one that does what you want / has the desired properties.”

E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like “What would I infer from learning that a proof is elegant, other than that it will work?” and make sure that you are OK with that.

Andreas: Suppose you don’t have the concepts of “proof” and “inquiry”, but learned them (or some more sophisticated analogs) using the sort of procedure you outlined below. I guess I’m trying to see in more detail that you can do a good job at “making sure you’re OK with reasoning in ways X” in cases where X is far removed from H’s concepts. (Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)

This may be related to the more general question of what sorts of instructions you’d give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.
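
To make the shape of Paul’s proposal above concrete, here is a minimal sketch of the “search over possible concepts and evaluate each one” procedure. Everything in it (the candidate heuristics, the labelled lines of inquiry, the side checks) is a hypothetical stand-in, not something specified in the discussion:

```python
# A toy sketch of "search over candidate concepts": generate candidate
# scoring heuristics, keep the one that best separates lines of inquiry
# known to have succeeded from ones that failed, then run extra checks
# before adopting it.

def discrimination_score(heuristic, good_lines, bad_lines):
    """Fraction of (good, bad) pairs the heuristic ranks correctly."""
    pairs = [(g, b) for g in good_lines for b in bad_lines]
    return sum(heuristic(g) > heuristic(b) for g, b in pairs) / len(pairs)

def select_concept(candidates, good_lines, bad_lines, side_checks):
    """Pick the best-discriminating candidate, then 'make sure you are OK
    with that': run the other evaluations before adopting the concept."""
    best = max(candidates, key=lambda h: discrimination_score(h, good_lines, bad_lines))
    if not all(check(best) for check in side_checks):
        raise ValueError("best-discriminating concept failed a side check")
    return best

# Hypothetical usage: 'lines of inquiry' are strings, heuristics score them.
good = ["short lemma chain", "reuses known lemma"]
bad = ["brute case split over 900 cases"]
candidates = [len, lambda s: -len(s)]  # toy heuristics: prefer long vs. short
print(select_concept(candidates, good, bad, side_checks=[callable]))
```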

• I think the way to do exponential search in amplification without being exponentially slow is to not try to do the search in one amplification step, but to start with smaller problems, learn how to solve those efficiently, then use that knowledge to speed up the search in later iteration-amplification rounds.

Suppose we have some problem with branching factor 2 (i.e. searching for binary strings that fit some criteria).

Amplify agent A_0 to solve problems which require searching a tree of depth d, at cost O(2^d).

Distill agent A_1, which uses the output of the amplification process to learn how to solve problems of depth d faster than the amplified A_0, ideally as fast as any other ML approach. One way would be to learn heuristics for which parts of the tree don’t contain useful information and can be pruned.

Amplify agent A_1, which can use the heuristics it has learned to prune the tree much earlier and solve problems of depth 2d at a cost well below O(2^(2d)).

Distill agent A_2, which can now efficiently solve problems of depth 2d.

If this process is efficient enough, the total training cost can be less than O(2^n) to get an agent that solves problems of depth n (and the runtime cost is as good as the runtime cost of the ML algorithm that implements the distilled agent).
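
Here is a toy instantiation of this schedule, under strong simplifying assumptions: “solutions” are exactly the binary strings extending a hidden prefix, the amplified agent searches exhaustively, and “distillation” just records which subtrees contained no solutions so that later rounds can prune them. This illustrates the cost structure only; it is not Paul’s actual scheme:

```python
from itertools import product

HIDDEN = "1011"  # hypothetical structure for the heuristics to exploit

def is_solution(s):
    return s.startswith(HIDDEN[: len(s)])

def amplified_search(depth, pruned):
    """Search all depth-`depth` strings, skipping pruned prefixes.
    Returns the solutions found and the number of nodes visited (the cost)."""
    cost, found = 0, []
    for bits in product("01", repeat=depth):
        s = "".join(bits)
        if any(s.startswith(p) for p in pruned):
            continue
        cost += 1
        if is_solution(s):
            found.append(s)
    return found, cost

def distill(found, depth):
    """'Distill' a pruning heuristic: mark every depth-`depth` prefix with
    no solution below it as dead for future rounds."""
    live = {s[:depth] for s in found}
    return {"".join(b) for b in product("01", repeat=depth)} - live

pruned = set()
for d in (4, 8, 12):  # successive amplification rounds at growing depth
    found, cost = amplified_search(d, pruned)
    print(f"depth {d}: visited {cost} nodes (unpruned search would visit {2 ** d})")
    pruned |= distill(found, d)  # the distilled heuristics prune earlier next round
```

Each round pays roughly the pruned cost rather than 2^d, which is the sense in which the total training cost can stay below the cost of one naive deep search.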

• It seems brittle. If there’s miscommunication at any level of the hierarchy, you run the risk of breakage. Fatal miscommunications could happen as information travels either up or down the hierarchy.

It seems to me that the amplification scheme could include redundant processing/error correction (i.e. ask subordinates to solve a problem in several different ways, then look at whether they disagree, and either take a majority vote or flag disagreements as indicating that something dangerous is going on), and this could deal with this sort of problem.
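
A minimal sketch of this redundancy idea, with the subagents as hypothetical callables: pose the same subproblem to several subordinates, take the majority answer, and flag disagreement for review:

```python
from collections import Counter

def redundant_query(question, subagents, min_agreement=1.0):
    """Ask every subagent the same question; return the majority answer and
    flag any disagreement as a sign something may be wrong."""
    answers = [solve(question) for solve in subagents]
    (top, count), = Counter(answers).most_common(1)
    if count / len(answers) < min_agreement:
        # In a real scheme this might escalate to a human or halt entirely.
        print(f"warning: subagents disagree on {question!r}; flagging for review")
    return top

# Hypothetical usage with three toy subagents, one of them faulty.
agents = [lambda q: q.upper(), lambda q: q.upper(), lambda q: q[::-1]]
print(redundant_query("ack", agents))  # prints a warning, returns "ACK"
```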

The framework does not appear to have a clear provision for adapting its value learning to the presence/absence of decisive strategic advantage. The ideal FAI will slow down and spend a lot of time asking us what we want once decisive strategic advantage has been achieved. With your thing, it appears as though this would require an awkward retraining process.

It seems to me that balancing the risks of acting vs. taking time to ask questions depending on the current situation falls under Paul’s notion of corrigibility, so it would happen appropriately (as long as you maintain the possibility of asking questions as an output of the system, and the input appropriately describes the state of the world relevant to evaluating whether you have decisive strategic advantage).

• I would solve X-and-only-X in two steps:
First, given an agent and an action which has been optimized for undesirable consequence Y, we’d like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I’m calling informed oversight.
Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won’t, or enough understanding of its internals that you can see why it won’t. This is discussed in “Techniques for Optimizing Worst-Case Performance.”

Paul, I’m curious whether you’d see it as necessary for these techniques to work that the optimization target is pretty good/safe (but not perfect): i.e. some safety comes from the fact that the agents optimized for approval or imitation only have a limited class of Y’s that they might also end up being optimized for.
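
For concreteness, a rough sketch of the weight-sharing idea from the quote (a shared encoder feeding both a policy head and an explanation head, so the explainer reads the same representations that drove the action); the architecture, names, and dimensions are illustrative assumptions, not anything specified by Paul:

```python
import torch
import torch.nn as nn

class AgentWithExplainer(nn.Module):
    """Weak agent and explanation generator sharing an encoder, so the
    overseer's explainer has direct access to the agent's representations."""
    def __init__(self, obs_dim=32, hidden=64, n_actions=4, vocab=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # the weak agent
        self.explain_head = nn.Linear(hidden, vocab)     # explanation tokens

    def forward(self, obs):
        h = self.encoder(obs)  # shared weights: one representation for both heads
        return self.policy_head(h), self.explain_head(h)

model = AgentWithExplainer()
action_logits, explanation_logits = model(torch.randn(1, 32))
print(action_logits.shape, explanation_logits.shape)
```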

• Open Question: Severity of “Honest Mistakes”

In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without “deliberately searching” for it. The question is how bad these “honest mistakes” would end up being.

Paul: I also want to make the further claim that such failures are much less concerning than what-I’m-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).

• What if the current node is responsible for the error instead of one of the subqueries? How do you figure that out?

I think you’d need to form the decomposition in such a way that you could fix any problem through perturbing something in the world representation (an extreme version is you have the method for performing every operation contained in the world representation and looked up, so you can adjust it in the future).

When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is “most responsible” for the error, right? If you did this with meta-execution, wouldn’t it take an exponential amount of time?

One step of this method, as in backprop, has the same time complexity as the forward pass (running meta-execution forward, which I wouldn’t call exponential complexity, as I think the relevant baseline is the number of nodes in the meta-execution forward tree). You only need to process each node once (when the backprop signal for its output is ready), and need to do a constant amount of work at each node (figure out all the ways to perturb the node’s input).

The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.

And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn’t use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)

The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first-order gradient). If this causes problems, then you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. For discrete models like this it might be a bit more difficult: if you start to try out different combinations to see if they work, that’s where you’d get exponential complexity. But we’d get to counter this by potentially having cases where, based on understanding the operation, we could intelligently avoid some branches; I think this could potentially wash out to linear complexity in the number of forward nodes if it all works well.
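
A toy sketch of what this discrete analogue of backprop might look like: push a corrected answer from the root of a query tree down to its sub-answers, doing a constant amount of work per node, so one backward pass costs about as much as the forward tree. The `perturb` function standing in for the overseer’s judgement of “how should each sub-answer change?” is entirely hypothetical:

```python
class Node:
    def __init__(self, answer=None, children=()):
        self.answer = answer
        self.children = list(children)

def backprop_correction(node, target, perturb):
    """Visit each node once: record the corrected answer, then ask the
    overseer how every sub-answer should move (the analogue of adjusting
    all inputs at once via the first-order gradient) and recurse."""
    node.answer = target
    for child, sub_target in zip(node.children, perturb(node, target)):
        backprop_correction(child, sub_target, perturb)

# Hypothetical usage: answers are numbers, the root should sum its children,
# and the overseer splits the correction evenly between sub-answers.
root = Node(answer=5, children=[Node(2), Node(1)])
split_evenly = lambda node, target: [target / max(len(node.children), 1)] * len(node.children)
backprop_correction(root, 6, split_evenly)
print(root.answer, [c.answer for c in root.children])  # 6 [3.0, 3.0]
```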

I wonder if we’re on the right track at all, or if Paul has an entirely different idea about this.

So do I :)

• Huh, I hadn’t thought of this as trying to be a direct analogue of gradient descent, but now that I think about your comment, that seems like an interesting way to approach it.

A human debugging translation software could look at the return value of some high-level function and ask “is this return value sensible?” using their own linguistic intuition, and then if the answer is “no”, trace the execution of that function and ask the same question about each of the functions it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don’t see any way this kind of learning / error correction could work.

I think instead of asking “is this return value sensible?”, the debugging overseer process could start with some computation node where it knows what the return value should be (the final answer), and look at each of the subqueries of that node and ask for each subquery “how can I modify the answer to make the query answer more correct?”, then recurse into the subquery. This seems pretty analogous to gradient descent, with the potential advantage that the overseer’s understanding of the function at each node could be better than naively taking the gradient (understanding the operation could yield something that takes into account higher-order terms in the operation).

I’m curious now whether you could run a more efficient version of gradient descent if you replace the gradient at each step with an overseer human who can harness some intuition to try to do better than the gradient.
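
As a toy version of that question, one could compare a plain gradient step against a step proposed by an “overseer” at every iteration and keep whichever lowers the loss more. The proposal function below is a stand-in for human intuition, nothing more:

```python
import numpy as np

def overseer_descent(loss, grad, propose, x, lr=0.1, steps=50):
    """Gradient descent where each step is the better of the gradient step
    and an overseer-proposed step."""
    for _ in range(steps):
        candidates = [x - lr * grad(x), propose(x)]
        x = min(candidates, key=loss)  # greedy: take whichever step does better
    return x

# Example on a quadratic bowl; the hypothetical "overseer" halves the
# distance to the optimum, beating the small gradient step.
loss = lambda x: float(np.sum(x ** 2))
grad = lambda x: 2 * x
overseer = lambda x: x / 2
print(overseer_descent(loss, grad, overseer, np.array([3.0, -2.0])))
```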

• Brainstorming approaches to working with causal Goodhart

• Low-impact measures that include the change in the causal structure of the world. It might be possible to form a measure like this which doesn’t depend on recovering the true causal structure at any point (i.e. minimizing the difference between predictions of causal structure in state A and B, even if both of those predictions are wrong); a minimal sketch appears after this list.

• Figure out how to elicit human models of causal structure, and provide the human model of causal structure along with the metric, and the AI uses this information to figure out whether it’s violating the assumptions that the human made

• Causal transparency: have the AI explain the causal structure of how its plans will influence the proxy. This might allow a human to figure out whether the plan will cause the proxy to diverge from the goal. E.g. the true goal is happiness, the proxy is a happiness score as measured by an online psychological questionnaire, and the AI’s plan says that it will influence the proxy by hacking into the online psychological questionnaire. You don’t need to understand how the AI plans to hack into the server to understand that the plan is diverging the proxy from the goal.
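
Sketch for the first bullet above: penalise an action by how much it shifts the predicted causal structure of the world, without ever needing the true structure. The `predict_causal_graph` model is a hypothetical learned component returning an adjacency-weight matrix over a fixed set of variables; both of its predictions may be wrong, and the measure only looks at the difference between them:

```python
import numpy as np

def causal_impact(state_before, state_after, predict_causal_graph):
    """How much did the action rewire the *predicted* cause-effect graph?"""
    g_before = predict_causal_graph(state_before)
    g_after = predict_causal_graph(state_after)
    return float(np.abs(g_before - g_after).sum())

# Hypothetical usage: a stand-in model that reads a 2x2 "graph" off the state.
toy_model = lambda state: np.outer(state, state)
print(causal_impact(np.array([1.0, 0.0]), np.array([0.5, 0.5]), toy_model))
```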