• RL is typ­i­cally about se­quen­tial de­ci­sion-mak­ing, and I wasn’t sure where the “se­quen­tial” part came in).

I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision making problems. In this view, the feature that makes RL different from supervised learning is not that we’re trying to figure out how to act in an MDP/POMDP, but instead that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval learning case, it’s because the overseer is non-differentiable).

• Re: sce­nario 3, see The Evitable Con­flict, the last story in Isaac Asi­mov’s “I, Robot”:

“Stephen, how do we know what the ul­ti­mate good of Hu­man­ity will en­tail? We haven’t at our dis­posal the in­finite fac­tors that the Ma­chine has at its! Per­haps, to give you a not un­fa­mil­iar ex­am­ple, our en­tire tech­ni­cal civ­i­liza­tion has cre­ated more un­hap­piness and mis­ery than it has re­moved. Per­haps an agrar­ian or pas­toral civ­i­liza­tion, with less cul­ture and less peo­ple would be bet­ter. If so, the Machines must move in that di­rec­tion, prefer­ably with­out tel­ling us, since in our ig­no­rant prej­u­dices we only know that what we are used to, is good – and we would then fight change. Or per­haps a com­plete ur­ban­iza­tion, or a com­pletely caste-rid­den so­ciety, or com­plete an­ar­chy, is the an­swer. We don’t know. Only the Machines know, and they are go­ing there and tak­ing us with them.”
• Yeah, to some ex­tent. In the Lookup Table case, you need to have a (po­ten­tially quite ex­pen­sive) way of re­solv­ing all mis­takes. In the Overseer’s Man­ual case, you can also lev­er­age hu­mans to do some kind of more ro­bust rea­son­ing (for ex­am­ple, they can no­tice a typo in a ques­tion and still re­spond cor­rectly, even if the Lookup Table would fail in this case). Though in low-band­width over­sight, the space of things that par­ti­ci­pants could no­tice and cor­rect is fairly limited.

Though I think this still differs from HRAD in that it seems like the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned things into the Lookup Table (i.e. automatically add versions of all questions with typos that don’t change the meaning of a question into the Lookup Table, add 1000 different sanity check questions to flag that things can go wrong).

So I think there are ad­di­tional ways the sys­tem could cor­rect mis­taken rea­son­ing rel­a­tive to what I would think the out­put of HRAD would look like, but you do need to have pro­cesses that you think can cor­rect any way that rea­son­ing goes wrong. So the prob­lem could be less difficult than HRAD, but still tricky to get right.
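As a rough illustration of the "automatically add typo versions of questions" idea, here is a minimal sketch. Everything in it is an assumption made for illustration: the Lookup Table is modelled as a plain dict from question strings to answers, and "typos that don’t change the meaning" is approximated by swapping two adjacent letters, which will not always preserve meaning in practice. The names `transposition_typos` and `add_typo_variants` are hypothetical.

```python
def transposition_typos(question):
    """Return variants of `question` with one pair of adjacent letters swapped."""
    variants = set()
    for i in range(len(question) - 1):
        a, b = question[i], question[i + 1]
        # Only swap two different letters; digits, spaces and punctuation stay put.
        if a.isalpha() and b.isalpha() and a != b:
            variants.add(question[:i] + b + a + question[i + 2:])
    return variants


def add_typo_variants(table):
    """Return a copy of `table` where each typo variant maps to the original answer.

    Existing entries are never overwritten, so a variant that collides with a
    real question keeps the real question's answer.
    """
    expanded = dict(table)
    for question, answer in table.items():
        for variant in transposition_typos(question):
            expanded.setdefault(variant, answer)
    return expanded


# Hypothetical one-entry table, just to show the expansion.
lookup_table = {"What is 2 + 2?": "4"}
expanded_table = add_typo_variants(lookup_table)
```

With this toy table, a query such as "Waht is 2 + 2?" now resolves to the same answer as the original question. A real system would still need a human-reasoned check that each variant really preserves the meaning.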

• Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of “a pretty good future” that is fine with something like a bunch of human-descended beings living happy lives that misses out on the sort of things mentioned in Beyond Astronomical Waste, and “optimal future” which includes those considerations). I buy this as an argument that “we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value” rather than “our efforts to get a pretty good future are doomed unless we make tons of progress on this” or something like that.

“Thou­sands of mil­lions” was a typo.

• What is the mo­ti­va­tion for us­ing RL here?

I see the motivation as: given practical compute limits, it may be much easier to have the system find an action the overseer approves of instead of imitating the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.

• Would this still be a prob­lem if we were train­ing the agent with SL in­stead of RL?

Maybe this could hap­pen with SL if SL does some kind of large search and finds a solu­tion that looks good but is ac­tu­ally bad. The dis­til­led agent would then learn to iden­tify this ac­tion and re­pro­duce it, which im­plies the agent learn­ing some facts about the ac­tion to effi­ciently lo­cate it with much less com­pute than the large search pro­cess. Know­ing what the agent knows would al­low the over­seer to learn those facts, which might help in iden­ti­fy­ing this ac­tion as bad.

• I don’t un­der­stand why we want to find this X* in the imi­ta­tion learn­ing case.

Ah, with this example the intent was more like “we can frame what the RL case is doing as finding X*, let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.

The reverse mapping (imitation to RL) just consists of applying reward 1 to M2’s demonstrated behaviour (which could be “execute some safe search and return the results”), and reward 0 to everything else.

What is $p_M(X^*)$?

$p_M(X^*)$ is the probability of $M$ outputting $X^*$ (where $M$ is a stochastic policy).

$M_2(\text{“How good is answer X to Y?”}) \cdot \nabla \log(p_M(X))$

This is the REINFORCE gra­di­ent es­ti­ma­tor (which tries to in­crease the log prob­a­bil­ity of ac­tions that were rated highly)
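To make the estimator concrete, here is a minimal sketch under stated assumptions: M is taken to be a softmax policy over a small discrete set of candidate answers, and the overseer M2 is replaced by a hypothetical black-box scoring function `overseer_score` that returns 1 for one approved answer and 0 otherwise. None of these names or choices come from the original discussion; they only illustrate that the update uses the overseer's score without differentiating through it.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a list of probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_gradient(logits, score_fn, rng, num_samples):
    """Monte Carlo estimate of E[score(X) * grad log p_M(X)] with X sampled from M."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        x = rng.choices(range(len(logits)), weights=probs)[0]
        score = score_fn(x)  # black box: never differentiated
        for i, g in enumerate(grad_log_prob(probs, x)):
            grad[i] += score * g / num_samples
    return grad


def overseer_score(x):
    """Hypothetical stand-in for M2: approves of answer index 2 only."""
    return 1.0 if x == 2 else 0.0


rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]  # M starts uniform over four candidate answers
for _ in range(200):
    grad = reinforce_gradient(logits, overseer_score, rng, num_samples=50)
    logits = [v + 0.5 * g for v, g in zip(logits, grad)]  # gradient ascent step

final_probs = softmax(logits)
```

After training, `final_probs` puts most of its mass on the approved answer. The 0/1 score here is also the shape of reward used in the imitation-to-RL mapping described above (reward 1 for the demonstrated behaviour, 0 for everything else).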

• I guess the question was more from the perspective of: if the cost was zero then it seems like it would be worth running, so what part of the cost makes it not worth running (where I would think of cost as probably time to judge or availability of money to fund the contest).

• One important dimension to consider is how hard it is to solve philosophical problems well enough to have a pretty good future (which includes avoiding bad futures). It could be the case that this is not so hard, but fully resolving questions so we could produce an optimal future is very hard or impossible. It feels like this argument implicitly relies on assuming that “solve philosophical problems well enough to have a pretty good future” is hard (i.e. takes thousands of millions of years in scenario 4) - can you provide further clarification on whether/why you think that is the case?

• Slightly dis­ap­pointed that this isn’t con­tin­u­ing (though I didn’t sub­mit to the prize, I sub­mit­ted to Paul Chris­ti­ano’s call for pos­si­ble prob­lems with his ap­proach which was similarly struc­tured). Was hop­ing that once I got fur­ther into my PhD, I’d have some more things worth writ­ing up, and the recog­ni­tion/​a bit of prize money would provide some ex­tra mo­ti­va­tion to get them out the door.

What do you feel is the limiting resource that keeps this from being useful to continue in its current form?

• Yeah, this is a problem that needs to be addressed. It feels like in the Overseer’s Manual case you can counteract this by giving definitions/examples of how you want questions to be interpreted, and in the Lookup Table case this can be addressed by coordination within the team creating the lookup table.

• Do you think you’d agree with a claim of this form ap­plied to cor­rigi­bil­ity of plans/​poli­cies/​ac­tions?

That is: If some plan/policy/action is incorrigible, then A can provide some description of how the action is incorrigible.

• The better we can solve the key questions (“what are these ‘wiser’ versions?”, “how is the whole setup designed?”, “what questions exactly is it trying to answer?”), the better the wiser ourselves will be at their tasks.

I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress, on these questions. I agree with that assessment, but it felt on reading the post like the post was making the claim “Unless we fully specify a correct theory of human values, we are doomed”.

I think that I’d view some­thing like Paul’s in­di­rect nor­ma­tivity ap­proach as re­quiring that we do enough think­ing in ad­vance to get some crit­i­cal set of con­sid­er­a­tions known by the par­ti­ci­pat­ing hu­mans, but once that’s in place we should be able to go from this core set to get the rest of the con­sid­er­a­tions. And it seems pos­si­ble that we can do this with­out a fully-solved the­ory of hu­man value (but any the­o­ret­i­cal progress in ad­vance we can make on defin­ing hu­man value is quite use­ful).

• My in­ter­pre­ta­tion of what you’re say­ing here is that the over­seer in step #1 can do a lot of things to bake in hav­ing the AI in­ter­pret “help the user get what they re­ally want” in ways that get the AI to try to elimi­nate hu­man safety prob­lems for the step #2 user (pos­si­bly en­tirely), but prob­lems might still oc­cur in the short term be­fore the AI is able to think/​act to re­move those safety prob­lems.

It seems to me that this im­plies that IDA es­sen­tially solves the AI al­ign­ment por­tion of points 1 and 2 in the origi­nal post (mod­ulo things hap­pen­ing be­fore the AI is in con­trol).

• Cor­rect­ing all prob­lems in the sub­se­quent am­plifi­ca­tion stage would be a nice prop­erty to have, but I think IDA can still work even if it cor­rects er­rors with mul­ti­ple A/​D steps in be­tween (as long as all catas­trophic er­rors are caught be­fore de­ploy­ment). For ex­am­ple, I could think of the agent ini­tially us­ing some rules for how to solve math prob­lems where dis­til­la­tion in­tro­duces some mis­take, but later in the IDA pro­cess the agent learns how to red­erive those rules and re­al­izes the mis­take.

• Shorter name can­di­dates:

In­duc­tively Aligned AI Development

In­duc­tively Aligned AI Assistants

• It’s a nice prop­erty of this model that it prompts con­sid­er­a­tion of the in­ter­ac­tion be­tween hu­mans and AIs at ev­ery step (to high­light things like risks of the hu­mans hav­ing ac­cess to some set of AI sys­tems for ma­nipu­la­tion or moral haz­ard rea­sons).

• In the higher dimensional belief/reward space, do you think that it would be possible to significantly narrow down the space of possibilities (so this argument is saying “be Bayesian with respect to reward/beliefs, picking policies that work over a distribution”), or are you more pessimistic than that, thinking that the uncertainty would be so great in higher dimensional spaces that it would not be possible to pick a good policy?

• Open Ques­tion: Work­ing with con­cepts that the hu­man can’t understand

Ques­tion: when we need to as­sem­ble com­plex con­cepts by learn­ing/​in­ter­act­ing with the en­vi­ron­ment, rather than us­ing H’s con­cepts di­rectly, and when those con­cepts in­fluence rea­son­ing in sub­tle/​ab­stract ways, how do we re­tain cor­rigi­bil­ity/​al­ign­ment?

Paul: I don’t have any gen­eral an­swer to this, seems like we should prob­a­bly choose some ex­am­ple cases. I’m prob­a­bly go­ing to be ad­vo­cat­ing some­thing like “Search over a bunch of pos­si­ble con­cepts and find one that does what you want /​ has the de­sired prop­er­ties.”

E.g. for el­e­gant proofs, you want a heuris­tic that gives suc­cess­ful lines of in­quiry higher scores. You can ex­plore a bunch of con­cepts that do that, eval­u­ate each one ac­cord­ing to how well it dis­crim­i­nates good from bad lines of in­quiry, and also eval­u­ate other stuff like “What would I in­fer from learn­ing that a proof is el­e­gant other than that it will work” and make sure that you are OK with that.

An­dreas: Sup­pose you don’t have the con­cepts of “proof” and “in­quiry”, but learned them (or some more so­phis­ti­cated analogs) us­ing the sort of pro­ce­dure you out­lined be­low. I guess I’m try­ing to see in more de­tail that you can do a good job at “mak­ing sure you’re OK with rea­son­ing in ways X” in cases where X is far re­moved from H’s con­cepts. (Un­for­tu­nately, it seems to be difficult to make progress on this by dis­cussing par­tic­u­lar ex­am­ples, since ex­am­ples are nec­es­sar­ily about con­cepts we know pretty well.)

This may be re­lated to the more gen­eral ques­tion of what sorts of in­struc­tions you’d give H to en­sure that if they fol­low the in­struc­tions, the over­all pro­cess re­mains cor­rigible/​al­igned.

• Open Ques­tion: Sever­ity of “Hon­est Mis­takes”

In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without “deliberately searching” for it. The question is how bad these “honest mistakes” would end up being.

Paul: I also want to make the fur­ther claim that such failures are much less con­cern­ing than what-I’m-call­ing-al­ign­ment failures, which is a pos­si­ble dis­agree­ment we could dig into (I think Wei Dai dis­agrees or is very un­sure).