• I’d say that the claim is not suffi­cient—it might provide some al­ign­ment value, but it needs a larger story about how the whole com­pu­ta­tion is go­ing to be safe. I do think that the HCH frame­work could make spec­i­fy­ing an al­igned GOFAI-like com­pu­ta­tion eas­ier (but it’s hard to come up with a rigor­ous ar­gu­ment for this with­out point­ing to some kind of speci­fi­ca­tion that we can make claims about, which is some­thing I’d want to pro­duce along the way while pro­ceed­ing with HCH-like ap­proaches)

• I think a cleaner way of stat­ing con­di­tion 3 might be “there ex­ists some policy with a non-zero chance of de­cep­tion at some point in the fu­ture, with ex­pected value for the mesa-op­ti­mizer higher than the policy of always co­op­er­at­ing”.

This could be met if the threat of modification never goes away but decreases over time. (For example, at training time the system is monitored for defection every episode, while at deployment time it is monitored for defection in 1 in a million episodes; it is then not worthwhile to defect during training but worthwhile to defect at deployment time.)

This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer, and 2) if this happens, the base optimizer will be able to determine $O_{mesa}$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (I’m not sure if there’s any version of this that would work, since it seems hard to figure out $O_{mesa}$ and provide negative reward.)
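A minimal sketch of the arithmetic behind both scenarios, assuming a single defection, a cooperating baseline worth 0, and a made-up reward of 1. The function name and the penalty used in the decaying-monitoring case are assumptions for illustration, not part of the original condition.

```python
def defection_value(p_detect: float, reward: float, penalty: float) -> float:
    """Expected value of one defection, relative to a cooperating baseline of 0."""
    return (1.0 - p_detect) * reward - p_detect * penalty


REWARD = 1.0  # hypothetical payoff from an undetected defection

# Decaying monitoring: checked every episode in training, 1 in a million at deployment.
# The 100x penalty is an assumption borrowed from the second scenario.
ev_training = defection_value(p_detect=1.0, reward=REWARD, penalty=100 * REWARD)
ev_deployment = defection_value(p_detect=1e-6, reward=REWARD, penalty=100 * REWARD)

# Deterrence case: 1% detection chance, penalty 100x the best defecting reward.
ev_deterred = defection_value(p_detect=0.01, reward=REWARD, penalty=100 * REWARD)

print(ev_training, ev_deployment, ev_deterred)
```

Under these assumptions defection has negative value in training and in the deterrence case, and positive value at deployment. The deterrence margin is thin: with a 1% detection chance the break-even penalty is 99x the reward, so the 100x figure only just fails the condition.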

• To me, it seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it “Will you do things I don’t like if given more capability?” or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where it either acts according to its true incentives (manipulates the humans and is detected) or executes the treacherous turn (abstains from manipulating the humans so the developers will trust it more). So it seems like this wouldn’t happen if the developers are trying to test for treacherous turn behaviour during development.

• Are you in­ter­ested in pro­to­cols in­volv­ing mul­ti­ple epi­sodic ques­tions (where you ask one ques­tion, wait for it to re­solve, then ask an­other ques­tion?)

• Sub­mis­sion: low-band­width oracle

Plan Criticism: Given a plan to build an aligned AI, put together a list of possible lines of thought to think about problems with the plan (open questions, possible failure modes, criticisms, etc.). Ask the oracle to pick one of these lines of thought, pick another line of thought at random, and spend the next time period X thinking about both. Then judge which line of thought was more useful to think about (where lines of thought that spot some fatal missed problem are judged to be very useful) and reward the oracle if its suggestion was picked.
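One episode of this protocol could be sketched as below. This is only an illustration: the function and parameter names are hypothetical, the `judge` callable stands in for humans spending time period X on both lines of thought, and drawing the comparison line uniformly from the remaining options is an assumption.

```python
import random
from typing import Callable, Sequence


def plan_criticism_episode(
    lines_of_thought: Sequence[str],
    oracle_choice: int,
    judge: Callable[[str, str], str],
    rng: random.Random,
) -> float:
    """Run one episode and return the oracle's reward (1.0 or 0.0)."""
    suggested = lines_of_thought[oracle_choice]
    # Comparison line: drawn uniformly from everything the oracle did not pick.
    others = [t for i, t in enumerate(lines_of_thought) if i != oracle_choice]
    baseline = rng.choice(others)
    # The judge returns whichever of the two lines was more useful to think about.
    winner = judge(suggested, baseline)
    return 1.0 if winner == suggested else 0.0


def stub_judge(a: str, b: str) -> str:
    # Toy judge: prefers a line that spots a fatal problem, otherwise the first one.
    for line in (a, b):
        if "fatal" in line:
            return line
    return a


lines = ["line A", "line B (spots a fatal problem)"]
reward_good = plan_criticism_episode(lines, 1, stub_judge, random.Random(0))
reward_bad = plan_criticism_episode(lines, 0, stub_judge, random.Random(0))
```

The oracle's only output is an index into a fixed list, which is what keeps the channel low-bandwidth.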

• AI sys­tems end up con­trol­led by a group of hu­mans rep­re­sent­ing a small range of hu­man val­ues (ie. an ide­olog­i­cal or re­li­gious group that im­poses val­ues on ev­ery­one else). While not caused only by AI de­sign, it is pos­si­ble that de­sign de­ci­sions could im­pact the like­li­hood of this sce­nario (ie. at what point are val­ues loaded into the sys­tem/​how many peo­ple’s val­ues are loaded into the sys­tem), and is rele­vant for over­all strat­egy.

• Failure to learn how to deal with al­ign­ment in the many-hu­mans, many-AIs case even if sin­gle-hu­man, sin­gle-AI al­ign­ment is solved (which I think An­drew Critch has talked about). For ex­am­ple, AIs ne­go­ti­at­ing on be­half of hu­mans take the stance de­scribed in https://​​arxiv.org/​​abs/​​1711.00363 of agree­ing to split con­trol of the fu­ture ac­cord­ing to which hu­man’s pri­ors are most ac­cu­rate (on po­ten­tially ir­rele­vant is­sues) if this isn’t what hu­mans ac­tu­ally want.

• Maybe one AI philosophy service could look like this: it would ask you a bunch of other questions that are simpler than the problem of qualia, then show you what those answers imply about the problem of qualia if you use some method of reconciling those answers.

• Re: Philos­o­phy as in­ter­minable de­bate, an­other way to put the re­la­tion­ship be­tween math and philos­o­phy:

Philos­o­phy as weakly ver­ifi­able argumentation

Math is solving problems by looking at the consequences of a small number of axiomatic reasoning steps. For something to be math, we have to be able to ultimately cash out any proof as a series of these reasoning steps. Once something is cashed out in this way, it takes a small constant amount of time to verify any reasoning step, so we can verify the whole proof in time polynomial in its length.

Philos­o­phy is solv­ing prob­lems where we haven’t figured out a set of ax­io­matic rea­son­ing steps. Any non-ax­io­matic rea­son­ing step we pro­pose could end up hav­ing ar­gu­ments that we hadn’t thought of that would lead us to re­ject that step. And those ar­gu­ments them­selves might be un­der­mined by other ar­gu­ments, and so on. Each round of de­bate lets us add an­other level of counter-ar­gu­ments. Philoso­phers can make progress when they have some good pre­dic­tor of whether ar­gu­ments are good or not, but they don’t have ac­cess to cer­tain knowl­edge of ar­gu­ments be­ing good.

Another differ­ence be­tween math­e­mat­ics and philos­o­phy is that in math­e­mat­ics we have a well defined set of ob­jects and a well-defined prob­lem we are ask­ing about. Whereas in philos­o­phy we are try­ing to ask ques­tions about things that ex­ist in the real world and/​or we are ask­ing ques­tions that we haven’t crisply defined yet.

When we come up with a set of ax­ioms and a de­scrip­tion of a prob­lem, we can move that prob­lem from the realm of philos­o­phy to the realm of math­e­mat­ics. When we come up with some method we trust of ver­ify­ing ar­gu­ments (ie. repli­cat­ing sci­en­tific ex­per­i­ments), we can move prob­lems out of philos­o­phy to other sci­ences.

It could be the case that philos­o­phy grounds out in some rea­son­able set of ax­ioms which we don’t have ac­cess to now for com­pu­ta­tional rea­sons—in which case it could all end up in the realm of math­e­mat­ics. It could be the case that, for all prac­ti­cal pur­poses, we will never reach this state, so it will re­main in the “po­ten­tially un­bounded DEBATE round case”. I’m not sure what it would look like if it could never ground out—one model could be that we have a black box func­tion that performs a prob­a­bil­is­tic eval­u­a­tion of ar­gu­ment strength given counter-ar­gu­ments, and we go through some pro­cess to get the con­se­quences of that, but it never looks like “here is a set of ax­ioms”.

• I guess it feels like I don’t know how we could know that we’re in the po­si­tion that we’ve “solved” meta-philos­o­phy. It feels like the thing we could do is build a set of bet­ter and bet­ter mod­els of philos­o­phy and check their re­sults against held-out hu­man rea­son­ing and against each other.

I also don’t think we know how to spec­ify a ground truth rea­son­ing pro­cess that we could try to pro­tect and run for­ever which we could be com­pletely con­fi­dent would come up with the right out­come (where some­thing like HCH is a good can­di­date but po­ten­tially with bugs/​sub­tleties that need to be worked out).

I feel like I have some (not well justified and possibly motivated) optimism that this process yields something good fairly early on. We could gain confidence that we are in this world if we build a bunch of better and better models of meta-philosophy and observe that, at some point, the models continue agreeing with each other as we improve them, and that they agree with various instantiations of protected human reasoning that we run. If we are in this world, the thing we need to do is just spend some time building a variety of these kinds of models and produce an action that looks good to most of them. (Where agreement is not “comes up with the same answer” but more like “comes up with an answer that other models think is okay and not disastrous to accept”).
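As a rough sketch of what “produce an action that looks good to most of them” might mean operationally: each model is treated as a yes/no check on whether an action is okay and not disastrous to accept. All names here are hypothetical, the majority threshold is an assumption, and the set of models is assumed to be non-empty.

```python
from typing import Callable, Dict, Optional, Sequence

# A model maps a candidate action to True if it judges the action
# "okay and not disastrous to accept".
Model = Callable[[str], bool]


def pick_broadly_acceptable(
    candidates: Sequence[str],
    models: Dict[str, Model],
    min_fraction: float = 0.5,
) -> Optional[str]:
    """Return the candidate accepted by the largest share of models,
    or None if no candidate is accepted by more than `min_fraction` of them."""
    best, best_share = None, 0.0
    for action in candidates:
        share = sum(m(action) for m in models.values()) / len(models)
        if share > best_share:
            best, best_share = action, share
    return best if best_share > min_fraction else None


# Toy models: each one simply lists the actions it finds acceptable.
toy_models: Dict[str, Model] = {
    "model_1": lambda a: a in {"action_a", "action_b"},
    "model_2": lambda a: a in {"action_b"},
    "model_3": lambda a: a in {"action_b", "action_c"},
}
chosen = pick_broadly_acceptable(["action_a", "action_b", "action_c"], toy_models)
nothing = pick_broadly_acceptable(["action_a", "action_c"], toy_models)
```

Note that acceptance is asked of every model for the same action, rather than requiring the models to propose the same answer.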

Do you think this would lead to “good out­comes”? Do you think some ver­sion of this ap­proach could be satis­fac­tory for solv­ing the prob­lems in Two Ne­glected Prob­lems in Hu­man-AI Safety?

Do you think there’s a different kind of thing that we would need to do to “solve metaphilosophy”? Or do you think that working on “solving metaphilosophy” roughly cashes out as “work on coming up with better and better models of philosophy in the model I’ve described here”?

• A cou­ple ways to im­ple­ment a hy­brid ap­proach with ex­ist­ing AI safety tools:

Log­i­cal In­duc­tion: Spec­ify some com­pu­ta­tion­ally ex­pen­sive simu­la­tion of ideal­ized hu­mans. Run a log­i­cal in­duc­tor with the de­duc­tive pro­cess run­ning the simu­la­tion and out­putting what the hu­mans say af­ter time x in simu­la­tion, as well as state­ments about what non-ideal­ized hu­mans are say­ing in the real world. The in­duc­tor should be able to provide be­liefs about what the ideal­ized hu­mans will say in the fu­ture in­formed by in­for­ma­tion from the non-ideal­ized hu­mans.

HCH/IDA: The HCH-humans demonstrate a reasoning process which aims to predict the output of a set of idealized humans using all available information (which can include running simulations of idealized humans or information from real humans). The way that the HCH tree uses information about real humans involves looking carefully at their circumstances and asking things like “how do the real human’s circumstances differ from the idealized human’s?” and “is the information from the real human compromised in some way?”

• It seems like for Filtered-HCH, the ap­pli­ca­tion in the post you linked to, you might be able to do a weaker ver­sion where you la­bel any com­pu­ta­tion that you can’t un­der­stand in kN steps as prob­le­matic, only ac­cept­ing things you think you can effi­ciently un­der­stand. (But I don’t think Paul is ar­gu­ing for this weaker ver­sion).

• RL is typically about sequential decision-making, and I wasn’t sure where the “sequential” part came in.

I guess I’ve used the term “reinforcement learning” to refer to a broader class of problems including both one-shot bandit problems and sequential decision making problems. In this view, the feature that makes RL different from supervised learning is not that we’re trying to figure out how to act in an MDP/POMDP, but instead that we’re trying to optimize a function that we can’t take the derivative of (in the MDP case, it’s because the environment is non-differentiable, and in the approval learning case, it’s because the overseer is non-differentiable).

• Re: sce­nario 3, see The Evitable Con­flict, the last story in Isaac Asi­mov’s “I, Robot”:

“Stephen, how do we know what the ul­ti­mate good of Hu­man­ity will en­tail? We haven’t at our dis­posal the in­finite fac­tors that the Ma­chine has at its! Per­haps, to give you a not un­fa­mil­iar ex­am­ple, our en­tire tech­ni­cal civ­i­liza­tion has cre­ated more un­hap­piness and mis­ery than it has re­moved. Per­haps an agrar­ian or pas­toral civ­i­liza­tion, with less cul­ture and less peo­ple would be bet­ter. If so, the Machines must move in that di­rec­tion, prefer­ably with­out tel­ling us, since in our ig­no­rant prej­u­dices we only know that what we are used to, is good – and we would then fight change. Or per­haps a com­plete ur­ban­iza­tion, or a com­pletely caste-rid­den so­ciety, or com­plete an­ar­chy, is the an­swer. We don’t know. Only the Machines know, and they are go­ing there and tak­ing us with them.”

• Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer’s Manual case, you can also leverage humans to do some kind of more robust reasoning (for example, they can notice a typo in a question and still respond correctly, even if the Lookup Table would fail in this case). Though in low-bandwidth oversight, the space of things that participants could notice and correct is fairly limited.

Though I think this still differs from HRAD in that it seems like the out­put of HRAD would be a much smaller thing in terms of de­scrip­tion length than the Lookup Table, and you can buy ex­tra ro­bust­ness by adding many more hu­man-rea­soned things into the Lookup Table (ie. au­to­mat­i­cally add ver­sions of all ques­tions with ty­pos that don’t change the mean­ing of a ques­tion into the Lookup Table, add 1000 differ­ent san­ity check ques­tions to flag that things can go wrong).

So I think there are ad­di­tional ways the sys­tem could cor­rect mis­taken rea­son­ing rel­a­tive to what I would think the out­put of HRAD would look like, but you do need to have pro­cesses that you think can cor­rect any way that rea­son­ing goes wrong. So the prob­lem could be less difficult than HRAD, but still tricky to get right.
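To make the “add versions of all questions with typos” idea concrete, here is a minimal sketch that only generates adjacent-letter swaps. The function names are hypothetical, and the code cannot tell whether a swap changes the meaning of a question (e.g. “form” vs “from”), so that check would still have to come from human reasoning.

```python
from typing import Dict, List


def transposition_typos(question: str) -> List[str]:
    """All variants of `question` with one pair of adjacent letters swapped."""
    variants = []
    for i in range(len(question) - 1):
        a, b = question[i], question[i + 1]
        if a.isalpha() and b.isalpha() and a != b:
            variants.append(question[:i] + b + a + question[i + 2:])
    return variants


def add_typo_variants(table: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `table` where each typo variant maps to the original answer.
    Existing entries are never overwritten."""
    expanded = dict(table)
    for question, answer in table.items():
        for variant in transposition_typos(question):
            expanded.setdefault(variant, answer)
    return expanded


table = {"is it safe": "yes"}
expanded = add_typo_variants(table)
```

Using `setdefault` means a typo variant that collides with a question already in the table keeps the existing human-reasoned answer.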

• Thanks, this position makes more sense in light of Beyond Astronomical Waste (I guess I have some concept of “a pretty good future” that is fine with something like a bunch of human-descended beings living happy lives that misses out on the sort of things mentioned in Beyond Astronomical Waste, and “optimal future” which includes those considerations). I buy this as an argument that “we should put more effort into making philosophy work to make the outcome of AI better, because we risk losing large amounts of value” rather than “our efforts to get a pretty good future are doomed unless we make tons of progress on this” or something like that.

“Thou­sands of mil­lions” was a typo.

• What is the mo­ti­va­tion for us­ing RL here?

I see the motivation as follows: given practical compute limits, it may be much easier to have the system find an action the overseer approves of instead of imitating the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.

• Would this still be a prob­lem if we were train­ing the agent with SL in­stead of RL?

Maybe this could hap­pen with SL if SL does some kind of large search and finds a solu­tion that looks good but is ac­tu­ally bad. The dis­til­led agent would then learn to iden­tify this ac­tion and re­pro­duce it, which im­plies the agent learn­ing some facts about the ac­tion to effi­ciently lo­cate it with much less com­pute than the large search pro­cess. Know­ing what the agent knows would al­low the over­seer to learn those facts, which might help in iden­ti­fy­ing this ac­tion as bad.

• I don’t un­der­stand why we want to find this X* in the imi­ta­tion learn­ing case.

Ah, with this example the intent was more like “we can frame what the RL case is doing as finding X*, let’s show how we could accomplish the same thing in the imitation learning case (in the limit of unlimited compute)”.

The reverse mapping (imitation to RL) just consists of applying reward 1 to M2’s demonstrated behaviour (which could be “execute some safe search and return the results”), and reward 0 to everything else.

What is $p_M(X^*)$?

$p_M(X^*)$ is the probability of $M$ outputting $X^*$ (where $M$ is a stochastic policy)

$M2(\text{“How good is answer X to Y?”}) \cdot \nabla \log(p_M(X))$

This is the REINFORCE gra­di­ent es­ti­ma­tor (which tries to in­crease the log prob­a­bil­ity of ac­tions that were rated highly)
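A minimal sketch of this estimator for a toy softmax policy over three candidate answers. The policy parameterization, learning rate, and the `overseer_rating` stub standing in for M2’s rating are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(
    logits: List[float],
    rate_answer: Callable[[int], float],
    rng: random.Random,
    lr: float = 0.5,
) -> List[float]:
    """One REINFORCE update: sample X ~ p_M, then move the logits along
    rating(X) * grad log p_M(X)."""
    probs = softmax(logits)
    x = rng.choices(range(len(logits)), weights=probs)[0]
    rating = rate_answer(x)
    # For a softmax policy, d/d logit_i of log p(x) is 1[i == x] - p_i.
    return [
        z + lr * rating * ((1.0 if i == x else 0.0) - probs[i])
        for i, z in enumerate(logits)
    ]


def overseer_rating(answer_index: int) -> float:
    # Hypothetical stand-in for M2("How good is answer X to Y?").
    return 1.0 if answer_index == 2 else 0.0


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(500):
    logits = reinforce_step(logits, overseer_rating, rng)
final_probs = softmax(logits)
```

The update only ever uses the value of the rating, never its derivative, which is why it still works when the overseer is non-differentiable.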

• I guess the question was more from the perspective of: if the cost was zero then it seems like it would be worth running, so what part of the cost makes it not worth running (where I would think of cost as probably time to judge or availability of money to fund the contest).