False thermodynamic miracles

A putative new idea for AI control; index here. See also Utility vs Probability: idea synthesis.

Ok, here is the problem:

• You have to create an AI that believes (or acts as if it believed) that event X is almost certain, while you believe that X is almost impossible. Furthermore, you have to be right. To make things more interesting, the AI is much smarter than you, knows everything that you do (and more), and has to react sensibly when event X doesn’t happen.

Answers will be graded on mathematics, style, colours of ink, and compatibility with the laws of physics. Also, penmanship. How could you achieve this?

Acts and beliefs

The main trick is the parenthetical “(or acts as if it believed)”. If the AI actually has true beliefs, then there is no problem with it being smart, being knowledgeable, or updating on the fact that X didn’t happen. So the problem reduces to:

• How can an AI that believes X is almost impossible behave as if X were almost certain?

This will be constructed by developing on my idea for “utility indifference” (that old concept needs a better name, btw; corrigibility is a much better candidate for being labelled “utility indifference”).

Noisy events and thermodynamic miracles

Imagine that a signal is travelling down a wire inside a noisy gas (or some other source of randomness). Almost certainly the signal will get through, but very occasionally, by the spontaneous action of the molecules in the gas, the signal will get overwhelmed, erased and overwritten with another signal, randomly. This could be seen as a type of thermodynamic miracle: an incredibly unlikely event caused by spontaneous and coincidental coordination between lots of different molecules. We assume here that the AI cannot cause a thermodynamic miracle, nor make it more likely (this may require some specific definitions of what “counts” as a miracle).

The idea is to make the AI indifferent to any world in which there is no thermodynamic miracle. So give it a utility u such that u(w) = constant, for any world w without a thermodynamic miracle at the point of interest. The value of that constant doesn’t matter; the result is that the AI’s actions have no impact on utility in any world where the miracle didn’t happen. Thus, as long as there is a tiny chance that the miracle happened, the AI will behave as if that was a certainty: for only in those worlds do its actions have any impact.
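A toy numerical sketch of this construction (all probabilities, actions and utilities below are made up for illustration): because every no-miracle world contributes the same constant to expected utility, the argmax over actions is decided entirely by the miracle worlds, however unlikely they are.

```python
# Toy model: expected utility when all no-miracle worlds share a constant utility.
MIRACLE_PROB = 1e-30   # hypothetical chance of the thermodynamic miracle
C = 0.0                # constant utility of every world without the miracle

def utility(miracle, action):
    if not miracle:
        return C  # the AI's action is irrelevant in no-miracle worlds
    # hypothetical utilities in miracle worlds
    return {"act_as_if_miracle": 10.0, "act_normally": 1.0}[action]

def expected_utility(action):
    return (1 - MIRACLE_PROB) * utility(False, action) + \
           MIRACLE_PROB * utility(True, action)

best = max(["act_as_if_miracle", "act_normally"], key=expected_utility)
print(best)  # → act_as_if_miracle
```

However small MIRACLE_PROB is, the constant term cancels out of every comparison between actions, so the chosen action is whatever is best in the miracle worlds.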

You can probably see where I’m going with this. Suppose the signal goes through the wire to turn on another AI. That AI will only turn on given a very specific 100-digit “ON” binary code. Given any other code, the second AI won’t turn on, and will erase itself. We send the “ON” code.

So what happens? The first AI reasons that, given a thermodynamic miracle (the only worlds it cares about), the output signal is going to be 100 random bits, which have only a 2^-100 chance of being “ON”. Therefore it will behave as if it’s almost certain “ON” will not get through the gas.

And what will (almost certainly) happen in reality? We send the “ON” signal, and it gets through unchanged. The first AI observes this (in fact, it expected this). However, the only possible world that it cares about, now, is one in which the “ON” signal was overwritten by a thermodynamic miracle… into exactly the same “ON” signal. As long as the possibility of this is non-zero, the AI will behave as if it’s certain.

We should develop this point a little. The AI is almost certain, ahead of time, that “ON” will happen. It behaves, however, as if “ON” is almost certain not to happen. Then, after “ON” has gone through, it must adapt itself and backtrack on its previous behaviour. Moreover, it knows that this is what it will (almost certainly) do. Since it knows ahead of time that it will be backtracking, why does it still go ahead with its behaviour?

The answer is that most of the universes in which it backtracks are universes where its actions have no consequences, and hence it doesn’t care about them. So it might have a 1-ε chance of backtracking, but it only cares about a proportion ε(2^-100) of those worlds. On the other hand, it won’t backtrack with probability ε(1-2^-100), but it cares about those worlds completely.

False Thermodynamic Miracles

This is the explanation for the title: the AI will behave as if there was a real thermodynamic miracle (a spontaneous overwriting of a signal with the same signal), while we “know” that actually it was just the same signal going through.

By varying what we put at the other end of the wire, we can get the AI to act “as if” it believed almost anything, without introducing problematic beliefs. We just have to be a bit cautious with our setups, by considering how humans would react if there was a real thermodynamic miracle (because the AI is certainly going to be considering that very carefully).

This allows us to consider AI designs with counterfactuals, false “beliefs”, etc., which will be of great use. Are there any flaws I’m missing? Practical flaws are ok (e.g. how do you define a thermodynamic miracle and make sure the AI can’t affect it), but theoretical flaws are much more important at this stage.

In equations

Edit: (thanks to TylerJay)

In equation form, the AI is maximising

P(¬X)*C + P(X)*u(X, A),

for some constant C, some unlikely event X that the AI cannot affect, some set of relevant descriptors A, and some utility u. Since C is constant, this is exactly the same as maximising u(X, A): the probability P(X) is irrelevant.

The whole setup described is simply a way to ensure that if W is the likely set of worlds consistent with observations after ¬X/X, then

P(W) ≈ P(¬X) ≈ 1 (we “know” that X doesn’t happen and that we end up in W),

while

P(W|X) << 1 (in the worlds it cares about, the AI behaves as if W was incredibly unlikely to come about).

Same result, computationally

The above was a “physical” design, revolving around a particular physical setup, plus some value hacking. But we can do the same in a purely computational way, by adapting drnickbone’s design.

To adapt that design, assume Ω will run UDT/TDT agent A to get its estimation as to the likely probability of any one of 2^100 sequences of bits coming down the wire. It will then choose to send the signal that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.

It seems that A must assign equal probability 2^-100 to each sequence (as any increase in probability makes that sequence less likely to appear, and any decrease makes it more likely). Therefore, we “know” that the signal will be “0000...0000”, but the AI cannot assign a higher probability to that. This achieves the same outcome as above.
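A sketch of Ω’s selection rule, scaled down to 3-bit sequences so the space is cheap to enumerate (the 100-bit version is the same rule): if A reports the uniform distribution, every sequence ties and the lexicographic tie-break selects the all-zeros string; any non-uniform report is self-undermining, since the least-probable sequence is exactly the one Ω sends.

```python
from itertools import product

def omega_choice(prob):
    """Send the sequence the agent assigns lowest probability to;
    break ties lexicographically (scaled-down 3-bit version)."""
    seqs = ["".join(bits) for bits in product("01", repeat=3)]
    lowest = min(prob(s) for s in seqs)
    return min(s for s in seqs if prob(s) == lowest)

# Uniform report: all 8 sequences tie, so "000" is sent.
print(omega_choice(lambda s: 1 / 8))  # → 000

# Non-uniform report: the down-weighted sequence is the one that appears.
skewed = lambda s: 0.02 if s == "101" else 0.14
print(omega_choice(skewed))  # → 101
```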

I personally feel the physical design is more secure: it seems easier to define what a thermodynamic miracle at a given point in spacetime would be, than to be certain that the AI will not evade the computational restrictions somehow, or that the computational setup will remain stable and well defined. Also, A may not spread exactly equal probability over the 2^100 sequences, messing up our certainty as to the result.

• I’m uneasy about this for similar reasons to why I was uneasy about utility indifference. If the AI collects compelling evidence that a thermodynamic miracle did not occur, then it is possible that the hypotheses left in which the thermodynamic miracle did occur will be dominated by strange, complicated hypotheses (e.g. the existence of some sort of Cartesian demon trying to trick the AI into thinking that the thermodynamic miracle occurred), and the AI’s behavior may become erratic as a result.

• Yes, this is a concern. But it seems a solvable concern, once we have the principles right (and it almost certainly won’t be implemented as an actual wire in a gas cloud setup).

• This is really interesting. I thought I understood it and I wanted to verify that by trying to summarize it (and maybe help others too), but now I’m not so sure…

Edit: Just to save anybody the reading time, my reasoning is false below. After sleeping on it, I see my mistake. Nothing below the “False Thermodynamic Miracles” subheading made sense to me yesterday because I thought the purpose of the setup was to have an “off switch” on the simulated AI under the false belief (letting it see the result of the signal after some time period). I get it now though. Maximizing “[P(no miracle) * C] + [P(miracle) * u(A) given miracle]” is the same as maximizing “u(A) given miracle”. So the AI will act as if the miracle happened, because there’s no cost associated with those actions if it didn’t happen, only benefits if it did.

As I understand it, the basic premise is that the AI has a utility function u(w,A) where “w” is whether or not the specific thermodynamic miracle (TM) occurred and A represents a regular/normal input to a utility function (state of the universe, action taken, or whatever).

u(w,A) = C (a low constant) if w is false (TM did not happen), and u(w,A) is essentially “u(A) given w” when w is true (i.e. it’s a “normal” kind of utility function, dominated by things other than the TM, though it still has to incorporate the fact that the TM happened into its model of the world).

So, the signal is sent, and it calculates its expected utility for either scenario. If w = false, u(w,A) = a low constant, and if w = true, it determines an action A that maximizes the utility function u(A)|w. As long as 2^-100 * u(A) > C, it starts taking action A.

Is this right so far?

Then, after “ON” has gone through, it must adapt itself and backtrack on its previous behaviour.

This mechanism seems underspecified to me (though maybe it’s obvious to others). Why would it backtrack (or what do you mean by backtrack)? Eventually, it observes that w = false (that “ON” went through unchanged) and that its actions are no longer beneficial, so it just stops doing anything, right? The process terminates or it goes to standby?

If we want the AI to actually backtrack, wouldn’t we need to modify the utility function so that it takes a configuration parameter as well? u(w,A,X) s.t. X = true iff the AI is in its initial configuration, and set it so that u(F,A,F) = 0, but u(F,A,T) = C? (this probably counts as just an implementation detail, but I figured I’d mention it anyway)

The part that makes me feel like I didn’t actually understand it at all is:

However, the only possible world that it cares about, now, is one in which the “ON” signal was overwritten by a thermodynamic miracle… into exactly the same “ON” signal. As long as the possibility of this is non-zero, the AI will behave as if it’s certain.

Can you explain this a bit more? Do you mean that even after it observes “ON”, it doesn’t stop acting as if w = true? That P(TM that overwrites ON with ON) * u(A)|w > C? If that’s the case, then it would never backtrack, right? So it’s essentially a full simulation of an AI under the assumption w, but with the knowledge that w is incredibly unlikely, and no built-in halting condition?

Thanks

• C need not be a low constant, btw. The only requirement is that u(false, action a, A) = u(false, action b, A) for all actions a and b and all A; i.e. nothing the AI does affects the utility of worlds where w is false, so this does not constrain its actions.

Basically the AI observes the ON signal going through, and knows that either a) the signal went through normally, or b) the signal was overwritten by coincidence by exactly the same signal. Its actions have no consequences in the first case, so it ignores it, and acts “as if” it was certain there had been a thermodynamic miracle.

• Thanks. I understand now. Just needed to sleep on it, and today, your explanation makes sense.

Basically, the AI’s actions don’t matter if the unlikely event doesn’t happen, so it will take whatever actions would maximize its utility if the event did happen. This maximizes expected utility.

Maximizing [P(no TM) * C + P(TM) * u(TM, A)] is the same as maximizing u(A) under assumption TM.

• Maximizing [P(no TM) * C + P(TM) * u(TM, A)] is the same as maximizing u(A) under assumption TM.

Yes, that’s a clear way of phrasing it.

• Why would it backtrack (or what do you mean by backtrack)? Eventually, it observes that w = false (that “ON” went through unchanged) and that its actions are no longer beneficial, so it just stops doing anything, right? The process terminates or it goes to standby?

I think the presumption is that the case where the “ON” signal goes through normally and the case where the “ON” signal is overwritten by a thermodynamic miracle… into exactly the same “ON” signal are equivalent. That is, after the “ON” signal has gone through, the AI would behave identically to an AI that was not indifferent to worlds where the thermodynamic miracle did not occur.

The reason for this is that although the chance that the “ON” signal was overwritten into exactly the same “ON” signal is tiny, it is the only remaining possible world that the AI cares about, so it will act as if that is what it believes.

• I am fairly confident that I understand your intentions here. A quick summary, just to test myself:

HAL cares only about world states in which an extremely unlikely thermodynamic event occurs, namely the world in which one hundred random bits are generated spontaneously during a specific time interval. HAL is perfectly aware that these are unlikely events, but cannot act in such a way as to make the event more likely. HAL will therefore increase total utility over all possible worlds where the unlikely event occurs, and otherwise ignore the consequences of its choices.

This time interval corresponds by design with an actual signal being sent. HAL expects the signal to be sent, with a very small chance that it will be overwritten by spontaneously generated bits and thus be one of the worlds where it wants to maximize utility. Within the domain of world states that the machine cares about, the string of bits is random. There is a string among all these world states that corresponds to the signal, but it is the world where that signal is generated randomly by the spontaneously generated bits. Thus, within the domain of interest to HAL, the signal is extremely unlikely, whereas within all domains known to HAL, the signal is extremely likely to occur by means of not being overwritten in the first place. Therefore, the machine’s behavior will treat the actual signal in a counterfactual way despite HAL’s object-level knowledge that the signal will occur with high probability.

If that’s correct, then it seems like a very interesting proposal!

I do see at least one difference between this setup and a legitimate counterfactual belief. In particular, you’ve got to worry about behavior in which a (1-epsilon) fraction of all possible worlds have a constant utility. It may not be strictly equivalent to the simple counterfactual belief. Suppose, in a preposterous example, that there exists some device which marginally increases your ability to detect thermodynamic miracles (or otherwise increases your utility during such a miracle); unfortunately, if no thermodynamic miracle is detected, it explodes and destroys the Earth. If you simply believe in the usual way that a thermodynamic miracle is very likely to occur, you might not want to use the device, since it’s got catastrophic consequences for the world where your expectation is false. But if the non-miraculous world states are simply irrelevant, then you’d happily use the device.

As I think about it, I think maybe the real weirdness comes from the fact that your AI doesn’t have to worry about the possibility of it being wrong about there having been a thermodynamic miracle. If it responds to the false belief that a thermodynamic miracle has occurred, there can be no negative consequences.

It can account for the ‘minimal’ probability that the signal itself occurs, of course; that’s included in the ‘epsilon’ domain of worlds that it cares about. But when the signal went through, the AI would not necessarily be acting in a reasonable way on the probability that this was a non-miraculous event.

• Yep, that’s pretty much it.

• To adapt that design, assume Ω will run UDT/TDT agent A to get its estimation as to the likely probability of any one of 2^100 sequences of bits coming down the wire. It will then choose to send the signal that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.

Any probability mass at all assigned to the hypothesis that the wire’s sequence doesn’t follow this rule ends up breaking the scheme. Also, I think the AI can answer with probability distributions that are well defined, but for which this selection is intractable.

• Could you maybe add some more explanation of how the stated problem is relevant for AI control? It’s not obvious to me from the outset why I care about duping an AI.

• Many approaches can be used if you can use counterfactuals or “false” information in the AI. Such as an AI that doesn’t “believe” that a particular trigger is armed, and then gets caught by that trigger as it defects without first neutralising it.

There’s a lot of stuff coming that uses that, implicitly or explicitly. See http://lesswrong.com/lw/lt6/newish_ai_control_ideas/

• One high-level concern. If I’m reading this right (and please let me know if I’m not), then this is talking about handling counterfactuals by spawning a copy of the AI with a different utility function.

Just on general principles, spawning a copy of the AI with a different utility function seems really, really dangerous. The new copy would be motivated to trade off anything at all in the no-thermodynamic-miracle scenario in order to increase utility in the event of a thermodynamic miracle. In particular, if the AI were boxed (as we might expect for counterfactual processing) then it would presumably try to get out of that box.

• Here’s my explanation of it. Let me know if this helps with your concerns at all:

Imagine we have an AI design we want to test. Call this AI C3PO, and let its utility function be U(A) where A is a world-state from the set of all possible world-states. And let the super-unlikely-event-happening-at-the-specified-time described in the post be w, such that w = true if it happens and w = false if it doesn’t happen. Then let A* be a world state in the subset of all world-states A in which w = true. Basically, A* is A given that w happened (this is how we simulate a “false belief”, by only allowing the AI to consider worlds in which w = true). Finally, let C be a constant.

The proposal is that we create a variant of C3PO, C3PO*, that has the utility function:

U*(A) = P(!w) * C + P(w) * U(A*)

If the AI is boxed such that it cannot affect the probability of w occurring and it cannot modify its own utility function, then maximizing U* is exactly the same as maximizing U once event w has occurred (i.e. with false belief w). In this way, we are able to perfectly simulate C3PO to find out what it would do if w were true, but we don’t actually have to convince it that w is true.

• Right, that much makes sense. The problem is the “perfectly simulate C3PO” part toward the end. If we really want to see what it would do, then we need a perfect simulation of the environment in addition to C3PO* itself. Any imperfection, and C3PO* might realize it’s in a simulated environment. All else equal, once C3PO* knows it’s in a simulated environment, it would presumably try to get out. Since its utility function is different from C3PO’s, it would sometimes be motivated to undermine C3PO (or us, if we’re the ones running the simulation).

• Just remember that this isn’t a boxing setup. This is just a way of seeing what an AI will do under a false belief. From what I can tell, the concerns you brought up about it trying to get out aren’t any different between the scenario when we simulate C3PO* and when we simulate C3PO. The problem of making a simulation indistinguishable from reality is a separate issue.

• One way to make an AI believe a claim that we know is false is a situation where disproving the claim requires much more computational complexity than suggesting it. For example, it is a very cheap thought for me to suggest that we live in a vast computer simulation with probability 0.5. But to disprove this claim with probability one, the AI may need a lot of thinking, maybe more than the total maximum possible computational power of the universe allows.

• What if the thermodynamic miracle has no effect on the utility function because it occurs elsewhere? Taking the same example, the AI simulates sending the signal down the ON wire… and it passes through, but the 0s that came after the signal are miraculously turned into 0s.

This way the AI does indeed care about what happens in this universe. Assuming that the AI wants to turn on the second AI, it could have sent another signal down the ON wire, and then end up simulating failure due to any kind of thermodynamic miracle; or it could have sent the ON signal, and ALSO simulate success, but only when the thermodynamic miracle appears after the last bit is transmitted (or before the first bit is transmitted). So it no longer behaves as if it believes sending a signal down the wire accomplishes anything at all, but instead that sending the ON signal down the wire has a higher utility.

This probably means that I don’t understand what you mean… How does this problem not arise in the model you have in your head?

• What if the thermodynamic miracle has no effect on the utility function because it occurs elsewhere?

Where it occurs, and other such circumstances and restrictions, need to be part of the definition for this setup.

• This is basically telling the AI that it should accept a Pascal’s Wager.

• Not really. There is no huge expected utility reward to compensate for the low probability, and the setup is very specific (not a general “accept Pascal’s wagers”).

• I’m nervous about designing elaborate mechanisms to trick an AGI, since if we can’t even correctly implement an ordinary friendly AGI without bugs and mistakes, it seems even less likely we’d implement the weird/clever AGI setups without bugs and mistakes. I would tend to focus on just getting the AGI to behave properly from the start, without need for clever tricks, though I suppose that limited exploration into more fanciful scenarios might yield insight.

• The AGI does not need to be tricked: it knows everything about the setup, it just doesn’t care. The point of this is that it allows a lot of extra control methods to be considered, if friendliness turns out to be as hard as we think.

• Fair enough. I just meant that this setup requires building an AGI with a particular utility function that behaves as expected and building extra machinery around it, which could be more complicated than just building an AGI with the utility function you wanted. On the other hand, maybe it’s easier to build an AGI that only cares about worlds where one particular bitstring shows up than to build a friendly AGI in general.

• One naive and useful security precaution is to only make the AI care about worlds where the high explosives inside it won’t actually ever detonate… (and place someone ready to blow them up if the AI misbehaves).

There are other, more general versions of that idea, and other uses to which this can be put.

• I guess you mean that the AGI would care about worlds where the explosives won’t detonate even if the AGI does nothing to stop the person from pressing the detonation button. If the AGI only cared about worlds where the bomb didn’t detonate for any reason, it would try hard to stop the button from being pushed.

But to make the AGI care about only worlds where the bomb doesn’t go off even if it does nothing to avert the explosion, we have to define what it means for the AGI to “try to avert the explosion” vs. just doing ordinary actions. That gets pretty tricky pretty quickly.

Anyway, you’ve convinced me that these scenarios are at least interesting. I just want to point out that they may not be as straightforward as they seem once it comes time to implement them.

• we have to define what it means for the AGI to “try to avert the explosion” vs. just doing ordinary actions. That gets pretty tricky pretty quickly.

We don’t actually have to do that. We set it up so the AI only cares about worlds in which a certain wire in the detonator doesn’t pass the signal through, so the AI has no need to act to remove the explosives or prevent the button from being pushed. Now, it may do those for other reasons, but not specifically to protect itself.

Or another example: an oracle that only cares about worlds in which its output message is not read: http://lesswrong.com/r/discussion/lw/mao/an_oracle_standard_trick/

• Couldn’t you just send one bit X (1 means on, 0 means off), which is most likely 1 but could turn into 0 due to noise, and define the utility u* in the same way as for corrigibility? That is,

u*(A_1, 0, A_2) = u(A_1, 0, A_2)

u*(A_1, 1, A_2) = u(A_1, 1, A_2) + E_{A_2'} u(A_1, 0, A_2') - E_{A_2'} u(A_1, 1, A_2')

Here A_1 denotes what happens in the world before the signal is sent, and A_2 what happens afterwards. This way you only use 1 bit rather than 100, and there is no longer a contribution of 2^{-100} from the case where there is a thermodynamic miracle that turns the on-signal into the on-signal (and you don’t have to worry about the distribution of the signal given a thermodynamic miracle). The oracle will optimize u given that X = 0 until X is revealed. When that is revealed, we will most likely have X = 1, and the oracle will optimize u given X = 1 (if the oracle is still running). Does the above idea achieve something more?
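A numerical sketch of the proposed correction (the base utility and world samples below are hypothetical stand-ins): the compensating term E_{A_2'} u(A_1, 0, A_2') - E_{A_2'} u(A_1, 1, A_2') makes the agent's expected u* identical whether X = 0 or X = 1, so before X is revealed it has no incentive to care which value comes through, and simply optimises u given X = 0.

```python
import random

random.seed(0)
future_worlds = [random.random() for _ in range(10_000)]  # stand-in A_2 samples

def u(x, a2):
    """Hypothetical base utility: worlds with X = 1 are worth 2 extra units."""
    return a2 + (2.0 if x == 1 else 0.0)

E_u0 = sum(u(0, a) for a in future_worlds) / len(future_worlds)
E_u1 = sum(u(1, a) for a in future_worlds) / len(future_worlds)

def u_star(x, a2):
    if x == 0:
        return u(0, a2)
    return u(1, a2) + E_u0 - E_u1  # corrigibility-style compensating term

E_star0 = sum(u_star(0, a) for a in future_worlds) / len(future_worlds)
E_star1 = sum(u_star(1, a) for a in future_worlds) / len(future_worlds)
assert abs(E_star0 - E_star1) < 1e-9  # indifferent to the value of X
print("expected u* equal under X = 0 and X = 1")
```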

• This seems to work. The difference is how the revelation of X is handled. I’m not sure which approach is better, or in which circumstances.