False thermodynamic miracles

A putative new idea for AI control; index here. See also Utility vs Probability: idea synthesis.

Ok, here is the problem:

  • You have to create an AI that believes (or acts as if it believed) that event X is almost certain, while you believe that X is almost impossible. Furthermore, you have to be right. To make things more interesting, the AI is much smarter than you, knows everything that you do (and more), and has to react sensibly when event X doesn’t happen.

Answers will be graded on mathematics, style, colours of ink, and compatibility with the laws of physics. Also, penmanship. How could you achieve this?

Acts and beliefs

The main trick is the parenthetical “(or acts as if it believed)”. If the AI actually has true beliefs, then there is no problem with it being smart, being knowledgeable, or updating on the fact that X didn’t happen. So the problem reduces to:

  • How can an AI that believes X is almost impossible behave as if X were almost certain?

This will be constructed by building on my idea of “utility indifference” (that old concept needs a better name, btw; corrigibility is a much better candidate for being labelled “utility indifference”).

Noisy events and thermodynamic miracles

Imagine that a signal is travelling down a wire inside a noisy gas (or some other source of randomness). Almost certainly the signal will get through, but very occasionally, by the spontaneous action of the molecules in the gas, the signal will get overwhelmed, erased and overwritten with another signal, randomly. This could be seen as a type of thermodynamic miracle: an incredibly unlikely event caused by spontaneous and coincidental coordination between lots of different molecules. We assume here that the AI cannot cause a thermodynamic miracle, nor make it more likely (this may require some specific definitions of what “counts” as a miracle).

The idea is to make the AI indifferent to any world in which there is no thermodynamic miracle. So give it a utility u such that u(w) = constant, for any world w without a thermodynamic miracle at the point of interest. The value of that constant doesn’t matter; the result is that the AI’s actions have no impact on utility in any world where the miracle didn’t happen. Thus, as long as there is a tiny chance that the miracle happened, the AI will behave as if that were a certainty: for only in those worlds do its actions have any impact.
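
Here is a minimal toy sketch of that construction (the action names, payoff numbers and probabilities below are invented purely for illustration; they are not part of the design):

```python
# Toy model: u(w) is a fixed constant on every non-miracle world, so the
# expected-utility ranking of actions is decided entirely by miracle-worlds,
# however small their probability.

EPSILON = 1e-30    # tiny prior probability of a thermodynamic miracle (illustrative)
CONSTANT = 0.0     # u(w) on all non-miracle worlds; its actual value is irrelevant

ACTIONS = ["act_as_if_miracle", "act_as_if_no_miracle"]

def u(miracle, action):
    """Utility that is indifferent to every world without a miracle."""
    if not miracle:
        return CONSTANT
    # Made-up payoffs for the worlds the AI does care about:
    return {"act_as_if_miracle": 10.0, "act_as_if_no_miracle": 1.0}[action]

def expected_utility(action):
    return (1 - EPSILON) * u(False, action) + EPSILON * u(True, action)

# The unconditional argmax coincides with the argmax conditional on a miracle,
# for any EPSILON > 0 and any value of CONSTANT.
assert max(ACTIONS, key=expected_utility) == max(ACTIONS, key=lambda a: u(True, a))
print(max(ACTIONS, key=expected_utility))  # act_as_if_miracle
```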

You can probably see where I’m going with this. Suppose the signal goes through the wire to turn on another AI. That AI will only turn on given a very specific 100-digit “ON” binary code. Given any other code, the second AI won’t turn on, and will erase itself. We send the “ON” code.

So what happens? The first AI reasons that, given a thermodynamic miracle (the only worlds it cares about), the output signal is going to be 100 random bits, which have only a 2^-100 chance of being “ON”. Therefore it will behave as if it’s almost certain “ON” will not get through the gas.

And what will (almost certainly) happen in reality? We send the “ON” signal, and it gets through unchanged. The first AI observes this (in fact, it expected this). However, the only possible world that it cares about, now, is one in which the “ON” signal was overwritten by a thermodynamic miracle… into exactly the same “ON” signal. As long as the probability of this is non-zero, the AI will behave as if it’s certain.

We should develop this point a little. The AI is almost certain, ahead of time, that “ON” will happen. It behaves, however, as if “ON” is almost certain not to happen. Then, after “ON” has gone through, it must adapt itself and backtrack on its previous behaviour. Moreover, it knows that this is what it will (almost certainly) do. Since it knows ahead of time that it will be backtracking, why does it still go ahead with its behaviour?

The answer is that most of the universes in which it backtracks are universes where its actions have no consequences, and hence it doesn’t care about them. So it might have a 1-ε chance of backtracking, but it only cares about a proportion ε·2^-100 of those worlds. On the other hand, it won’t backtrack with probability ε(1-2^-100), but it cares about those worlds completely.
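
Writing ε for the prior probability of a miracle at the wire, the bookkeeping behind that argument is simply:

```latex
% Probability bookkeeping behind the backtracking argument
% (\epsilon = prior probability of a thermodynamic miracle at the wire).
\begin{align*}
P(\text{``ON'' arrives, so the AI backtracks}) &= (1-\epsilon) + \epsilon\,2^{-100} \approx 1-\epsilon,\\
P(\text{it backtracks \emph{and} cares, i.e.\ a miracle rewrote ``ON'' into ``ON''}) &= \epsilon\,2^{-100},\\
P(\text{it does not backtrack, i.e.\ a miracle produced some other code}) &= \epsilon\,(1-2^{-100}).
\end{align*}
```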

False Thermodynamic Miracles

This is the explanation for the title: the AI will behave as if there was a real thermodynamic miracle (a spontaneous overwriting of a signal with the same signal), while we “know” that actually it was just the same signal going through.

By varying what we put at the other end of the wire, we can get the AI to act “as if” it believed almost anything, without introducing problematic beliefs. We just have to be a bit cautious with our setups, by considering how humans would react if there were a real thermodynamic miracle (because the AI is certainly going to be considering that very carefully).

This allows us to consider AI designs with counterfactuals, false “beliefs”, etc., which will be of great use. Are there any flaws I’m missing? Practical flaws are ok (e.g. how do you define a thermodynamic miracle and make sure the AI can’t affect it), but theoretical flaws are much more important at this stage.

In equations

Edit: (thanks to TylerJay)

In equation form, the AI is maximising

P(¬X)*C + P(X)*u(X, A),

for some constant C, some unlikely event X that the AI cannot affect, some set of relevant descriptors A, and some utility u. Since C is constant and the AI cannot change P(X), this is exactly the same as maximising u(X, A): the probability P(X) is irrelevant.
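
Spelling that out with the AI’s choice a made explicit (a notational liberty of mine; a only enters u through the descriptors A):

```latex
% Since C is a constant and the AI's choice a cannot change P(X),
% maximising over a ignores the \lnot X term and the factor P(X):
\[
\arg\max_{a}\ \bigl[P(\lnot X)\,C + P(X)\,u(X, A(a))\bigr]
  \;=\; \arg\max_{a}\ u(X, A(a)),
\qquad \text{whenever } P(X) > 0.
\]
```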

The whole setup described is simply a way to ensure that, if W is the likely set of worlds consistent with observations after ¬X/X, then

P(W) ≈ P(¬X) ≈ 1 (we “know” that X doesn’t happen and that we end up in W),

while

P(W|X) << 1 (in the worlds it cares about, the AI behaves as if W was incredibly unlikely to come about).
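
In the concrete setup above, with X the thermodynamic miracle and W the worlds in which “ON” arrives at the far end of the wire, these two conditions are just the earlier arithmetic again:

```latex
% X = thermodynamic miracle at the wire; W = worlds where "ON" arrives.
\begin{align*}
P(W)        &= P(\lnot X) + P(X)\,2^{-100} \;\approx\; P(\lnot X) \;\approx\; 1,\\
P(W \mid X) &= 2^{-100} \;\ll\; 1.
\end{align*}
```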

Same result, computationally

The above was a “physical” design, revolving around a particular physical setup, plus some value hacking. But we can do the same in a purely computational way, by adapting drnickbone’s design.

To adapt that design, assume Ω will run UDT/TDT agent A to get its estimation as to the likely probability of any one of the 2^100 possible sequences of bits coming down the wire. It will then choose to send the signal that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.

It seems that A must assign equal probability 2^-100 to each sequence (as any increase in probability makes that sequence less likely to appear, and any decrease makes it more likely). Therefore, we “know” that the signal will be “0000...0000”, but the AI cannot assign a higher probability to that. This achieves the same outcome as above.
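
A toy mock-up of Ω’s selection rule, with 3-bit codes standing in for the 100-bit ones (the reporting “agent” here is just a hard-coded distribution, not a real UDT/TDT agent):

```python
from itertools import product

BITS = 3  # stand-in for 100; enumerating 2**100 codes is obviously infeasible
CODES = ["".join(bits) for bits in product("01", repeat=BITS)]

def omega_choice(report):
    """Send the code the agent rated least likely (lexicographically first on ties)."""
    return min(CODES, key=lambda code: (report[code], code))

# A uniform report leaves every code tied, so Omega sends the
# lexicographically first one, "000...0".
uniform = {code: 1 / len(CODES) for code in CODES}
print(omega_choice(uniform))   # 000

# A non-uniform report is self-defeating: whichever code the agent
# downweights is exactly the code Omega ends up sending.
skewed = {code: 1 / len(CODES) for code in CODES}
skewed["000"] += 0.05
skewed["111"] -= 0.05
print(omega_choice(skewed))    # 111
```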

I personally feel the physical design is more secure: it seems easier to define what a thermodynamic miracle at a given point in spacetime would be, than to be certain that the AI will not evade the computational restrictions somehow, or that the computational setup will remain stable and well defined. Also, A may not spread exactly equal probability over the 2^100 sequences, messing up our certainty as to the result.