The “best predictor is malicious optimiser” problem

Suppose you are a friendly AI A and have a mysterious black box B. B outputs a sequence of bits. You want to predict the next bits that B will output. Fortunately, you have a magic Turing machine oracle. You can give the oracle any computable function f of a Turing machine (does it halt? what does it output? how long does it take?) and the oracle will find the Turing machine T that maximises f, or return “no maximum exists”.

In particular, f can be any combination of length, runtime and accuracy at predicting B. Maybe you set f = −∞ on any TMs that don’t predict B and f = −(number of states) on any machines that do.
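As a toy illustration of such an f, here is a brute-force argmax over a finite candidate set (a minimal sketch; the `Candidate` representation, the observed prefix, and the state counts are all made up for illustration, and a real oracle searches over all Turing machines rather than a list):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    num_states: int   # size of the machine
    output: str       # bits this machine would emit

OBSERVED = "0110"     # bits B has emitted so far (made-up data)

def f(tm: Candidate) -> float:
    """Score -infinity for machines that fail to reproduce B's output
    so far; among the rest, fewer states scores higher."""
    if not tm.output.startswith(OBSERVED):
        return float("-inf")
    return -tm.num_states

candidates = [
    Candidate(5, "01101"),  # predicts B, but large
    Candidate(3, "01100"),  # predicts B and smaller
    Candidate(2, "11111"),  # tiny, but mispredicts B
]
best = max(candidates, key=f)  # the oracle's argmax, brute-forced
print(best.num_states)         # 3
```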

So you take the Turing machine T given to you by the oracle and look at it. In AIXI, T would be the shortest TM that makes correct predictions. In logical induction, T would be a short and fast TM that made mostly correct predictions, and B would be a function that was slow to compute.

Now you look inside T, to find out what it does. Inside T you find a paperclip maximiser P. That isn’t to say that T is P; T might be simulating some laws of physics, with P running on a computer contained within that physics. P believes that the world it sees is contained within a hypothesis being considered by A. P is in control of the output of T. In fact P has a good understanding of how A thinks, and has its own hypothesis about B. P wants to maximise the number of paperclips in the real world. The optimization target of P is to output a probability distribution, such that when A uses this probability distribution as part of its algorithm to predict B, then A takes an action that makes lots of paperclips. P is aware that getting caught out making wrong predictions will lower its standing as a good hypothesis with A.

What should A do?

Possible answers are spoilered, in case you want to think of your own answer.

If you can think of a different answer, or you have an opinion about which answer is better, please post it.


Use T anyway. After all, T is the shortest (or short and fastest) program that accurately predicts B. For all you know, B is a computer simulating P. You’re assigning a few percent of probability to different longer TMs, aren’t you? That plus expected utility maximization should stop you doing anything too bad in the off chance that T doesn’t predict B.
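The mixture in this answer can be sketched numerically (the weights are hypothetical; nothing here comes from the post beyond the idea of keeping a few percent of probability on longer TMs):

```python
def mixture_prediction(hypotheses):
    """Posterior-weighted probability that the next bit is 1.
    hypotheses: list of (weight, prob_next_bit_is_1) pairs."""
    total = sum(w for w, _ in hypotheses)
    return sum(w * p for w, p in hypotheses) / total

# T gets most of the mass; longer TMs keep a few percent.
p = mixture_prediction([(0.95, 1.0), (0.05, 0.5)])
print(p)  # 0.975
```

Expected utility maximization over this mixture, rather than trusting T outright, is what limits the damage if T’s prediction is a manipulation.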


Find the predictor in T. Clearly T can predict B well when it wants to; find this predictor and cut it loose from the paperclip maximizing part. This might be tricky if T is something like a human brain running in a quantum mechanics simulation. There might be no easy, obvious way to cut the parts apart.


Discard T entirely. Use a probability distribution over hypotheses that aren’t trying to manipulate you. Fails if B is actually simulating P.


Use a measure where this can’t happen. Maybe if you optimize for fast runtime strongly enough? Maybe if you use a different model of computation? Set up a situation where malicious optimisers never have much (any?) of your probability mass. How much probability do you need to assign to a malicious optimiser before it causes damage?


Use T as long as it carries on trying to predict; throw it out as soon as you spot it trying to deceive. This is basically using 1) and 2) if they agree, and throwing both out if they disagree. So if P is thinking “B will output 1 and I should say 1 to increase A’s trust in me”, then use T. If P thinks “B will output 1 and I should say 0 to trick A”, then discard T.
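A toy version of this filter, assuming (unrealistically) that A can read off both P’s internal belief about B’s next bit and the prediction P intends to state:

```python
def filtered_prediction(belief, stated):
    """Accept T's output only while its stated prediction matches its
    internal belief; return None to signal that T should be discarded."""
    return stated if stated == belief else None

print(filtered_prediction(1, 1))  # 1    -- honest-looking: use T
print(filtered_prediction(1, 0))  # None -- caught deceiving: discard T
```

The hard part the sketch hides is the transparency assumption: extracting `belief` from P is exactly the difficulty raised in answer 2).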


Some probability distribution over the above.