Oracles, sequence predictors, and self-confirming predictions

My counterfactual Oracle design uses a utility function/reward function in order to train it to give the right predictions. Paul Christiano asked whether the whole utility function approach was necessary, and whether it could be replaced with a simple sequence predictor.

It turns out the answers are no, and yes. The counterfactual nature of the approach can be preserved, without needing a utility function.

Counterfactual sequence prediction

There is a boxed Oracle, $O$, which gets fed a series of observations $o_1, o_2, \ldots, o_n$, and must make a prediction $p_{n+1}$ about the next observation $o_{n+1}$.

If the prediction is sent out into the world, then $O$ is attempting to make $p_{n+1}$ into a self-confirming prediction, with all the problems that this could entail.

However, we make $O$ into a counterfactual Oracle; on some occasions, the output $p_{n+1}$ is erased, and not seen by anyone. In that case, the Oracle will get a special message $E$ (for “erasure”), which will be added on to the history after $o_{n+1}$.

Then the job of the counterfactual Oracle is, given a history $h_n$ (including past “$E$”s), to produce a prediction $p_{n+1}$ that is the best prediction for $o_{n+1}$ given the history $h_n$ and given erasure. In words, the Oracle is trying to best predict the next observation, given that erasure is assumed to happen (the definition of “best” depends on the details of the design).

If our setup is well designed and there is no information leakage, then any $o_{n+1}$ has a probability that is independent of the prediction $p_{n+1}$:

  • $P(o_{n+1} \mid h_n, E, p_{n+1}) = P(o_{n+1} \mid h_n, E)$.
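To make this concrete, here is a minimal sketch of what the training loop could look like when the Oracle is a plain sequence model, assuming logged episodes of the form (history, was-the-output-erased, next observation); the `ERASURE` token, the `CountingPredictor` stand-in and the episode format are illustrative assumptions, not a specified part of the design above:

```python
# Sketch of training a counterfactual sequence predictor (illustrative only).
from collections import defaultdict

ERASURE = "E"  # stands for the erasure message appended to the history

def training_pairs(episodes):
    """Keep only the erased episodes: the predictor is fitted purely on what
    o_{n+1} turned out to be when nobody read the prediction."""
    pairs = []
    for history, erased, next_obs in episodes:
        if erased:
            # Appending E encodes "conditional on erasure" as part of the context.
            pairs.append((tuple(history) + (ERASURE,), next_obs))
    return pairs

class CountingPredictor:
    """A stand-in for any sequence model: empirical distribution of the next
    observation given the last few tokens of the history."""
    def __init__(self, context=2):
        self.context = context
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, pairs):
        for history, next_obs in pairs:
            self.counts[history[-self.context:]][next_obs] += 1

    def predict(self, history):
        dist = self.counts[tuple(history)[-self.context:]]
        return max(dist, key=dist.get) if dist else None

# The deployed prediction p_{n+1} is what the model expects o_{n+1} to be
# *given erasure*, even on the occasions when the prediction is actually read.
episodes = [(["a", "b"], True, "c"), (["a", "b"], True, "c"), (["a", "b"], False, "d")]
oracle = CountingPredictor()
oracle.fit(training_pairs(episodes))
print(oracle.predict(["a", "b", ERASURE]))  # -> "c"
```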

Meaning of predictions

If we ran the Oracle without the assumption that $E$ would be observed, we’d have the standard self-confirming predictor (self-confirming predictions being the only correct ones); with that assumption, it becomes a counterfactual predictor.

Though counterfactuals are conceptually tricky, it’s actually easier to understand what a counterfactual predictor is doing than what a general sequence-predictor Oracle is doing. We understand that a general sequence predictor is trying to find a self-confirming prediction, one where $p_{n+1}$ is the best estimate of $o_{n+1}$ given $h_n$ and $p_{n+1}$ itself; but unless we understand the structure of the world very well, we don’t have a clue where that self-confirming prediction might be.

In contrast, the counterfactual prediction is easy to interpret: “had we not seen $p_{n+1}$, this is what $o_{n+1}$ would have been”; there are no loops in the definition.

Extent of manipulation via prediction

Note that the Oracle can be used to estimate the extent to which the prediction is manipulative, by contrasting its predictions for $o_{n+1}$ given the erasure $E$ with its predictions given that $p_{n+1}$ is read. The gap between the two reveals the extent to which the prediction is potentially manipulative.
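One hedged way of computing that contrast, assuming the Oracle can report both conditional distributions over $o_{n+1}$ (the dictionaries below are hypothetical stand-ins for $P(o_{n+1} \mid h_n, E)$ and for the distribution when the prediction is read):

```python
# Sketch of a manipulation estimate (illustrative): compare the Oracle's
# predictive distribution under erasure with its distribution when the
# prediction is read, using total variation distance.

def total_variation(p, q):
    """Total variation distance between two finite distributions (0 = identical)."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)

# Hypothetical distributions over o_{n+1}:
p_given_erasure = {"market up": 0.5, "market down": 0.5}  # under erasure
p_given_read    = {"market up": 0.9, "market down": 0.1}  # when the prediction is read

print(total_variation(p_given_erasure, p_given_read))  # -> 0.4, a fairly manipulative prediction
```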

Non-explicit erasure

If the Oracle isn’t given an explicit erasure message, but has a causal model of the world, we can still run the counterfactual Oracle by having it choose $p_{n+1}$ to be the best fit for $o_{n+1}$, conditional on $h_n$ and on the erasure happening.
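A rough sketch of how that conditioning could be implemented by rejection sampling from the causal model; the `sample_rollout` interface and the toy world model are assumptions, not a specified part of the design:

```python
# Sketch: counterfactual prediction without an explicit erasure token (illustrative).
from collections import Counter
import random

def counterfactual_prediction(sample_rollout, history, n_samples=10_000):
    """Condition on the erasure event by rejection sampling: keep only rollouts
    in which the causal model says the prediction would have been erased, then
    report the most likely next observation among them."""
    kept = Counter()
    for _ in range(n_samples):
        next_obs, erased = sample_rollout(history)
        if erased:
            kept[next_obs] += 1
    if not kept:
        raise ValueError("erasure never occurred in the sampled rollouts")
    return kept.most_common(1)[0][0]

# Toy world model: on erased days, nothing reacts to the prediction.
def toy_rollout(history):
    erased = random.random() < 0.3
    return ("quiet day" if erased else "reacts to prediction"), erased

print(counterfactual_prediction(toy_rollout, history=[]))  # -> "quiet day"
```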

Factoring out certain routes

Assume now that the AI is an active AI that can take actions in the world, affecting the observation. So we’ll replace the prediction terminology with action terminology: the AI takes an action $a$, which affects the observation $o$. Note that the predictions were already actions, just actions of a particular type in a particular setup.

We want to have the AI only consider certain routes to influence $o$. This might be because there is a reward-function learning process involved, and we don’t want the AI to influence the learning process. Or maybe there is a corrigibility button involved, and we don’t want the AI to try and ensure it is pressed or not pressed.

In any case, label the banned routes of influence by the variables $V$: in the causal graph, the action $a$ can influence the observation $o$ both directly and via $V$, and it is the influence via $V$ that we want to factor out.

Note that the Oracle case is just a special case of this, where $V$ corresponds to “humans reading the prediction”, and there is no direct arrow from $a$ to $o$.

Factoring out

Assume that $a_\emptyset$ is the default action. Then we want the AI to maximise some expected score $R$ of $a$ and $o$, but acting as if the values of $V$ were set by $a_\emptyset$ rather than by the value of $a$.

Let $\{v_i\}$ be the possible values of $V$. We require one more assumption: that if $v_i$ is possible given $a_\emptyset$, then it is also possible given any other action $a$ (technically, we only need this for the conditional probabilities below to make sense).

Then if the AI has sufficient data to estimate expressions like $P(v_i \mid a_\emptyset)$ and $\mathbb{E}[R \mid a, v_i]$, we act as if the AI was off-policy where $V$ is concerned. Namely, we get it to compute the action $a$ that maximises:

  • $\sum_i P(v_i \mid a_\emptyset) \, \mathbb{E}[R \mid a, v_i]$.

Note that if we had $P(v_i \mid a)$ instead of $P(v_i \mid a_\emptyset)$ in that expression, then, by the law of total expectation, this would just be the classical expected score $\mathbb{E}[R \mid a]$.
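For concreteness, a minimal sketch of this factored-out maximisation, assuming we already have estimates of $P(v_i \mid a_\emptyset)$ (from data gathered under the default action) and of $\mathbb{E}[R \mid a, v_i]$; the corrigibility-button example and all names are hypothetical:

```python
# Sketch: choosing an action while factoring out its influence on V (illustrative).
# p_v_default[v] stands for P(V = v | default action); expected_reward(a, v) for E[R | a, V = v].

def factored_out_value(action, p_v_default, expected_reward):
    """Score an action as if V had been set by the default action, not by `action`."""
    return sum(p * expected_reward(action, v) for v, p in p_v_default.items())

def best_action(actions, p_v_default, expected_reward):
    return max(actions, key=lambda a: factored_out_value(a, p_v_default, expected_reward))

# Hypothetical corrigibility-button example: V is whether the button gets pressed.
p_v_default = {"pressed": 0.1, "not pressed": 0.9}

def expected_reward(action, v):
    return {
        ("work", "pressed"): 0.0,           ("work", "not pressed"): 1.0,
        ("disable button", "pressed"): 0.0, ("disable button", "not pressed"): 0.95,
    }[(action, v)]

# Disabling the button cannot pay off here: P(V) is held at its default-action
# values, so the agent only gains by doing well given that fixed distribution.
print(best_action(["work", "disable button"], p_v_default, expected_reward))  # -> "work"
```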

This construction is essentially the same as the counterfactually uninfluenceable agent design, with the $v_i$ representing the weights of the different reward functions that the AI was to “learn”.
