Logical uncertainty, kind of. A proposal, at least.

If you want context and are fast at seeing the implications of math, see Benja’s post. This post is much lighter on the math, though it may take more background reading and more laborious interpolation, since it’s, well, lighter on the math.

Imagine I introduced my pet robot to a game. The robot has 10 seconds to pick a digit, and if the trillionth prime number ends with that digit, the robot gets a cookie (it likes peanut butter cookies the best). 10 seconds is not enough time for my robot to calculate the answer deductively. And yet, guessing an answer is superior to running out of time quietly. What sort of general logic should my robot follow in under 10 seconds to figure out that it should be indifferent between answering 1, 3, 7 or 9? Does it even make sense to be indifferent between the real answer and an impossible answer, even if you don’t know which is which?

As you might expect from context, the proposed solution will involve assigning every true or false math statement a probability-esque degree of plausibility, with numbers other than 0 or 1 indicating logical uncertainty. Why is this a good idea?

To explain logical uncertainty, let’s first take a step back and reframe logical certainty in terms of rules for reasoning that apply to both deductive logic and probabilistic logic. An important resource here is E.T. Jaynes’ Probability Theory (pdf) - the most relevant part being page 31 of the book. The key idea is that each of the probability axioms applies just fine no matter what kind of Boolean statement you want to find the probability of. Which is to say probability already applies to arithmetic—the laws of probability are also laws of arithmetic, just in the limit that probabilities go to 1 or 0. Our robot starts with a collection of definitions labeled with probability 1 (like “0 is a number” or “S(0)+0=S(0)” [if this S(0) stuff needs context, see wikipedia]), and then applies deductive rules according to the universal rules of probability. We translate “A implies B” into the language of probabilities as P(AB|C) = P(A|C), and then apply the always-true product rule P(B|AC) = P(AB|C) / P(A|C). If P(A|C)=1, that is, A|C is deductively true, and A implies B, then P(B|AC) = P(B|C) = 1. The machinery that underlies deduction is in fact the same machinery that underlies probabilistic reasoning. And we’re just going to exploit that a little.
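If it helps to see that machinery run on actual numbers, here is a tiny Python check of the chain above. The toy joint distribution over A and B is something I made up just so the arithmetic is concrete; it is not part of the argument.

```python
# A toy joint distribution over two Boolean statements A and B.
# "A implies B" is translated as P(AB) = P(A), i.e. no probability mass
# on the possibility (A and not B).
joint = {
    (True, True): 0.7,   # A and B
    (True, False): 0.0,  # A and not B -- zero, because A implies B
    (False, True): 0.2,
    (False, False): 0.1,
}

def p(event):
    """Probability of an event given as a predicate over (A, B)."""
    return sum(mass for world, mass in joint.items() if event(*world))

p_A = p(lambda a, b: a)
p_AB = p(lambda a, b: a and b)

print(p_AB == p_A)   # True: the probabilistic translation of "A implies B"
print(p_AB / p_A)    # 1.0: the product rule gives P(B|A) = P(AB) / P(A) = 1
```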

An alternate axiomatization due to Savage (hat tip to articles by Sniffoy and fool) is based just on actions—it doesn’t seem necessary for every agent to store numerical plausibilities, but every agent has to act, and if our agent is to act as if it had consistent preferences when presented with bets, it must act as if it calculated probabilities. Just like the conditions of Cox’s theorem as used by E.T. Jaynes, the conditions of Savage’s theorem apply to bets on arithmetic just fine. So our robot always behaves as if it assigns some probabilities over the last digit of the trillionth prime number—it’s just that when our robot’s allowed to run long enough, all but one of those probabilities is 0.

So how do we take the basic laws of belief-manipulation, like the product rule or the sum rule, and apply them to cases where we run out of time and can’t deduce all the things? If we still want to take actions, we still want to assign probabilities, but we can’t use deduction more than a set number of times...

Okay fine, I’ll just say it. The proposal outlined here is to treat a computationally limited agent’s “correct beliefs” as the correct beliefs of a computationally unlimited agent with a limited definition of what deduction can do. So this weakened-deduction agent has a limitation, in that starting from axioms it can only prove some small pool of theorems, but it’s unlimited in that it can take the pool of proven theorems, and then assign probabilities to all the unproven true or false statements. After we flesh out this agent, we can find a computationally limited algorithm that finds correct (i.e. equal to the ones from a sentence ago) probabilities for specific statements, rather than all of them. And finally, we have to take this and make a decision procedure—our robot. After all, it’s no good for our robot to assign probabilities if it proceeds to get stuck because it tries to compare the utilities of the world if the end of the trillionth prime number were 1 versus 7 and doesn’t even know what it means to calculate the utility of the impossible. We have to make a bit of a modification to the whole decision procedure; we can’t just throw in probabilities and expect utility to keep up.

So, formally, what’s going on when we limit deduction? Well, remember the process of deduction outlined earlier?

We translate “A implies B” into the language of probabilities as P(AB|C) = P(A|C), and then apply the always-true product rule P(B|AC) = P(AB|C) / P(A|C). If P(A|C)=1, that is, A|C is deductively true, and A implies B, then P(B|AC) = P(B|C) = 1.

There is a chain here, and if we want to limit deduction to some small pool of provable theorems, we need one of the links to be broken outside that pool. As implied, I don’t want to mess with the product rule, or else we violate a desideratum of belief. Instead, we’ll mess with implication itself—we translate “A implies B” into “P(AB|C)=P(A|C) only if we’ve spent less than 10 seconds doing deduction.” Or “P(AB|C)=P(A|C) only if it’s been less than 10^6 steps from the basic axioms.” These limitations are ugly and nonlocal because they represent the intrusion of our agent’s resource limits into a system that previously ran forever.
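Here is a sketch of what weakened implication might look like as a program. The rule format and the step counter are my own stand-ins; the only point is that implication is allowed to fire only while a budget holds.

```python
def deduce_with_budget(axioms, rules, max_steps=10**6):
    """Forward-chain from the axioms, but only while the step budget holds.
    Once the budget is spent, implication simply stops producing new theorems.
    Returns the pool of theorems proven in time."""
    proven = set(axioms)
    steps = 0
    changed = True
    while changed and steps < max_steps:
        changed = False
        for premise, conclusion in rules:   # each rule is "premise implies conclusion"
            if steps >= max_steps:
                break                       # the "only if" clause kicking in
            steps += 1
            if premise in proven and conclusion not in proven:
                proven.add(conclusion)
                changed = True
    return proven

# Toy usage: a tiny budget leaves a small pool; a big budget reaches everything.
axioms = {"0 is a number"}
rules = [("0 is a number", "S(0) is a number"),
         ("S(0) is a number", "S(S(0)) is a number")]
print(deduce_with_budget(axioms, rules, max_steps=1))    # one rule application only
print(deduce_with_budget(axioms, rules, max_steps=100))  # the full closure
```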

Note that the weakening of implication does not necessarily determine the shape of our pool of deduced theorems. A weakened-deduction agent could spiral outward from shortest to longest theorems, or it could search more cleverly to advance on some specific theorems before time runs out.

If a weakened-deduction agent just had the product rule and this new way of translating the axioms into probabilities, it would accumulate some pool of known probabilities—it could work out from the probability-1 axioms to show that some short statements had probability 1 and some other short statements had probability 0. It could also prove some more abstract things like P(AB)=0 without proving anything else about A or B, as long as it followed the right search pattern. But it can’t assign probabilities outside of deduction—it doesn’t have the rules. So it just ends up with a pool of deduced stuff in the middle of a blank plain of “undefined.”

Okay, back to referring to E.T. Jaynes (specifically, the bottom of page 32). When deriving the laws of probability from Cox’s desiderata, the axioms fall into different groups—there are the “laws of thought” parts, and the “interface” parts. The laws of thought are things like Bayes’ theorem, or the product rule. They tell you how probabilities have to fit with other probabilities. But they don’t give you probabilities ex nihilo, you have to start with probability-1 axioms or known probabilities and build out from them. The parts that tell you how to get new probabilities are the interface parts, ideas like “if you have equivalent information about two things, they should have the same probability.”

So what does our limited-deduction agent do once it reaches its limits of deduction? Well, to put it simply, it uses deduction as much as it can, and then it uses the principle of maximum entropy for the probability of everything else. Maximum entropy corresponds to minimum information, so it satisfies a desideratum like “don’t make stuff up.”

The agent is assigning probabilities to true or false logical statements, statements like S(0)+S(0)=S(S(0)). If it had an unrestricted translation of “A implies B,” it could prove this statement quickly. But suppose it can’t. Then this statement is really just a string of symbols. The agent no longer “understands” the symbols, which is to say it can only use facts about the probability of these symbols that were previously proved and are within the pool of theorems—it’s only a part of an algorithm, and doesn’t have the resources to prove everything, so we have to design the agent to assign probabilities based just on what it proved deductively.

So the design of our unlimited-computation, limited-deduction agent is that it does all the deduction it can according to some search algorithm and within some limit, and this can be specified to take any amount of time. Then, to fill up the infinity of un-deduced probabilities, the agent just assigns the maximum-entropy probability distribution consistent with what’s proven. For clever search strategies that figure out things like P(AB)=0 without figuring out P(A), doing this assignment requires interpretation of AND, OR, and NOT operations—that is, we still need a Boolean algebra for statements. But our robot no longer proves new statements about probabilities of these symbol strings, in the sense that P(S(0)+0=S(0)) = P(S(0)+S(0)=S(S(0))) is a new statement. An example of a non-new statement would be P(S(0)+0=S(0) AND S(0)+S(0)=S(S(0))) = P(S(0)+0=S(0)) * P(S(0)+S(0)=S(S(0)) | S(0)+0=S(0)) - that’s just the product rule; it hasn’t actually changed any of the equations.
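To make that concrete, here is a sketch of the final assignment step, under my own choice of representation (the statements of interest, plus the proven theorems encoded as constraints that rule possibilities out). With only possibilities ruled out, the maximum-entropy distribution is just uniform over whatever possibilities survive.

```python
from itertools import product

def max_entropy_assignment(statements, proven_constraints):
    """statements: names of the true-or-false statements we care about.
    proven_constraints: predicates over a truth assignment (a dict name -> bool)
    that return False for possibilities the deduced theorems rule out.
    Returns the maximum-entropy probability of each statement."""
    atoms = [dict(zip(statements, values))
             for values in product([True, False], repeat=len(statements))]
    consistent = [a for a in atoms if all(c(a) for c in proven_constraints)]
    weight = 1.0 / len(consistent)   # uniform over the surviving possibilities
    return {s: weight * sum(1 for a in consistent if a[s]) for s in statements}

# Toy usage: the only proven theorem is P(AB) = 0.
print(max_entropy_assignment(["A", "B"], [lambda a: not (a["A"] and a["B"])]))
# {'A': 0.333..., 'B': 0.333...}
```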

End of part 1 exercise: Can deducing an additional theorem lead to our agent assigning less probability to the right answer under certain situations? (Reading part 2 may help)

Okay, now on to doing this with actual bounded resources. And back to the trillionth prime number! You almost forgot about that, didn’t you. The plan is to break up the strict deduction → max entropy procedure, and do it in such a way that our robot can get better results (higher probability to the correct answer) the longer it runs, up to proving the actual correct answer. It starts with no theorems, and figures out the max entropy probability distribution for the end of the trillionth prime number. Said distribution happens to be one-half to everything, e.g. p(1)=1/2 and p(2)=1/2 and p(3)=1/2. The robot doesn’t know yet that the different answers are mutually exclusive and exhaustive, much less what’s wrong with the answer of 2. But the important thing is, assigning the same number to everything of interest is fast. Later, as it proves relevant theorems, the robot updates the probability distribution, and when it runs out of resources it stops.
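Here is roughly what that anytime behavior looks like; the theorem stream and the update functions below are stand-ins for the real bookkeeping described in the rest of this part.

```python
import time

def prime_digit_beliefs(theorem_updates, deadline_seconds=10.0):
    """Start every digit at probability 1/2 (maximum entropy, zero theorems),
    refine while the clock allows, and return whatever we have at the deadline."""
    beliefs = {d: 0.5 for d in "1234567890"}
    start = time.monotonic()
    for update in theorem_updates:          # each proven theorem adjusts the beliefs
        if time.monotonic() - start > deadline_seconds:
            break                           # out of time: act on current beliefs
        beliefs = update(beliefs)
    return beliefs

# Toy usage: the first theorem says the ten answers are mutually exclusive and
# exhaustive; the second rules out the even digits and 5 for a prime this large.
updates = [
    lambda b: {d: 1.0 / len(b) for d in b},
    lambda b: {d: (0.25 if d in "1379" else 0.0) for d in b},
]
print(prime_digit_beliefs(updates))
```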

Side note: there’s also another way of imagining how the robot stores probabilities, used in Benja’s post, which is to construct a really big mutually exclusive and exhaustive basis (called “disjunctive normal form”). Instead of storing P(A) and P(B), which are not necessarily mutually exclusive or exhaustive, we store P(AB), P(A¬B) (the hook thingy means “NOT”), P(¬AB), and P(¬A¬B), which are mutually exclusive and exhaustive. These things would then each have probability 1/4, or 1/2^N, where N is the number of statements you’re assigning probabilities to. This is a pain when N goes to infinity, but can be useful when N is approximately the number of possible last digits of a number.

Back on track: suppose the first thing the robot proves about the last digit of the trillionth prime number is that answers of 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0 are exhaustive. What does that do to the probabilities? In disjunctive normal form, the change is clear—exhaustiveness means that P(¬1¬2¬3¬4¬5¬6¬7¬8¬9¬0)=0, there’s no leftover space. Previously there were 2^10 = 1024 of these disjunctive possibilities, now there are 1023, and the remaining ones stay equivalent in terms of what’s been proven about them (nothing), so the probability of each went from 1/1024 to 1/1023. Two things to note: figuring this out took a small amount of work and is totally doable for the robot, but we don’t want to do this work every time we use modus tollens, so we need to have some way to tell whether our new theorem matters to the trillionth prime number.
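A quick check of those numbers, and of what they do to a statement we actually care about (say “the digit is 1”):

```python
from fractions import Fraction

# 10 digit-statements give 2**10 = 1024 disjunctive possibilities; proving
# exhaustiveness removes exactly one of them (the all-false one).
atoms_before = 2**10
atoms_after = atoms_before - 1
print(Fraction(1, atoms_before), Fraction(1, atoms_after))   # 1/1024 and 1/1023

# Half the original possibilities make "the digit is 1" true, and none of those
# were the removed one, so that statement's probability nudges up from 1/2:
print(Fraction(512, atoms_after))   # 512/1023
```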

For example, imagine we were interested in the statement A. The example is to learn that A, B, and C are mutually exclusive and exhaustive, step by step. First, we could prove that A, B, C are exhaustive—P(¬A¬B¬C)=0. Does this change P(A)? Yes, it changes from 4/8 (N is 3, so 2^3 = 8) to 4/7. Then we learn that P(AB)=0, i.e. A and B are mutually exclusive. This leaves us only A¬BC, ¬ABC, A¬B¬C, ¬AB¬C, and ¬A¬BC. P(A) is now 2/5. Now we learn that A and C are mutually exclusive, so the possibilities are ¬ABC, A¬B¬C, ¬AB¬C, and ¬A¬BC. P(A)=1/4. Each of the steps until now has had the statement A right there inside the parentheses—but for the last step, we show that B and C are mutually exclusive, P(BC)=0, and now we just have P(A)=P(B)=P(C)=1/3. We just took a step that didn’t mention A, but it changed the probability of A. This is because we’d previously disrupted the balance between ABC and ¬ABC. To tell when to update P(A) we not only need to listen for A to be mentioned, we have to track what A has been entangled with, and what’s been entangled with that, and so on in a web of deduced relationships.
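Here is that walkthrough reproduced by brute force over the eight A/B/C possibilities, just to check the arithmetic:

```python
from fractions import Fraction
from itertools import product

def prob_A(constraints):
    """P(A) under a uniform (maximum-entropy) distribution over the
    truth assignments to A, B, C that survive the proven constraints."""
    atoms = [dict(zip("ABC", values)) for values in product([True, False], repeat=3)]
    consistent = [a for a in atoms if all(c(a) for c in constraints)]
    return Fraction(sum(1 for a in consistent if a["A"]), len(consistent))

steps = [
    lambda a: a["A"] or a["B"] or a["C"],   # exhaustive: P(not-A not-B not-C) = 0
    lambda a: not (a["A"] and a["B"]),      # A and B mutually exclusive
    lambda a: not (a["A"] and a["C"]),      # A and C mutually exclusive
    lambda a: not (a["B"] and a["C"]),      # B and C mutually exclusive (never mentions A!)
]

constraints = []
print(prob_A(constraints))       # 1/2, i.e. 4/8
for step in steps:
    constraints.append(step)
    print(prob_A(constraints))   # 4/7, then 2/5, then 1/4, then 1/3
```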

The good news is that that’s it. The plausibility assigned to any statement A by this finite-computation method is the same plausibility that our computationally-unlimited, deductively-limited agent would have assigned to it, given the same pool of deduced theorems. The difference is just that the limited-deduction agent did this for every possible statement, which as mentioned doesn’t make as much sense in disjunctive normal form.

So IF we accept that having limited resources is like having a limited ability to do implication, THEN we know how our robot should assign probabilities to a few statements of interest. It should start with the good old “everything gets probability 1/2,” which should allow it to win some cookies even if it only has a few milliseconds, and then it should start proving theorems, updating its probabilities when it proves something that should impact those probabilities.

Now onto the last part. The robot’s utility function wasn’t really designed for U(last digit of trillionth prime number is 1), so what should it do? Well, what does our robot like? It likes having a cookie over not having a cookie. C is for cookie, and that’s good enough for it. So we want to transform a utility over cookies into an expected utility that will let us order possible actions.

We have to make the exact same transformation in the case of ordinary probabilities, so let’s examine that. If I flip a coin and get a cookie if I call it correctly, I don’t have a terminal U(heads) or U(tails), I just have U(cookie). My expected utility of different guesses comes from not knowing which guess leads to the cookie.

Similarly, the expected utility of different guesses when betting on the trillionth prime number comes from not knowing which guess leads to the cookie. It is possible to care about the properties of math, or to care about whether coins land heads or tails, but that just means we have to drag in causality—your guess doesn’t affect how math works, or flip coins over.

So the standard procedure for our robot looks like this:

Start with some utility function U over the world, specifically cookies.

Now, face a problem. This problem will have some outcomes (possible numbers of cookies), some options (that is, strategies to follow, like choosing one of 10 possible digits), and any amount of information about how options correspond to outcomes (like “iff the trillionth prime ends with this digit, you get the cookie”).

Now our robot calculates the limited-resources probability of getting different outcomes given different strategies, and from that calculates an expected utility for each strategy (there’s a sketch of this in code after these steps).

Our robot then follows one of the strategies with maximum expected utility.
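In code, with made-up numbers standing in for whatever the robot actually managed to prove (below I assume it only had time to show that a big prime has to end in 1, 3, 7, or 9), the procedure might look like this:

```python
def expected_utility(option, prob_of_outcome, utility):
    """Expected utility of one strategy, given the robot's limited-resources
    probabilities over outcomes and its utility over those outcomes."""
    return sum(p * utility[outcome] for outcome, p in prob_of_outcome[option].items())

utility = {"cookie": 1, "no cookie": 0}   # U is over cookies, not over math

options = list("1234567890")              # the ten digits the robot could guess
prob_of_outcome = {
    d: {"cookie": 0.25 if d in "1379" else 0.0,
        "no cookie": 0.75 if d in "1379" else 1.0}
    for d in options
}

scores = {d: expected_utility(d, prob_of_outcome, utility) for d in options}
best = max(scores, key=scores.get)
print(best, scores[best])   # a digit in {1, 3, 7, 9}, with expected utility 0.25
```

Note that the utility is still only over cookies; the digits enter only through the robot’s probabilities.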

Bonus exercises: Does this procedure already handle probabilistic maps from the options to the outcomes, like in the case of the flipped coin? How about if flipping a coin isn’t already converted into a probability, but is left as an underdetermined problem a la “a coin (heads XOR tails) is flipped, choose one”?