# Logical uncertainty, kind of. A proposal, at least.

If you want context and are fast at seeing the implications of math, see Benja's post. This post is much lighter on the math, though it may take more background reading and more laborious interpolation, since it's, well, lighter on the math.

Imagine I introduced my pet robot to a game. The robot has 10 seconds to pick a digit, and if the trillionth prime number ends with that digit, the robot gets a cookie (it likes peanut butter cookies the best). 10 seconds is not enough time for my robot to calculate the answer deductively. And yet, guessing an answer is superior to running out of time quietly. What sort of general logic should my robot follow in under 10 seconds to figure out that it should be indifferent between answering 1, 3, 7 or 9? Does it even make sense to be indifferent between the real answer and an impossible answer, even if you don't know which is which?

As you might expect from context, the proposed solution will involve assigning every true or false math statement a probability-esque degree of plausibility, with numbers other than 0 or 1 indicating logical uncertainty. Why is this a good idea?

To explain logical uncertainty, let's first take a step back and reframe logical certainty in terms of rules for reasoning that apply to both deductive logic and probabilistic logic. An important resource here is E.T. Jaynes' Probability Theory (pdf), the most relevant part being page 31 of the book. The key idea is that each of the probability axioms applies just fine no matter what kind of Boolean statement you want to find the probability of. Which is to say, probability already applies to arithmetic—the laws of probability are also laws of arithmetic, just in the limit that probabilities go to 1 or 0. Our robot starts with a collection of definitions labeled with probability 1 (like "0 is a number" or "S(0)+0=S(0)" [if this S(0) stuff needs context, see Wikipedia]), and then applies deductive rules according to the universal rules of probability. We translate "A implies B" into the language of probabilities as P(AB|C) = P(A|C), and then apply the always-true product rule P(B|AC) = P(AB|C) / P(A|C). If P(A|C)=1, that is, A|C is deductively true, and A implies B, then P(B|AC) = P(B|C) = 1. The machinery that underlies deduction is in fact the same machinery that underlies probabilistic reasoning. And we're just going to exploit that a little.
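As a toy illustration (my own, not from Jaynes or the post), the translation above can be checked mechanically: feed the product rule a probability-1 premise and the "A implies B" translation, and modus ponens falls out.

```python
# Deduction as the limiting case of probability. We encode "A implies B"
# as P(A and B | C) = P(A | C), then recover P(B | A, C) = 1 from the
# product rule alone. (Toy sketch; the background C is left implicit.)

def conditional(p_joint, p_given):
    """Product rule rearranged: P(B|A) = P(A and B) / P(A)."""
    return p_joint / p_given

p_A = 1.0          # A is deductively certain: P(A|C) = 1
p_A_and_B = p_A    # the translation of "A implies B": P(AB|C) = P(A|C)

p_B_given_A = conditional(p_A_and_B, p_A)
assert p_B_given_A == 1.0  # modus ponens, via the product rule
```

Nothing here is special to probabilities of 0 and 1; the same rearrangement is what the robot will keep using once intermediate values appear.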

An alternate axiomatization due to Savage (hat tip to articles by Sniffoy and fool) is based just on actions—it doesn't seem necessary for every agent to store numerical plausibilities, but every agent has to act, and if our agent is to act as if it had consistent preferences when presented with bets, it must act as if it calculated probabilities. Just like the conditions of Cox's theorem as used by E.T. Jaynes, the conditions of Savage's theorem apply to bets on arithmetic just fine. So our robot always behaves as if it assigns some probabilities over the last digit of the trillionth prime number—it's just that when our robot's allowed to run long enough, all but one of those probabilities is 0.

So how do we take the basic laws of belief-manipulation, like the product rule or the sum rule, and apply them to cases where we run out of time and can't deduce all the things? If we still want to take actions, we still want to assign probabilities, but we can't use deduction more than a set number of times...

Okay, fine, I'll just say it. The proposal outlined here is to treat a computationally limited agent's "correct beliefs" as the correct beliefs of a computationally unlimited agent with a limited definition of what deduction can do. So this weakened-deduction agent has a limitation, in that starting from axioms it can only prove some small pool of theorems, but it's unlimited in that it can take the pool of proven theorems and then assign probabilities to all the unproven true or false statements. After we flesh out this agent, we can find a computationally limited algorithm that finds correct (i.e. equal to the ones from a sentence ago) probabilities for specific statements, rather than all of them. And finally, we have to take this and make a decision procedure—our robot. After all, it's no good for our robot to assign probabilities if it proceeds to get stuck because it tries to compare the utilities of the world if the last digit of the trillionth prime number were 1 versus 7, and doesn't even know what it means to calculate the utility of the impossible. We have to make a bit of a modification to the whole decision procedure; we can't just throw in probabilities and expect utility to keep up.

So, formally, what's going on when we limit deduction? Well, remember the process of deduction outlined earlier?

We translate "A implies B" into the language of probabilities as P(AB|C) = P(A|C), and then apply the always-true product rule P(B|AC) = P(AB|C) / P(A|C). If P(A|C)=1, that is, A|C is deductively true, and A implies B, then P(B|AC) = P(B|C) = 1.

There is a chain here, and if we want to limit deduction to some small pool of provable theorems, we need one of the links to be broken outside that pool. As implied, I don't want to mess with the product rule, or else we violate a desideratum of belief. Instead, we'll mess with implication itself—we translate "A implies B" into "P(AB|C)=P(A|C) only if we've spent less than 10 seconds doing deduction." Or "P(AB|C)=P(A|C) only if it's been less than 10^6 steps from the basic axioms." These limitations are ugly and nonlocal because they represent the intrusion of nonlocal limitations on our agent into a system that previously ran forever.

Note that the weakening of implication does not necessarily determine the shape of our pool of deduced theorems. A weakened-deduction agent could spiral outward from shortest to longest theorems, or it could search more cleverly to advance on some specific theorems before time runs out.

If a weakened-deduction agent just had the product rule and this new way of translating the axioms into probabilities, it would accumulate some pool of known probabilities—it could work out from the probability-1 axioms to show that some short statements had probability 1 and some other short statements had probability 0. It could also prove some more abstract things like P(AB)=0 without proving anything else about A or B, as long as it followed the right search pattern. But it can't assign probabilities outside of deduction—it doesn't have the rules. So it just ends up with a pool of deduced stuff in the middle of a blank plain of "undefined."

Okay, back to referring to E.T. Jaynes (specifically, the bottom of page 32). When deriving the laws of probability from Cox's desiderata, the axioms fall into different groups—there are the "laws of thought" parts, and the "interface" parts. The laws of thought are things like Bayes' theorem, or the product rule. They tell you how probabilities have to fit with other probabilities. But they don't give you probabilities ex nihilo; you have to start with probability-1 axioms or known probabilities and build out from them. The parts that tell you how to get new probabilities are the interface parts, ideas like "if you have equivalent information about two things, they should have the same probability."

So what does our limited-deduction agent do once it reaches its limits of deduction? Well, to put it simply, it uses deduction as much as it can, and then it uses the principle of maximum entropy for the probability of everything else. Maximum entropy corresponds to minimum information, so it satisfies a desideratum like "don't make stuff up."

The agent is assigning probabilities to true or false logical statements, statements like S(0)+S(0)=S(S(0)). If it had an unrestricted translation of "A implies B," it could prove this statement quickly. But suppose it can't. Then this statement is really just a string of symbols. The agent no longer "understands" the symbols, which is to say it can only use facts about the probability of these symbols that were previously proved and are within the pool of theorems. It's only a part of an algorithm, and doesn't have the resources to prove everything, so we have to design the agent to assign probabilities based just on what it proved deductively.

So the design of our unlimited-computation, limited-deduction agent is that it does all the deduction it can according to some search algorithm and within some limit, and this can be specified to take any amount of time. Then, to fill up the infinity of un-deduced probabilities, the agent just assigns the maximum-entropy probability distribution consistent with what's proven. For clever search strategies that figure out things like P(AB)=0 without figuring out P(A), doing this assignment requires interpretation of AND, OR, and NOT operations—that is, we still need a Boolean algebra for statements. But our robot no longer proves new statements about probabilities of these symbol strings, in the sense that P(S(0)+0=S(0)) = P(S(0)+S(0)=S(S(0))) is a new statement. An example of a non-new statement would be P((S(0)+0=S(0)) AND (S(0)+S(0)=S(S(0)))) = P(S(0)+0=S(0)) * P(S(0)+S(0)=S(S(0)) | S(0)+0=S(0))—that's just the product rule; it hasn't actually changed any of the equations.

End of part 1 exercise: Can deducing an additional theorem lead to our agent assigning less probability to the right answer under certain situations? (Reading part 2 may help.)

Okay, now on to doing this with actual bounded resources. And back to the trillionth prime number! You almost forgot about that, didn't you. The plan is to break up the strict deduction → max entropy procedure, and do it in such a way that our robot can get better results (higher probability to the correct answer) the longer it runs, up to proving the actual correct answer. It starts with no theorems, and figures out the max entropy probability distribution for the last digit of the trillionth prime number. Said distribution happens to be one-half to everything, e.g. p(1)=1/2 and p(2)=1/2 and p(3)=1/2. The robot doesn't know yet that the different answers are mutually exclusive and exhaustive, much less what's wrong with the answer of 2. But the important thing is, assigning the same number to everything of interest is fast. Later, as it proves relevant theorems, the robot updates the probability distribution, and when it runs out of resources it stops.

Side note: there's also another way of imagining how the robot stores probabilities, used in Benja's post, which is to construct a really big mutually exclusive and exhaustive basis (called "disjunctive normal form"). Instead of storing P(A) and P(B), which are not necessarily mutually exclusive or exhaustive, we store P(AB), P(A¬B) (the hook thingy means "NOT"), P(¬AB), and P(¬A¬B), which are mutually exclusive and exhaustive. These things would then each have probability 1/4, or 1/2^N, where N is the number of statements you're assigning probabilities to. This is a pain when N goes to infinity, but can be useful when N is approximately the number of possible last digits of a number.

Back on track: suppose the first thing the robot proves about the last digit of the trillionth prime number is that answers of 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0 are exhaustive. What does that do to the probabilities? In disjunctive normal form, the change is clear—exhaustiveness means that P(¬1¬2¬3¬4¬5¬6¬7¬8¬9¬0)=0; there's no leftover space. Previously there were 2^10 = 1024 of these disjunctive possibilities, now there are 1023, and the remaining ones stay equivalent in terms of what's been proven about them (nothing), so the probability of each went from 1/1024 to 1/1023. Two things to note: figuring this out took a small amount of work and is totally doable for the robot, but we don't want to do this work every time we use modus tollens, so we need to have some way to tell whether our new theorem matters to the trillionth prime number.
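The update above is easy to check by brute force—a sketch of my own, enumerating all 2^10 truth assignments to the ten statements "the answer is d" and conditioning on the proven exhaustiveness:

```python
from itertools import product

# One DNF possibility = one truth assignment to the ten statements
# "the last digit is d" for d in 1..9,0. Max entropy: all equiprobable.
worlds = list(product([True, False], repeat=10))
assert len(worlds) == 1024

# Proving exhaustiveness rules out exactly one world: the all-false one.
worlds = [w for w in worlds if any(w)]
assert len(worlds) == 1023

# P("the answer is 1") = fraction of surviving worlds where the
# first statement is true: 512 of the 1023.
p_one = sum(w[0] for w in worlds) / len(worlds)
assert p_one == 512 / 1023
```

So each individual digit's probability barely moves (from 1/2 to 512/1023), even though one whole disjunct was eliminated—the real action starts once mutual exclusivity gets proven too.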

For example, imagine we were interested in the statement A. The example is to learn that A, B, and C are mutually exclusive and exhaustive, step by step. First, we could prove that A, B, C are exhaustive—P(¬A¬B¬C)=0. Does this change P(A)? Yes, it changes from 4/8 (N is 3, so 2^3 = 8) to 4/7. Then we learn that P(AB)=0, i.e. A and B are mutually exclusive. This leaves us only A¬BC, ¬ABC, A¬B¬C, ¬AB¬C, and ¬A¬BC. P(A) is now 2/5. Now we learn that A and C are mutually exclusive, so the possibilities are ¬ABC, A¬B¬C, ¬AB¬C, and ¬A¬BC. P(A)=1/4. Each of the steps until now has had the statement A right there inside the parentheses—but for the last step, we show that B and C are mutually exclusive, P(BC)=0, and now we just have P(A)=P(B)=P(C)=1/3. We just took a step that didn't mention A, but it changed the probability of A. This is because we'd previously disrupted the balance between ABC and ¬ABC. To tell when to update P(A) we not only need to listen for A to be mentioned, we have to track what A has been entangled with, and what's been entangled with that, and so on in a web of deduced relationships.

The good news is that that's it. The plausibility assigned to any statement A by this finite-computation method is the same plausibility that our computationally-unlimited, deductively-limited agent would have assigned to it, given the same pool of deduced theorems. The difference is just that the limited-deduction agent did this for every possible statement, which as mentioned doesn't make as much sense in disjunctive normal form.

So IF we accept that having limited resources is like having a limited ability to do implication, THEN we know how our robot should assign probabilities to a few statements of interest. It should start with the good old "everything gets probability 1/2," which should allow it to win some cookies even if it only has a few milliseconds, and then it should start proving theorems, updating its probabilities when it proves something that should impact those probabilities.

Now onto the last part. The robot's utility function wasn't really designed for U(last digit of trillionth prime number is 1), so what should it do? Well, what does our robot like? It likes having a cookie over not having a cookie. C is for cookie, and that's good enough for it. So we want to transform a utility over cookies into an expected utility that will let us order possible actions.

We have to make the exact same transformation in the case of ordinary probabilities, so let's examine that. If I flip a coin and get a cookie if I call it correctly, I don't have a terminal U(heads) or U(tails), I just have U(cookie). My expected utility of different guesses comes from not knowing which guess leads to the cookie.

Similarly, the expected utility of different guesses when betting on the trillionth prime number comes from not knowing which guess leads to the cookie. It is possible to care about the properties of math, or to care about whether coins land heads or tails, but that just means we have to drag in causality—your guess doesn't affect how math works, or flip coins over.

So the standard procedure for our robot looks like this:

1. Face a problem. This problem will have some outcomes (possible numbers of cookies), some options (that is, strategies to follow, like choosing one of 10 possible digits), and any amount of information about how options correspond to outcomes (like "iff the trillionth prime ends with this digit, you get the cookie").

2. Calculate the limited-resources probability of getting different outcomes given different strategies, and from that calculate an expected utility for each strategy.

3. Follow one of the strategies with maximum expected utility.
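The procedure can be sketched in a few lines (my own illustration; the names and the 1/4 probabilities are assumptions, corresponding to a robot that has already proven mutual exclusivity and exhaustivity among {1, 3, 7, 9}):

```python
# Face a problem -> compute limited-resources probabilities ->
# expected utility per strategy -> pick a maximizer.

def expected_utilities(p_outcome_given_option, utility):
    """Expected utility of each option under its outcome distribution."""
    return {opt: sum(p * utility(out) for out, p in dist.items())
            for opt, dist in p_outcome_given_option.items()}

# Robot's current beliefs: each of 1, 3, 7, 9 wins the cookie with
# probability 1/4 (the hypothetical post-deduction state).
p = {d: {"cookie": 0.25, "no cookie": 0.75} for d in (1, 3, 7, 9)}
u = lambda out: 1 if out == "cookie" else 0   # U(cookie)=1, else 0

eu = expected_utilities(p, u)
best = max(eu, key=eu.get)   # ties broken arbitrarily: any of 1, 3, 7, 9
```

Note that utility attaches to outcomes (cookies), never to "worlds where math is different"—the uncertainty lives entirely in the probabilities.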

Bonus exercises: Does this procedure already handle probabilistic maps from the options to the outcomes, like in the case of the flipped coin? How about if flipping a coin isn't already converted into a probability, but is left as an underdetermined problem a la "a coin (heads XOR tails) is flipped, choose one"?

• Thanks for writing this, I found it helpful for understanding the motivation behind Benja's proposal and to develop some intuitions about it. I'd like to see some explanation of how statements involving quantifiers are supposed to work in the proposal. Here's my understanding, and please let me know if it's correct.

A statement with a "for all" quantifier is also supposed to start with 1/2 probability. Consider "For all X, Q(X)" for some equation Q. This implies Q(0) and Q(1), etc., each of which also starts with 1/2 probability. What happens as the robot proves individual implications of the form "(For all X, Q(X)) implies Q(0)"? The probability of the quantified statement is reduced by half each time. Proving that individual statements like Q(0) are true can only bring the probability of the quantified statement back up to 1/2, but not above that.

Since proving Q(i) tends to be harder (take longer) than proving "(For all X, Q(X)) implies Q(i)", the probability of the quantified statement goes to 0 as we increase the time/length limit, even if all of the individual Q(i) are provable. This doesn't seem to be desirable, if my understanding is correct.

• Oh, since I was recently thinking about this, I figured I'd write down how the robot actually ends up raising probabilities above 1/2 (without just going to 1): if we can make the probability of Q go down, then we just have to make the probability of ¬Q go down the same way. The probability of Q goes down when we learn that Q implies X but we don't see X. The probability of Q goes up when we learn that ¬Q implies Y but we don't see Y.

• Hm. Yeah, you're right. The probability goes like 1/(1+2^(number of things implied)). You'd think that "3 is prime, 5 is prime, 7 is prime" would at least make the robot guess that all odd numbers were prime.
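This 1/(1+2^k) pattern, and the "back up to 1/2 but never above" behavior, both drop out of the same world-counting as before. A small check of my own, enumerating truth assignments to (∀x Q(x), Q(1), ..., Q(k)):

```python
from itertools import product

def p_forall(k, q_proved=False):
    """Max-entropy P(forall x Q(x)) after proving the k implications
    "(forall x Q(x)) implies Q(i)". If q_proved, each Q(i) is also
    individually proven true."""
    worlds = []
    for w in product([True, False], repeat=k + 1):
        forall, qs = w[0], w[1:]
        if forall and not all(qs):
            continue   # ruled out by a proven implication
        if q_proved and not all(qs):
            continue   # ruled out by the proven Q(i)
        worlds.append(w)
    return sum(w[0] for w in worlds) / len(worlds)

assert p_forall(3) == 1 / (1 + 2**3)       # the 1/(1+2^k) pattern
assert p_forall(3, q_proved=True) == 0.5   # back to 1/2, never above
```

With the implications alone, the one all-true world competes against 2^k forall-false worlds; proving every Q(i) kills all but two worlds, restoring exactly 1/2.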

On the other hand, if we think of all possible statements "for all X, Q(X)", there are more ways to be false than true. Infinitely more ways, even, and this is directly related to how many things are implied by the statement. So in that way, the robot is functioning exactly as intended and assigning the maximum entropy value.

Which makes this problem sort of like my suspicions about magfrump's idea below—related to the fact that we expect simple things, we don't actually expect maximum entropy—and we're often right.

• I thought of it some more... Don't think of probability as a way to represent ignorance. Probability represents a kind of knowledge. For example when you say "the probability that the coin fell heads is 1/2", that represents a sort of a model of what bounces do to the coin direction. When you multiply probabilities for a pair of coin throws, that represents independence of the throws. The 1/2 for Q(i) is not representative of that, and even if it was representative of something (statistics on the Q), the product of these numbers is not the probability but a lower bound on the probability assuming independence; assuming they aren't independent, the probability is still 1/2.

Probabilistic methods certainly are very useful in time-bounded calculations—they are used very heavily in computer graphics and simulations of all kinds—but that is usually through use of a random number source or a sufficiently good random number generator, as in probabilistic primality tests such as Miller-Rabin, for example. The time-bounded bot, far from evaluating and updating "probabilities", would have to do symbolic mathematics searching for an imperfect solution, and make use of random number sources whenever appropriate, striving for bounds on the probability of being wrong. (E.g. when I implemented Miller-Rabin I just made it run enough calculations so that it wouldn't fail till the heat death of the universe, assuming that my prng wouldn't somehow return a huge string of bad witnesses, which would be very interesting if it happened.)

• I thought of it some more... Don't think of probability as a way to represent ignorance. Probability represents a kind of knowledge.

I'll be specific. Shannon entropy represents ignorance. The bigger it is, the more ignorant you are.

• Well, that doesn't seem very useful, as per Wei Dai's example...

Why not look at the methods people actually employ to approximate and guess math when they can't quite compute something? Applied mathematics is an enormously huge field.

• On the other hand, if we think of all possible statements "for all X, Q(X)", there are more ways to be false than true. Infinitely more ways, even,

If you consider all syntactically valid Q in some sort of math notation, no matter the length, there's the fraction that is simply 0=0*( ... X somewhere here ... ), or your favourite less trivial statement of choice. Ditto for Turing machine tapes et cetera. There's certainly a nonzero probability of constructing Q(X) that holds for all X.

• That is a more detailed model than the robot uses. What it means by the ways to be false or true is more like

"true for 1, true for 2, true for 3, false for 4, true for 5..." The robot can't look inside the statements while it's doing probabilistic logic; it can only look at truth values and relationships.

On the other hand, the power of doing that is certainly a good reason to upgrade the robot :)

• This is counterintuitive in an interesting way.

You'd think that since P(Q1|~∀xQx) = 1/2 and P(Q1|∀xQx) = 1, observing Q1 is evidence in favor of ∀xQx.

And it is, but the hidden catch is that this depends on the implication ∀xQx -> Q1, and that implication is exactly the same amount of evidence against ∀xQx.

It's also an amusing answer to the end of part 1 exercise.

• I feel like the example of digits of prime numbers is a little bit leading here, because prime numbers are actually equidistributed in a technical sense, and sufficiently firmly that if you pretend that everything works like probabilities, you end up getting the right answers and producing deep theories.

There are other situations in which there aren't naive probabilistic models which are actually highly predictive, and even if it's clear to me how to translate your reasoning into the specific example presented, it's not clear to me how to translate it to something about, say, orders of exceptional finite simple groups or smooth structures on real vector spaces, which are both amenable to probabilistic answers but have answers that don't look like probability distributions.

• prime numbers are actually equidistributed in a technical sense

The last digit of the trillionth prime number is 3, which is about as non-equidistributed as you can get. That's the right answer. Anything else is just a way for the robot to confess its ignorance gracefully.

EDIT: although one trouble might be that this robot isn't equipped to handle improper distributions. So if you hand it an infinitude of finite simple groups and tell it to choose one, it assigns everything a probability of zero and chooses according to whatever algorithm it uses to choose between things that have identical utility.

• Yes, but since the trillionth prime isn't computable in the time allowed, the question translates to "some high prime", and among the set of odd primes, the ones digits are equidistributed over the set {1,3,7,9}. I agree with the point that this type of question can be attacked as though it were a probabilistic question.

My point is that heuristic reasoning WORKS on primes; this is a theorem. If you say "I don't know the answer, but I will use heuristic reasoning to make a guess" and you ask for the ones digits of the trillionth, trillion-and-third, and two-trillion-and-seventeenth primes, you expect to get three numbers picked at random from that set, and guessing this way will serve you well.

There are other questions where you might think to apply heuristic reasoning, such as the ones digit of the number of isomorphism classes of finite groups of a given integer order, which like the question about primes is a function from the natural numbers to the digits 0-9, but I do not believe that it gives you a function which anything reasonable is known about; it doesn't converge to a distribution on the set of digits, it just acts weird and changes drastically at unpredictable points.

• One can make a slightly better guess, though. The number of digits in the trillionth prime can be found via some approximate prime counting rule, then the distribution of sums of all digits but the last can be estimated, then the divisibility-by-3 rule over the sum of all digits makes the last digits very slightly not equiprobable with regards to divisibility by 3. If you use algebra cleverly and avoid any unnecessary computations (such as computation of actual 'probabilities' whenever those are unnecessary, trying to answer the question of A>B algebraically rather than arithmetically), you might be able to do that in time.

edit: or much more simply, if you have an upper bound and a lower bound on the value of the prime, then within this range the last digits are not equidistributed. Although it feels to me that the choice of the last digit will depend entirely on which method comes to mind for our agent, and in turn, the agent will appear irrational (priming—the choice of the digit will depend on what ideas the agent has available), and a malicious third party that knows the answer and wants our agent deprived of cookies could talk the agent into making a wrong guess.

edit: this is quite an interesting topic, actually. If you know that different approximation methods will yield different last digits, and you assume that the choice of an approximation method is random, then you should further scale down the utility difference to approximate the sum (average) over all methods.

Though, I don't see anything between a very very slightly better guess, and the correct answer. And if the very very slightly better guess doesn't return 3, you are guaranteed to lose.

• If you try to count primes IN ANY WAY, then you will end up with the answer that YOU HAVE AN EQUAL CHANCE OF GETTING ANY COPRIME RESIDUE MODULO 10; this is a THEOREM. It is a proven fact of nature, or platonic ideals, or whatever it is theorems are facts of. Measuring mod 3 will just tell you that the residues are also equidistributed modulo 3; it will reduce to computing residues modulo 30, which will not give you any information whatsoever because there will be an equal number that reduce to 1, 3, 7, and 9 modulo 10.

If you have an upper and lower bound, obviously primes aren't EXACTLY equidistributed, but in this case your upper and lower bounds will be away by millions, with primes showing up on average every 27 digits. The probabilities will work out, if you think of this as a random question selected from many possible and indistinguishable questions.

In real life, if I ask you "how tall is the tallest sequoia tree?" versus "how tall is the tallest Scots pine?"—even though both questions have specific answers, if you don't have Wikipedia handy, you'll need to default to "how tall do evergreens get?" and the answer becomes a probability distribution. The same is true of prime numbers, once you blur out the details as much as they already have been blurred out. So the "correct answer" in a Bayesian sense is the well-calibrated answer, where you have a distribution and confidence intervals that are of known accuracy over many similar questions, even if you get the wrong answer sometimes. This sort of reasoning is totally transparent with respect to one specific mathematical phenomenon: the distribution of prime numbers in a range.

It may not be transparent with respect to other mathematical problems.

• Equidistribution, of course, doesn't imply "equal chances" for the Nth prime. It's 3. If you don't have time to calculate 3, there's no theorem saying you can't conclude something (trivially, in edge cases where you have almost enough time you might be able to exclude 1 somehow). Another interesting thing is that 2^n-1 is rarely prime (n has to be prime and then it's still usually not), and we know that numbers of the form 2^n-1 don't end in 9; the impact of this decreases as n grows, though. All sorts of small subtle things going on. (And yes, the resulting difference in "probabilities" is very small.)

• If you were able to conclude additional information about the one trillionth prime OTHER THAN that it is a large prime number which is approximately of size 27 trillion, then that information MAY contain information about that prime modulo 10, I agree.

I would be very surprised (p = .01) if there actually are results of the form "the nth prime has such and such characteristics modulo whatever", because of the strong equidistribution results that do exist, and obviously after someone computes the answer it is known. If you look at prime numbers and apply techniques from probability theory, it does work, and it works beautifully well. Adding information beyond "high prime" would allow you to apply techniques from probability theory to deal with logical uncertainty, and it would work well. The examples you give seem to be examples of this.

My core point is that other problems may be less susceptible to approaches from probability theory. Even if looking at digits of prime numbers is not a probability theory problem, using techniques from probability theory accesses information from theorems you may not have proven. Using techniques from probability theory elsewhere does not do this, because those theorems you haven't proven aren't even true. So I am concerned, when someone purports to solve the problem of logical uncertainty, that their example problem is one whose solution looks like normal uncertainty.

I don't think we disagree on any matters of fact; I think we may disagree about the definition of the word "probability."

• Yes, I agree. In other problems, your "probabilities" are going to not be statistically independent from basic facts of mathematics. I myself posted a top-level comment with regards to the ultimate futility of the 'probabilistic' approach. Probability is not just like logic. If you have a graph with loops or cycles, it is incredibly expensive. It isn't some reals flowing through a network of tubes; sides of a loop are not statistically independent. It doesn't cut down your time at all, except in extreme examples. (I once implemented a cryptographic algorithm dependent on the Miller–Rabin primality test, which is "probabilistic", and my understanding is that this is common in cryptography and is used by your browser any time you establish an SSL connection.)

• Okay, yes, we are in agreement.

• I think you may be mix­ing prob­a­bil­ity with fre­quency here.

If you say “I don’t know the answer, but I will use heuristic reasoning to make a guess,” and you ask for the ones digits of the trillionth, trillion-and-third, and two-trillion-and-seventeenth primes, you expect to get three numbers picked at random from the set {1, 3, 7, 9}, and guessing this way will serve you well.
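That heuristic policy is nearly one line of code; a sketch, where the candidate set {1, 3, 7, 9} is just the set of possible last digits of primes greater than 5:

```python
import random

# Primes above 5 can only end in one of these digits
# (any other last digit forces divisibility by 2 or 5).
CANDIDATE_DIGITS = (1, 3, 7, 9)

def guess_last_digit():
    """With no theorem favoring any digit, guess uniformly at random."""
    return random.choice(CANDIDATE_DIGITS)

# Under the robot's state of knowledge, each guess wins with probability 1/4.
```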

You should rather say that they’re in­de­pen­dent if you’re com­pu­ta­tion­ally limited. But that in­de­pen­dence doesn’t even mat­ter for this post. I think you’re as­sign­ing too many things to the word “ran­dom.” Prob­a­bil­ity is not about ran­dom, it’s about ig­no­rance.

EDIT: speak­ing of ig­no­rance, I prob­a­bly made this com­ment in ig­no­rance of what you ac­tu­ally in­tended, see more re­cent com­ment.

• Whoops. Well, I may have figured out what you mean. But just in case, would you be will­ing to try and ex­plain what sort of ex­tra rea­son­ing the robot should be do­ing in cer­tain cases as if I had no idea what you meant?

• What I’m say­ing is that the robot will ac­tu­ally do well in the given ex­am­ple, be­cause when you use Bayesian in­fer­ence on primes it works for some rea­son.

Other ques­tions in math have weird an­swers; for ex­am­ple, for ev­ery nat­u­ral num­ber n, the num­ber of smooth struc­tures on R^n is ex­actly 1… ex­cept for n=4, in which case there are un­countably many.

Probability distributions can continuously approximate this, but only with a lot of difficulty and inaccuracy. It’s not an answer that is going to nicely appear out of doing Bayesian thinking. Once an agent has seen one dimension with an infinite number of smooth structures, it will be hard pressed to accept that no other dimension has more than 1; or, having seen an infinite number of dimensions with exactly 1, how will it accept one with infinitely many? Basically, your priors have to be weirdly formalized. This may be a problem that has already been addressed.

So my point is that just “throw Bayes at it” is a totally reasonable way of making probabilistic estimates for theorems about countable and compact sets. It’s just that this method of reasoning will vary wildly in how useful it is, because some mathematical results look like probabilities and some don’t.

I don’t have ideas for extra reasoning processes; I just want to urge caution that the problem isn’t solved merely because this approach can handle some situations.

• Hm. I think you think the robot works differ­ently than it ac­tu­ally does.

My guess for what you were going to say was that you wanted the robot to work in this different way, but nope.

A robot that pre­dicts the next num­ber in a se­ries is some­thing like a min­i­mum mes­sage-length pre­dic­tor. The robot out­lined here ac­tu­ally can’t do that.

In­stead, the robot at­tacks each prob­lem based on the­o­rems that re­quire or rule out differ­ent com­bi­na­tions of rele­vant pos­si­bil­ities. So, for ex­am­ple, in the ques­tion “how many smooth struc­tures are there in n di­men­sions” (thank you for mak­ing that ex­am­ple clear btw), the robot would for each sep­a­rate case try to prove what was go­ing on, and if it failed, it would likely do some­thing dumb (like run out of time, or pick the first thing it hadn’t proven a the­o­rem about) be­cause it doesn’t han­dle in­finity-op­tion prob­lems very well. What it wouldn’t do is try and pre­dict the an­swer based on the an­swer for smaller di­men­sions, un­less it could prove the­o­rems con­nect­ing the two.

• Okay, but say the robot in ten seconds manages to prove the theorem “For all dimensions n not equal to 4, there is exactly one smooth structure on R^n,” but is unable to produce a result regarding n=4. Or, more realistically, say it is able to construct one smooth structure on R^n for any n, and can prove that no others exist unless n=4. How does it make guesses in the case n=4?

• If we want to live in the least convenient possible world, assume that in second 1 it constructs the smooth structures; it takes three seconds to prove that there are no more for n≥5, three seconds to prove no more for n=1,2, and three more seconds to prove there are no more for n=3, and then it runs out of time. These results are obtained incidentally from inequalities that arise when pursuing a proof for n=4, which is the central value of some equation at the core of the proofs. (So the proofs really say “if another smooth structure exists, it exists for n<5, 2<n<5, 3<n<5.”)

• If it re­ally can’t prove any the­o­rems that di­rectly in­clude the trans­la­tion of “the num­ber of smooth struc­tures for n=4 is,” it sim­ply won’t ever up­date that.

• Well it can prove “the num­ber of struc­tures for n=4 is at least 1.”

• One trouble might be that this robot isn’t equipped to handle improper distributions. So if you hand it an infinitude of finite simple groups and tell it to choose one, it assigns everything a probability of zero and chooses according to whatever algorithm it uses to choose between things that have identical utility.

I don’t think this is too hard to fix. The robot could have a unit of im­proper prior (ep­silon) that it re­mem­bers is larger than zero but smaller than any pos­i­tive real.

Of course, this doesn’t tell you what to do when asked to guess whether the or­der of the cor­rect finite sim­ple group is even, which might be a pretty big draw­back.

• For the robot as de­scribed, this will ac­tu­ally hap­pen (sort of like Wei Dai’s com­ment—I’m learn­ing a lot from dis­cussing with you guys :D ) - it only ac­tu­ally low­ers some­thing’s prob­a­bil­ity once it proves some­thing about it speci­fi­cally, so it just low­ers the prob­a­bil­ity of most of its in­finite op­tions by some big ex­po­nen­tial, and then, er, runs out of time try­ing to pick the op­tion with high­est util­ity. Okay, so there might be a small flaw.

• This is cool, but seems un­der­speci­fied. Could you write a pro­gram that car­ries out this rea­son­ing with re­spect to a fairly broad class of prob­lems?

If you have an inconsistent probability distribution, the result you get when you resolve it depends on the order in which you resolve the inconsistencies. For example, given P(A)=1/2, P(B)=1/2, P(AB)=1/2, and P(A¬B)=1/2, you could resolve this by deriving P(A¬B)=0 from the first three, or by deriving P(AB)=0 from the first two and the last one. Both of these answers seem obviously wrong. Does your method have a consistent way of resolving that sort of problem?
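The order dependence can be checked directly with those numbers; a minimal worked sketch:

```python
# The inconsistent assignments from the comment above.
P_A, P_B, P_AB, P_AnotB = 0.5, 0.5, 0.5, 0.5

# Resolution order 1: trust P(A), P(B), P(AB) and recompute the rest.
# Since P(A) = P(AB) + P(A and not-B), this forces:
derived_AnotB = P_A - P_AB    # = 0.0

# Resolution order 2: trust P(A), P(B), P(A and not-B) instead, forcing:
derived_AB = P_A - P_AnotB    # = 0.0

# Depending on which constraints you chose to trust first, a different
# statement gets zeroed out, from the same starting assignments.
```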

• A nice way of think­ing about it is that the robot can do un­limited prob­a­bil­is­tic logic, but it only takes finite time be­cause it’s only work­ing from a finite pool of proven the­o­rems. When do­ing the prob­a­bil­is­tic logic, the state­ments (e.g. A, B) are treated as atomic. So you can have effec­tive in­con­sis­ten­cies, in that you can have an atom that says A, and an atom that says B, and an atom that effec­tively says ‘AB’, and un­luck­ily end up with P(‘AB’)>P(A)P(B). But you can’t know you have in­con­sis­ten­cies in any way that would lead to math­e­mat­i­cal prob­lems. Once you prove that P(‘AB’) = P(AB), where re­mov­ing the quotes means break­ing up the atom into an AND state­ment, then you can do prob­a­bil­is­tic logic on it, and the max­i­mum en­tropy dis­tri­bu­tion will no longer be effec­tively in­con­sis­tent.

• Oh, I see. Do you know whether you can get different answers by atomizing the statements differently? For instance, will the same information always give the same resulting probabilities if the atoms are A and B as it would if the atoms are A and A-xor-B?

P(‘AB’)>P(A)P(B)

Not a prob­lem if A and B are cor­re­lated. I as­sume you mean P(‘AB’)>min(P(A), P(B))?

• Ah, right. Or even P(‘AB’)>P(A).

You can’t get differ­ent prob­a­bil­ities by at­om­iz­ing things differ­ently, all the atoms “already ex­ist.” But if you prove differ­ent the­o­rems, or the­o­rems about differ­ent things, then you can get differ­ent prob­a­bil­ities.

• A more effective robot only looks for the most probable digit; it doesn’t need to know how probable that digit is (edit: that is, in this specific problem, where utilities are equal; there is nothing ad hoc about the use of algebra). E.g., if it figured out that for some reason large primes are most likely to end in 3 (or rather, that no digit is more likely than 3), it does not even need to know that 0, 2, 4, 5, 6, 8 are not an option.

Fur­ther­more, as you lower the prime, there has to be a seam­less tran­si­tion to the robot ac­tu­ally calcu­lat­ing the last digit in an effi­cient man­ner.

A bounded agent has to rationally allocate its computing time. Calculating direct probabilities, or huge sets of relational probabilities, rather than the minimum necessary, is among the least rational things that can be done with the computing time. (An even worse thing to do would be a comparison between two estimates of utility which have different errors, such as two incomplete sums.) Outside very simple toy problems, probabilistic reasoning is incredibly computationally expensive, as often every possible combination of variables has to be considered.
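The cost claim can be made concrete: a full joint distribution over n binary variables has 2^n entries, so “every possible combination of variables” blows up fast. A minimal sketch:

```python
from itertools import product

def joint_table_size(n_vars):
    """Entries in a full joint distribution over n binary variables:
    every combination of truth values gets its own probability."""
    return 2 ** n_vars

# Enumerating the combinations explicitly for 3 variables gives 8 rows.
combos = list(product((False, True), repeat=3))
assert len(combos) == joint_table_size(3)

# For just 30 variables the table already has over a billion entries.
print(joint_table_size(30))  # 1073741824
```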

• A more effec­tive robot only looks for the most prob­a­ble digit

I agree with the rest of your com­ment, but this seems too ad hoc. It runs into trou­ble if out­comes differ in util­ity, so that you can’t just look for high prob­a­bil­ity. And stor­ing a num­ber seems like a much bet­ter way of in­te­grat­ing lots of in­de­pen­dent pieces of in­for­ma­tion than stor­ing a list.

• Then you look for the largest probability*utility, which you generally do by trying to find a way to demonstrate A>B; this you can do in many cases where you can’t actually evaluate either A or B (and in many cases where you can only evaluate A and B so inaccurately that the outcome of comparing the evaluations is primarily determined by the inaccuracies).

Furthermore, a “probability” is a list, due to loss of statistical independence with other variables. Edit: the pieces of information are very rarely independent, too. Some reasoning that 3 is more likely than other digits would not be independent of 2 being a bad choice.

edit: also, holy hell, the trillionth prime does end with 3.