A basis for pattern-matching in logical uncertainty

Previous logical uncertainty robot designs (e.g. here, here, here) have relied on proving theorems that relate to the statement under inspection (e.g. “the trillionth digit of pi is 5”), where a useful theorem would be something like “there are 10 possible mutually exclusive digits [1,2,3,4,5,6,7,8,9,0] the trillionth digit of pi could be.” This is nice. But it doesn’t do pattern-matching: if you tell a theorem-proving robot that a statement is true for the first thousand integers, it doesn’t even suspect that the statement might be true for the 1001st too. The statements are just independent black boxes.

In order to go further, our robot needs to peek inside the black box. But how, and why should it see patterns?

We tell our robot these facts: “3 is ‘odd’. 5 is ‘odd’. 7 is ‘odd’. 11 is ‘odd’.”

The robot asks “Could you just tell me what this concept ‘odd’ means explicitly? I don’t know English very well.”

“No,” we say. “You’ll have to guess.”

For robots that don’t make up information out of thin air, guesses about ‘odd’-ness will obey the maximum-entropy principle, which roughly means “give everything equal probability until you learn more.”

“Well, robot, what do you think is the probability that 9 is ‘odd’, given what we’ve told you?”

“50 percent. Beep boop.”

“But all those other things were odd, weren’t they?”

“Those other things were not the number 9. Imagine going number by number and labeling each number as ‘odd’ or ‘not odd’. There are equal numbers of possible labelings with 9 odd and 9 not odd, regardless of what those other numbers are. Since by the principle of maximum entropy I start out with all labelings equally likely, 9 has a 50% chance of being ‘odd’.”
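The robot’s counting argument can be checked by brute force in a toy setting. This sketch (the small universe of 1 through 11 is invented purely for illustration) enumerates every possible labeling and conditions on the examples the robot was given:

```python
from itertools import product

# Toy universe: label each number in 1..11 as 'odd' or not, in every
# possible way (2^11 labelings, all equally likely under maximum entropy).
universe = range(1, 12)
labelings = [dict(zip(universe, bits))
             for bits in product([True, False], repeat=len(universe))]

# Keep only the labelings consistent with what the robot was told.
told = [3, 5, 7, 11]
consistent = [lab for lab in labelings if all(lab[n] for n in told)]

# Among consistent labelings, what fraction label 9 as 'odd'?
p = sum(lab[9] for lab in consistent) / len(consistent)
print(p)  # 0.5
```

Exactly half of the surviving labelings mark 9 as ‘odd’, just as the robot says: the examples constrain only their own entries and tell the maximum-entropy robot nothing about 9.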

“But isn’t it just sensible to think that the next number we give you will be ‘odd’ if all the others were? What of the appeal of simplicity?”

“What’s so great about simplicity? I’m trying not to make up information out of thin air here.”

What do we tell our robot to make it guess that 9 is probably odd? Well, we want it to think that simpler patterns of odd-ness are more likely. So how about we let the robot peek at our definition of ‘odd’, just long enough to see that every integer can be tested for oddness, and how complicated the test is.

Even if our robot has no clue what the constituent symbols were (or, more to the point, not enough processing power to fully explore the ramifications in a more difficult and realistic scenario), knowing the mere number of characters defining ‘odd’ puts an upper bound on the Kolmogorov complexity of how the numbers are labeled. Heck, even just learning that we can write down the oddness-test is huge, since there are an infinite number of infinite-complexity rules.

Once the robot knows that the complexity of ‘odd’ is below a certain smallish value, low-complexity hypotheses like “all numbers are ‘odd’” and “odd numbers are ‘odd’” start to outweigh¹ bigger hypotheses like “all numbers except {long list including 9} are ‘odd’.” These simple hypotheses contain just the sorts of patterns we sometimes want the robot to see, like ‘odd’-ness.
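To see how a complexity cap tilts the scales, here is a toy sketch. The candidate rules and their description lengths are made up, and nothing below reproduces the post’s 65.1% figure; the point is only the mechanism: each rule consistent with the examples gets prior weight 2^−length, and we ask how much of that weight says 9 is ‘odd’.

```python
# Toy complexity-weighted guessing: each candidate rule for 'odd' is
# (name, predicate, description length). Lengths are invented for
# illustration; shorter rules get exponentially more prior weight
# (2 ** -length), echoing a Kolmogorov-style prior.
rules = [
    ("all numbers",   lambda n: True,                                       6),
    ("odd numbers",   lambda n: n % 2 == 1,                                 7),
    ("prime numbers", lambda n: n > 1 and all(n % d for d in range(2, n)),  8),
    ("odd, except 9", lambda n: n % 2 == 1 and n != 9,                     12),
]
told = [3, 5, 7, 11]

# Total weight of rules consistent with the labeled examples...
total = sum(2.0 ** -length for _, pred, length in rules
            if all(pred(n) for n in told))
# ...and the share of that weight belonging to rules that accept 9.
p9 = sum(2.0 ** -length for _, pred, length in rules
         if all(pred(n) for n in told) and pred(9)) / total
print(round(p9, 3))  # 0.85
```

All four rules fit the examples, but the two shortest ones (which accept 9) carry most of the prior mass, so the robot leans toward “9 is ‘odd’” without being certain.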

We tell our robot these facts: “3 is odd. 5 is odd. 7 is odd. 11 is odd. A number is ‘odd’ if a certain 14-character predicate is true of it.”

The robot asks “Could you just tell me what this concept ‘odd’ means explicitly?”

“No,” we say. “You’ll have to guess. What do you think is the probability that 9 is ‘odd’, given what we’ve told you?”

“65.1 percent.”

“Hah, got you! When we said ‘odd’ we secretly meant prime!”

I suspect that this kind of peeking handles most of the cases where we want something like pattern-matching (specifically, minimum-message-length prediction of patterns) in logical uncertainty. The obvious un-handled case, the choice of axioms or logical systems or that sort of thing, seems more like model uncertainty than logical uncertainty: the question is not really “is second-order logic true,” it’s “does second-order logic correspond to the world I want to model,” which is beyond the scope of this Discussion article.

¹ How do you turn the number of symbols we wrote it down with into a distribution over possible rules? It’s easiest to just maximum-entropy all valid rules with the correct number of symbols. Since many rules can be compressed, the set of rules with some number of symbols is smeared over lower possible Kolmogorov complexities, I think with an exponential preference towards lower complexity. But on the other hand, humans don’t hand out maximum-entropy samples of definitions; we’ll need probabilistic logic to do probabilistic logic. (Yo dawg, I heard you liked recursion, so we put this joke in itself so you can [this joke] while you [this joke].)

• The computer said 65%. That’s 35% to cover stuff like ‘prime’, hardly an embarrassing failure.

• Yeah the robot’s pretty great.

• We tell our robot these facts: “3 is ‘odd’. 5 is ‘odd’. 7 is ‘odd’. 11 is ‘odd’.” … “Well, robot, what do you think is the probability that 9 is ‘odd’, given what we’ve told you?”

Consider a parallel case:

We tell our robot these facts: “3 is ‘odd’. 5 is ‘odd’. 7 is ‘odd’. 11 is ‘odd’.” … “Well, robot, what do you think is the probability that 10 is ‘odd’, given what we’ve told you?”

The simplest hypothesis is that “odd” is just a synonym for “a natural number”.

• Yup, that is the simplest hypothesis. Interestingly, the next simplest by many lights are actually exclusions of a single simple number. Then we get to alternating numbers, and soon after that we get enough space to explode the number of possibilities beyond a modern computer’s ability to keep track of the implications, necessitating us to call logical uncertainty methods inside our logical uncertainty methods (yo dawg, I heard you like recursion...).

• Note that there are currently systems which can construct conjectures. See the discussion here. Also, PAC learning can do things similar to what you are talking about here.

• Recognizing some common characteristics of objects to be placed in the not-‘odd’ bin would also lower the upper bound on the complexity.

• Could you explain?

• “3 is odd. 5 is odd. 7 is odd. 11 is odd. A number is ‘odd’ if a 14-character predicate is true.”

Like, “Oh, and also 4 is not odd, nor is 8, nor is 10.”

If you want to rule out ‘prime’, you can add ‘2’ to that list.

• Sure, will you take some Python code as an example?

Without taking negative training data into account:

```python
possible_properties = []
for p in Object.properties():
    for x in training_set:
        if not x.has_property(p):
            break
    else:
        # Only keep p if no training example lacked it.
        possible_properties.append(p)
```

Taking negative training data into account, here we have a ‘positive set’ and a ‘negative set’:

```python
irrelevant_properties = []
for x in negative_set:
    for prop in x.properties():
        irrelevant_properties.append(prop)

relevant_properties = []
for p in Object.properties():
    for x in positive_set:
        if not x.has_property(p) or p in irrelevant_properties:
            break
    else:
        # Only keep p if every positive example has it and no
        # negative example does.
        relevant_properties.append(p)
```

See the difference? In the second case, relevant_properties is smaller. Note that this is not an optimal solution, since it looks up all possible properties in order to find the common properties of a training set; I wrote it because it’s a little more succinct than intersections.
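As the parent comment mentions, the same filtering can be written with set intersections. A hypothetical sketch, where the `Example` class and its `properties()` method are invented here just to make the snippet self-contained:

```python
# `Example` is a hypothetical stand-in for whatever object type
# exposes a properties() method returning a set of labels.
class Example:
    def __init__(self, props):
        self._props = set(props)

    def properties(self):
        return self._props

def relevant_properties(positive_set, negative_set):
    # Properties shared by every positive example...
    shared = set.intersection(*(x.properties() for x in positive_set))
    # ...minus any property that shows up on a negative example.
    irrelevant = set().union(*(x.properties() for x in negative_set))
    return shared - irrelevant

pos = [Example({"odd", "prime", "small"}), Example({"odd", "prime"})]
neg = [Example({"even", "small"})]
print(relevant_properties(pos, neg))  # {'odd', 'prime'} (order may vary)
```

Here ‘small’ is shared by a negative example, so it is discarded even though one positive example has it.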

• Abram Demski’s system does exactly this if you take his probability distribution and update on the statements “3 is odd”, “5 is odd”, etc. in a Bayesian manner. That’s because his distribution assigns a reasonable probability to statements like “odd numbers are odd”. Updating gives you reasonable updates on evidence.

• His distribution also assigns a “reasonable probability” to statements like “the first 3^^^3 odd numbers are ‘odd’, then one isn’t, then they go back to being ‘odd’.” In the low computing power limit, these are assigned very similar probabilities. Thus, if the first 3^^^3 odd numbers are ‘odd’, it’s kind of a toss-up what the next one will be.

Do you disagree? If so, could you use math in explaining why?

• What is “the low computing power limit”? If our theories behave badly when you don’t have computing power, that’s unsurprising. Do you mean “the large computing power limit”?

I think probability(“the first 3^^^3 odd numbers are ‘odd’, then one isn’t, then they go back to being ‘odd’”) / probability(“all odd numbers are ‘odd’”) is approximately 2^−(length of 3^^^3) in Abram’s system, because the probability of them appearing in the random process is supposed to be this ratio. I don’t see anything about the random process that would make the first one more likely to be contradicted before being stated than the second.

• What is “the low computing power limit”? If our theories behave badly when you don’t have computing power, that’s unsurprising. Do you mean “the large computing power limit”?

Nope. The key point is that as computing power becomes lower, Abram’s process allows more and more inconsistent models.

the probability of them appearing in the random process is supposed to be this ratio

The probability of a statement appearing first in the model-generating process is not equal to the probability that it’s modeled by the end.

• Nope. The key point is that as computing power becomes lower, Abram’s process allows more and more inconsistent models.

So does every process.

The probability of a statement appearing first in the model-generating process is not equal to the probability that it’s modeled by the end.

True. But for two very strong statements that contradict each other, there’s a close relationship.

• 29 Aug 2013 21:54 UTC

Highly related: Bayesian Concept Learning (pdf).

• This is just induction for sequence prediction. We have plenty of tools for this.

• I’m kind of confused. Did we really mean odds or primes? If we told the robot that this statement was true for N integers, shouldn’t we have said it correctly? If we did mean primes, then we could at least have been honest and said ‘2, 3, 5, 7’.

• Oh, whoops. I’ll fix that to make it more ambiguous. :)

Anyhow, it doesn’t matter what we meant: what was communicated to the robot was “a property shared by 3, 5, 7 that can be tested for in 14 characters.” The robot doesn’t really understand English labels.