# Using the universal prior for logical uncertainty (retracted)

(I’m not sure about any of this)

(Yup! Vanessa Kowalski pointed out a crucial error; see her comment below. The idea about the universal prior picking up all computable regularities is unproved. I’m leaving the post as it was, but please read it as a historical document only.)

Abram Demski had an old paper describing a naive way to use the universal prior for logical uncertainty. He pointed out some problems with it, but I’m not sure the problems are real. I think the approach will work surprisingly well.

The idea is that, since the universal prior is just a probability distribution over bit sequences, we don’t have to use it to predict bits one by one. We can also update it on facts like “the seventh bit is 1” without mentioning the previous six bits. To choose which bits to update on, we can mechanically search through all possible proofs in Peano arithmetic (PA) by increasing length. Whenever we prove a sentence with Gödel number N, we tell the prior that the Nth bit is 1; whenever we disprove one, we tell the prior that the bit is 0. After each update, the prior’s opinions about all other bits (and thus all other sentences in PA) will shift, hopefully in a meaningful way.
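The loop above can be sketched in a few lines. This is a minimal sketch, not a real implementation: both the proof enumerator and the conditionable prior are hypothetical interfaces.

```python
def feed_theorems_to_prior(settled_sentences, prior):
    """Sketch of the proposed update loop. `settled_sentences` yields
    (godel_number, was_proved) pairs as a mechanical search through PA proofs
    by increasing length settles sentences; `prior.condition(n, bit)` stands
    in for conditioning a sequence predictor on "the nth bit is `bit`"."""
    for n, was_proved in settled_sentences:
        prior.condition(n, 1 if was_proved else 0)
    return prior

class RecordingPrior:
    """Trivial stand-in for a conditionable prior: it just records the
    observed bits, so we can inspect which updates the loop performs."""
    def __init__(self):
        self.known_bits = {}

    def condition(self, n, bit):
        self.known_bits[n] = bit
```

For example, proving the sentence with Gödel number 3 and disproving the one with Gödel number 8 would condition the prior on bit 3 being 1 and bit 8 being 0, while every other bit stays governed by the prior’s predictions.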

It’s easy to see that the probability of any provable sentence will converge to 1, because at some point it becomes 1 by fiat, and disprovable sentences will likewise go to 0. How fast is the convergence? I think it will be extremely fast, far outpacing our slow proof search and subsuming any possible human or computer reasoning about PA—even though the prior knows nothing about PA and only receives opaque bits.

To see why, recall that the universal prior can pick up any computable regularities. For example, if you feed it a sequence where all even-numbered bits are 1 and odd-numbered bits are hard to predict, it will quickly become very certain that all future even-numbered bits will be 1. Similarly, if there’s some decidable set of Gödel numbers that for some reason correspond to provable sentences (no matter how long their proofs are), the universal prior will become very certain of that after receiving enough data. The same goes for regularities in relationships between bits, like “for all A and B, if A is provable and A→B is provable, then B is provable as well”. All such shortcuts will be learned in full generality as quickly as possible.
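A toy illustration of the even-bits regularity, using a two-hypothesis Bayesian mixture rather than the full universal prior (so this only gestures at the claim): a hypothesis saying “even-numbered bits are 1, odd bits are fair coins” gains a factor of 2 over an “all bits are fair coins” hypothesis on every even-numbered bit, so its posterior weight grows exponentially fast.

```python
import random

def posterior_even_bits_one(bits, prior=0.5):
    """Posterior weight of H1 = "even-numbered bits are 1, odd bits random"
    against H0 = "every bit is a fair coin", after observing `bits`."""
    like_h1 = like_h0 = 1.0
    for i, b in enumerate(bits):
        # H1 predicts even-numbered bits with certainty, odd bits as coin flips
        like_h1 *= (1.0 if b == 1 else 0.0) if i % 2 == 0 else 0.5
        like_h0 *= 0.5
    return prior * like_h1 / (prior * like_h1 + (1 - prior) * like_h0)

rng = random.Random(0)
seq = [1 if i % 2 == 0 else rng.randint(0, 1) for i in range(40)]
print(posterior_even_bits_one(seq))  # already above 0.999 after 40 bits
```

Of course the universal prior mixes over all programs, not two hypotheses, which is exactly the gap the retraction note points at.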

This approach to logical uncertainty can work well for questions that can be settled by PA, like digits of π. What about sentences that aren’t provable or disprovable in PA, like Con(PA)? It’s tempting to try and make the process converge for all of those as well, but I don’t want to do that, because many such sentences are basically hallucinations. Who cares what their probabilities are.

Why does this approach matter? Isn’t the universal prior uncomputable? Yes, but it’s approximable, so any approximation scheme could be used in the same way as logical induction (another computable process with uncomputably slow convergence). Also, the universal prior is well-studied and has many nice properties.

I don’t want to diminish the amazing work that the logical induction folks have done. In some ways (e.g. reflection) it works better than the universal prior. But it’s interesting to me that the idea of probabilistic reasoning about future theorems based on a small number of known theorems can be captured so well by a standard approach, similar to how UDT can be almost completely covered by a one-liner combination of standard game theory tools.

• > To see why, recall that the universal prior can pick up any computable regularities. For example, if you feed it a sequence where all even-numbered bits are 1 and odd-numbered bits are hard to predict, it will quickly become very certain that all future even-numbered bits will be 1. Similarly, if there’s some decidable set of Gödel numbers that for some reason correspond to provable sentences (no matter how long their proofs are), the universal prior will become very certain of that after receiving enough data. The same goes for regularities in relationships between bits, like “for all A and B, if A is provable and A→B is provable, then B is provable as well”. All such shortcuts will be learned in full generality as quickly as possible.

Can you cite the results you are relying on? There is a 2011 paper by Lattimore, Hutter and Gavane which shows that (a particular version of) Solomonoff induction can predict deterministic regular subsequences. However, it doesn’t cover fully general relationships between bits. For example, if future bit i is predictably (according to some computable regularity) equal to future bit j, afaik it is not proved that Solomonoff induction will assign both bits the same probability (it is only proved that it will correctly predict the later of the two bits after seeing all bits that precede it). See also the “Open Questions” section in the paper (there is no known stochastic generalization or known convergence rate). In fact, the ability to deal with fully general patterns is, I think, the main advantage of logical induction (that, and the fact that it combines this ability with applying computational resource bounds).

• Well, you’re right. I’ll put a retraction in the post. Thank you!

By asking for citations you’re giving me too much credit—I’m playing the game a few levels below you. I was going off the “obvious” idea that if our prior gives nonzero probability to a hypothesis, and we feed the prior a sequence of bits that matches the hypothesis, then the posterior probability of the hypothesis should rise to 1. Took me some time to realize that’s not true at all!

Here’s a simple counterexample. Say our hypothesis is “all even bits are 0”, and our prior is an equal mixture of “all even bits are 0 and all odd bits are random” and “all odd bits are 0 and all even bits are random”. Note that our hypothesis starts out with probability 1/2 according to the prior. But if we feed the prior a sequence of all 0’s, which matches the hypothesis, the posterior probability of the hypothesis won’t go to 1. It will keep oscillating forever. The prior just happens to be too good at guessing this particular sequence without help from our hypothesis.

I’d be surprised if something like that happened with computable hypotheses and the universal prior, but I don’t have a proof and couldn’t find one in a few hours. So thanks again.
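The counterexample above can be checked numerically (assuming 0-based indexing, so a prefix of odd length contains one more even-indexed position than odd-indexed ones):

```python
def posterior_after_zeros(n):
    """Posterior of H1 = "even bits are 0, odd bits random" (which here equals
    the probability of the hypothesis "all even bits are 0") after seeing n
    zeros, under the equal mixture of H1 and H2 = "odd bits are 0, even bits
    random"."""
    evens = (n + 1) // 2   # even-indexed positions among the first n bits
    odds = n // 2
    like_h1 = 0.5 ** odds  # H1 only has to guess the odd bits
    like_h2 = 0.5 ** evens # H2 only has to guess the even bits
    return like_h1 / (like_h1 + like_h2)

# The posterior bounces between 2/3 (after an odd-length prefix) and 1/2
# (after an even-length prefix) forever, never approaching 1.
print([round(posterior_after_zeros(n), 3) for n in range(1, 7)])
# → [0.667, 0.5, 0.667, 0.5, 0.667, 0.5]
```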

• Further, if Solomonoff Induction does get these problems right, it does so because of closure properties of the class of hypotheses, not because of properties of the way in which the hypotheses are combined.

In the Logical Induction framework, if you add a bunch of other uncomputable hypotheses, you will still get the good properties on the predictable sub-patterns of the environment.

If you start with the Solomonoff Induction framework, this is demonstrably not true: If I have an environment which is 1 on even bits and uncomputable on odd bits, I can add an uncomputable hypothesis that knows all the odd bits. It can gain trust over time, then spend that trust to say that the next even bit has a 90% chance to be a 0. It will take a hit from this prediction, but can earn back the trust from the odd bits and repeat.

• Isn’t that a complaint against Bayes, not just Solomonoff? Take any prior P. Take a sequence S whose even bits are all 1 and whose odd bits maximally disagree with P. Take another prior H that exactly knows the odd bits of S but thinks the even bits are random. Now the prior (P+H)/2 will never learn that all even bits of S are 1.

So I’m not quite ready to buy that complaint, because it seems to me that any good method should be at least approximately Bayesian. But maybe I don’t understand enough about what you’re doing...
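The (P+H)/2 construction above can be made concrete with made-up numbers (all of these are illustrative assumptions: P predicts each even bit of S is 1 with probability 0.9 but each odd bit is 1 only with probability 0.1, while S’s odd bits are all 1; H knows the odd bits exactly and treats even bits as fair coins). H’s likelihood dominates, so the mixture’s prediction for the next even bit tends to H’s 1/2 rather than converging to 1.

```python
def mixture_even_prediction(n_pairs):
    """Probability that (P+H)/2 assigns to "the next even bit of S is 1"
    after seeing n_pairs of (even, odd) bits of S, all equal to 1."""
    like_p = like_h = 1.0
    for _ in range(n_pairs):
        like_p *= 0.9 * 0.1  # P: right about the even bit, badly wrong about the odd bit
        like_h *= 0.5 * 1.0  # H: agnostic about the even bit, exactly right about the odd bit
    w_p = like_p / (like_p + like_h)
    return w_p * 0.9 + (1 - w_p) * 0.5

print(mixture_even_prediction(0), mixture_even_prediction(50))
# starts near 0.7 and tends to 0.5 — the mixture never learns the even bits
```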

• It is a complaint against Bayes, but it is only a complaint against using Bayes in cases where the real world is probability 0 in your prior.

Part of the point of logical induction is that logic is complicated and no hypothesis in the logical induction algorithm can actually predict it correctly in full, but the algorithm allows for the hypotheses to prove themselves on a sub-pattern, and have the ensemble converge to the correct behavior on that sub-pattern.

• Is there a simple “continuous” description of the class of objects that LI belongs to, which shows the point of departure from Bayes without relying on all details of LI? (For example, “it’s like a prior but the result also depends on ordering of input facts”.)

• Not really.

You can generalize LI to arbitrary collections of hypotheses, and interpret it as being about bit sequences rather than logic, but not much more than that.

The reason the LI paper talks about the LI criterion rather than a specific algorithm is to push in that direction, but it is not as clean as your example.

• I’m not sure I understand the question correctly, but what “LI” actually depends on is, more or less, a collection of traders plus a “prior” over them (although you can’t interpret it as an actual prior, since more than one trader can be important in understanding a given environment). Plus there is some ambiguity in the process of choosing fixed points (because there might be multiple fixed points).

• > To choose which bits to update on, we can mechanically search through all possible proofs in Peano arithmetic (PA) by increasing length. Whenever we prove a sentence with Gödel number N, we tell the prior that the Nth bit is 1; whenever we disprove one, we tell the prior that the bit is 0. After each update, the prior’s opinions about all other bits (and thus all other sentences in PA) will shift, hopefully in a meaningful way.

I don’t think this will give us what we want. There’s a short program which searches through all possible proofs in PA up to length, say, 10^100, then starts printing out bits: 1 for the Nth bit if it found a proof of the sentence with Gödel number N, 0 if it found a disproof of that sentence, and a coin toss if it found neither a proof nor a disproof. This program would do perfectly on the bits that you propose, and it’s hard to think of comparably short programs that would do just as well on those bits (and aren’t just variations of this with different proof length limits), so if you update the universal prior on this set of bits you’ll end up with a posterior that consists almost entirely of this type of program/hypothesis, which seems useless both conceptually and for practical purposes.

You could instead try updating the universal prior with the probability estimates of a human mathematician on various mathematical statements, but then after a while you might just be picking out that particular human from the space of all programs. I don’t know what else can be done without running into problems like these.
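The short proof-searching program described above can be sketched like this. The bounded prover is a stub standing in for real bounded proof search in PA (here it “decides” a toy fragment, so the sketch actually runs); everything about it is an assumption for illustration.

```python
import random

def bounded_prover(n, max_proof_len):
    """Hypothetical bounded proof search: True if the sentence with Gödel
    number n has a proof shorter than max_proof_len, False if it has a
    disproof, None if neither was found. Stubbed with a toy fragment where
    sentence #n is settled iff n < max_proof_len, and provable iff n is even."""
    if n < max_proof_len:
        return n % 2 == 0
    return None

def adversarial_bits(num_bits, max_proof_len=10**2, seed=0):
    """Emit 1 for proved sentences, 0 for disproved ones, and toss a coin
    for everything the bounded search left unsettled."""
    rng = random.Random(seed)
    out = []
    for n in range(num_bits):
        verdict = bounded_prover(n, max_proof_len)
        out.append(rng.randint(0, 1) if verdict is None else int(verdict))
    return out
```

On the bits the update scheme actually feeds in, such a program is never wrong, which is why it soaks up most of the posterior weight.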

• Sure, we probably end up with mostly slow programs. The interesting question to me is whether the approach works at all. If it does—if using sequence prediction for logical uncertainty is okay—then the next step is to try some other method of sequence prediction that rewards fast programs, like the speed prior. (Or it might happen anyway as we approximate the universal prior, though I’m not sure of that.) I wonder how well that would work, compared to the logical induction paper.

• Using only speed to evaluate models lands you with a lookup table that stores the results you want. So you have to trade off speed and length: The speediest table of math results has length O(N) and speed O(N) (maybe? Not at all sure). The shortest general program has length O(log(N)) and uncomputably fast-growing time. So if we think of a separable cost function F(length)+G(time), as long as F doesn’t grow super-fast nor G super-slow, eventually the lookup table will have a better score than brute-force search.

Ideally you want to find some happy medium—this is reminding me of my old post on approximate induction.
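The tradeoff can be made concrete with toy growth rates. All of the following are assumptions for illustration: F = identity and G = log2 (the speed-prior-like choice), a table of length ~N running in time ~N, and a brute-force searcher of length ~log2(N) whose runtime grows like 2^(2^N), so we keep its log-time as 2^N directly to avoid overflow.

```python
import math

def table_score(n):
    # lookup table: length ~ n, time ~ n, score = F(length) + G(time)
    return n + math.log2(n)

def search_score(n):
    # brute-force prover: length ~ log2(n), log2(time) ~ 2**n
    return math.log2(n) + 2.0 ** n

# Under these toy assumptions the table's score is already better for small n,
# and the gap only widens as n grows.
```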

• You’re Manfred! Huh. I noticed you making an unusual amount of sense in some comment threads, but didn’t make the connection. Great to meet you again!

Using a combination of length and time sounds right. The speed prior uses length+log(time); I don’t know if it’s the best choice, but it seems okay?

That said, I don’t quite understand your post. The universal prior is just a prior, not any particular approximation scheme. Sure, it’s equivalent to a mixture of all programs that predict bits exactly, but it’s also equivalent to a mixture of all programs that compute probability distributions over bits. (That fact is surprising and nontrivial, but true. It’s in the Li and Vitanyi book.) So you can approximate it on noisy data just fine.

• Nice to meet you again too :)

Yeah, I know that I’m somewhat sloppily identifying the priors with the programs that make up most of their weight. It makes it more convenient for me to think about what agents using those priors would do—though I am probably missing some details that would stem from using a computable approximation rather than the possibly uncomputable one of looking at the most successful few programs.

And it serves me right for not googling “speed prior.” I forgot it was length+log(time). I’m sure computational complexity people would have much more interesting things to say than me about why that might get you plenty of mathematics programs that are neither brute force nor lookup tables. Or maybe it’s just that taking the log of time to turn it into “bits” is the number one obvious thing to do.

If we think of the Solomonoff prior as the “real answer” and a speed-like prior as a convenience used for making computable predictions when computation has a cost, we could get some kind of principled answer from estimates of the benefit of predictive accuracy and the cost of time. I spent a few minutes and couldn’t figure it out—worst-case reasoning about predictive accuracy gets stuck on some very rare worst cases, and I’m not sure how to do the average-case reasoning right.