# The Power of Noise

Recently Luke Muehlhauser posed the question, “Can noise have power?”, which basically asks whether randomization can ever be useful, or whether for every randomized algorithm there is a better deterministic algorithm. This question was posed in response to a debate between Eliezer Yudkowsky and Scott Aaronson, in which Eliezer contends that randomness (or, as he calls it, noise) can’t ever be helpful, and Scott takes the opposing view. My goal in this essay is to present my own views on this question, as I feel many of them have not yet been brought up in discussion.

I’ll spare the reader some suspense and say that I basically agree with Scott. I also don’t think – as some others have suggested – that this debate can be chalked up to a dispute about the meaning of words. I really do think that Scott is getting at something important in the points he makes, which may be underappreciated by those without a background in a field such as learning theory or game theory.

Before I start, I’d like to point out that this is really a debate about Bayesianism in disguise. Suppose that you’re a Bayesian, and you have a posterior distribution over the world, and a utility function, and you are contemplating two actions A and B, with expected utilities U(A) and U(B). Then randomly picking between A and B will have expected utility $\frac{U(A)+U(B)}{2}$, so in particular at least one of A and B must have expected utility at least as high as that of randomizing between them. One can extend this argument to show that, for a Bayesian, the best strategy is always deterministic. Scott in fact acknowledges this point, although in slightly different language:

> “Randomness provably never helps in average-case complexity (i.e., where you fix the probability distribution over inputs) -- since given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.” – Scott Aaronson

I think this point is pretty iron-clad and I certainly don’t wish to argue against it. Instead, I’d like to present several examples of scenarios where I will argue that randomness clearly is useful and necessary, and use this to argue that, at least in these scenarios, one should abandon a fully Bayesian stance. At the meta level, this essay is therefore an argument in favor of maintaining multiple paradigms (in the Kuhnian sense) with which to solve problems.
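To spell out the convexity point numerically, here is a toy sketch (the utilities are made up for illustration): the expected utility of any mixed strategy is a weighted average of the pure actions’ expected utilities, so it can never exceed the best pure action.

```python
def mixed_utility(utilities, weights):
    # Expected utility of a randomized (mixed) strategy is just the
    # weighted average of the pure strategies' expected utilities.
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(u * w for u, w in zip(utilities, weights))

U_A, U_B = 3.0, 5.0  # hypothetical expected utilities of actions A and B
mix = mixed_utility([U_A, U_B], [0.5, 0.5])  # randomize 50/50
assert mix == (U_A + U_B) / 2   # = 4.0
assert mix <= max(U_A, U_B)     # randomizing never beats the best pure action
```

This is exactly the convexity argument in Scott’s quote above, written out for two actions.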

I will make four separate arguments, paralleling the four main ways in which one might argue for the adoption or dismissal of a paradigm:

1. Randomization is an appropriate tool in many concrete decision-making problems (game theory and Nash equilibria, indivisible goods, randomized controlled trials).

2. Worst-case analyses (which typically lead to randomization) are often important from an engineering design perspective (modularity of software).

3. Random algorithms have important theoretical properties not shared by deterministic algorithms (P vs. BPP).

4. Thinking in terms of randomized constructions has solved many problems that would have been difficult or impossible without this perspective (probabilistic method, sampling algorithms).

## 1. Concrete usefulness

Two people are playing rock-paper-scissors. After a while one of them starts to lose on average. What should they do? Clearly randomizing will make them stop losing, so they should probably do that. I want to contrast this with one of Eliezer’s points, which, roughly, is: “clearly you should randomize if you’re up against some adversarial superintelligence, but in this case we should label it as ‘adversarial superintelligence’ rather than worst case”. Many have taken this as evidence that Scott and Eliezer are really just arguing about the definition of words. I find it difficult to support this conclusion in light of the above rock-paper-scissors example: you don’t need to be playing an adversarial superintelligence, you just need to be playing anyone who is better than you at rock-paper-scissors, which is a reasonably common occurrence.
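As a quick sanity check of why randomizing stops the losses, here is a small simulation (illustrative only): a uniformly random player draws in expectation against any opponent whatsoever, because each opposing move wins, loses, and ties against it with probability 1/3.

```python
import random

random.seed(0)
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    # +1 for a win, -1 for a loss, 0 for a tie
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

# Against ANY fixed opponent strategy (here: always rock), uniform random
# play has expected payoff exactly 0 -- no skill gap can be exploited.
trials = 100_000
avg = sum(payoff(random.choice(sorted(BEATS)), "rock")
          for _ in range(trials)) / trials
print(avg)  # close to 0
```

The same expectation of 0 holds against an adaptive opponent, since the random player’s moves are independent of the history.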

A related situation shows up in many trade- and bargaining-related situations. For instance, suppose that I have a good that I value at $800, you value the same good at $1,000, but you only have $500 available to spend. We can still make a trade by agreeing that I will give you the good with probability 50%, in exchange for you paying me $450, leading us both to profit by $50 in expectation. This illustrates how randomness can be used to improve market liquidity. You might argue that we could have made the same trade without randomization by finding an event that we both agree has 50% subjective probability mass. But even if you are willing to do that, it is easy to see why another agent may not wish to—they may not wish to divulge information about their beliefs, they may not trust you to accurately divulge your own beliefs, they may not even represent their beliefs as probabilities in the first place, etc. Since agents with these properties are fairly natural, it seems necessary to randomize unless one is willing to give up on profitable trading opportunities.

As yet another example of concrete usefulness, let us consider randomized controlled trials—in other words, using randomness in an experimental design to ensure that we have a representative subset of a population in order to infer a particular causal mechanism. When it is possible to do so, randomization is an incredibly useful aspect of experimental design, freeing us to make strong inferences about the data that would be much more difficult or impossible to make with the same level of confidence without using randomization. For example, say that we want to know whether a job training program improves participants’ earnings. If we separate our sample into a treatment and a control group using randomization, and we find a real difference in earnings between the two, there is a strong case that the job training program is the reason for the difference in earnings. If we use any other method to separate the two groups, we run the risk that our method for separating treatment from control was correlated with some other factor that can explain a difference in earnings. If we follow the “randomness never helps” mantra here, then presumably we should carefully model all of the possible confounding effects in order to make inferences about the main effect—but often the purpose of experimentation is to better understand phenomena about which we currently know little, so it is unreasonable to think that we can actually model every possible confound accurately in this way. In this situation, “randomness never helps” requires us to pretend to know the answer even in situations where we clearly do not, which seems to me to be problematic. This is despite the fact that there doesn’t seem to be any “superintelligence”—or even intelligent adversary—involved in such situations.

## 2. Design properties

As noted in the introduction, if we know the probability distribution of the environment, then the optimal strategy is always deterministic (for now, I ignore the issue of whether finding this optimal strategy is feasible, since we are talking about whether randomness helps “in principle”). On the other hand, if we do not know the distribution of the environment, there can be good reasons to randomize, since this can help to “smooth out” potential unexpected structure (examples include randomized quicksort as well as certain load-balancing algorithms).
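Randomized quicksort is the textbook instance of this smoothing, and a minimal sketch fits in a few lines: choosing the pivot uniformly at random gives expected O(n log n) comparisons on every input, so no particular input (such as already-sorted data, which drives a deterministic first-element pivot to its O(n²) worst case) can reliably trigger bad behavior.

```python
import random

def quicksort(xs):
    # Random pivot choice "smooths out" input structure: the expected
    # running time is O(n log n) for EVERY input, with no assumption
    # about the distribution the inputs are drawn from.
    if len(xs) <= 1:
        return list(xs)
    pivot = random.choice(xs)
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    greater = [x for x in xs if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

# A sorted (or reverse-sorted) input, adversarial for a deterministic
# first-element pivot, is nothing special here.
print(quicksort(list(range(10, 0, -1))))
```

Note how this fits the modularity point below: the guarantee holds per call, with no contract about where the inputs come from.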
In other words, if we want to argue that randomness shouldn’t help, the most obvious alternative to using randomization would be to reason about the probability distribution of the environment, which seems to match what Eliezer has in mind, as far as I can tell. For instance, Eliezer says:

> I certainly don’t say “it’s not hard work”, and the environmental probability distribution should not look like the probability distribution you have over your random numbers—it should contain correlations and structure. But once you know what your probability distribution is, then you should do your work relative to that, rather than assuming “worst case”. Optimizing for the worst case in environments that aren’t actually adversarial, makes even less sense than assuming the environment is as random and unstructured as thermal noise.

Note that reasoning about the probability distribution of the environment requires assuming a particular distribution in the first place. I want to argue that this is often bad from a software design perspective. First, software is supposed to be modular. By this, I mean that software should be broken into components that:

1. Have a clearly specified and verifiable input-output contract.

2. Are re-usable in a wide variety of situations (for the most part, if the input-output contract can only be satisfied in one instance ever then it isn’t a very useful contract).

Pretty much everyone who works on software agrees with this. However, requiring that the inputs to a piece of software follow some probability distribution is the opposite of being modular. Why is this? First, it is impossible to verify from a single input whether the input contract is satisfied, since the contract is now over the distribution of inputs, rather than a single input. Second, requiring a certain distribution over inputs limits the software to the cases where the inputs actually follow that distribution (or a sufficiently similar distribution), thus greatly limiting the scope of applicability. Even worse, such a distributional requirement propagates backward through the entire system: if A calls B, which calls C, and C requires its inputs to follow a certain probability distribution, then B’s contract will probably also have to call for a certain probability distribution over inputs (unless all inputs to C from B are generated without reference to B’s inputs). On the other hand, software operating under worst-case analyses will (by definition) work regardless of input, and will not generate assumptions that propagate back to the rest of the system. Therefore it is important to be willing to adopt a worst-case perspective (or at the very least, something weaker than assuming a full distribution over inputs), at least in this instance. If one wants to argue that the randomness in this scenario is only useful practically but that something deterministic would “in principle” be better, then one would have to also argue that modularity of software is useful practically but something non-modular would “in principle” be better.

## 3. Theoretical properties

The P vs. BPP question is the question of whether, whenever there is a polynomial-time randomized algorithm that behaves correctly on every individual input with probability 99.9999999999%, there is also a polynomial-time deterministic algorithm that behaves correctly on every input.
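A standard concrete example of such an algorithm (my illustration, not from the original debate) is the Miller-Rabin primality test: a composite number fools any single random base with probability at most 1/4, so k independent rounds leave an error probability of at most 4⁻ᵏ on every input, which blows past the 99.9999999999% bar almost immediately. A sketch:

```python
import random

def is_probably_prime(n, rounds=40):
    # Miller-Rabin: a composite n survives one random base with
    # probability at most 1/4, so `rounds` independent bases leave an
    # error probability of at most 4**(-rounds) -- on EVERY input,
    # with no assumption about where n came from.
    if n < 2:
        return False
    small_primes = (2, 3, 5, 7, 11, 13)
    if n in small_primes:
        return True
    if any(n % p == 0 for p in small_primes):
        return False
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # found a witness: n is definitely composite
    return True  # "prime" with error probability <= 4**(-rounds)
```

With 40 rounds the error probability is below 10⁻²⁴, far smaller than the chance of a hardware fault corrupting a deterministic computation.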
If randomness were never useful, we would expect the answer to this question to be straightforwardly yes—after all, whatever the random algorithm was doing, we should be able to improve upon it, and since a deterministic algorithm is either right or wrong, the only way to improve upon 99.9999999999% is to go all the way up to 100%. In reality, however, this is a long-standing open problem in complexity theory: the equality of P and BPP is generally believed to hold, but is far from proven.

This point was already raised by Scott in his original thread, but did not seem to gain much traction. Several people were suspicious of the fact that the randomized algorithm only had to work almost all of the time on each input, while the deterministic algorithm had to work all of the time on each input. I have several responses to this. The first, somewhat flippant, response is that we could of course only require that the deterministic algorithm work almost all of the time for each input, as well; but a deterministic algorithm can only get the answer right or wrong, so “almost always” and “always” amount to the same thing for a deterministic algorithm.

Secondly, let us consider what it should mean for an algorithm to work “for all practical purposes”. If I have a randomized algorithm, and its probability of failure is so small as to be negligible compared to, say, cosmic rays hitting the computer and corrupting the output, then this seems to meet the bar of “for all practical purposes” (thus motivating the definition of BPP). On the other hand, what should “all practical purposes” mean for a deterministic algorithm? Some might argue that if, on average across inputs to the algorithm, the output is very likely to be correct, then that should satisfy the requirement—but this raises the question, what average should we use? An addition algorithm that thinks that 1+1=3 but is otherwise correct should surely not meet the “all practical purposes” bar, so the uniform distribution over inputs certainly shouldn’t be used. So, perhaps I should be figuring out what I expect my input distribution to look like, and measure according to that? But now, you see, we have added baggage that wasn’t in the original problem definition. All I said is that I had some abstract algorithmic problem that I wanted to solve for (some suitable definition of) all practical purposes. You’ve now told me that in order to do this, I need to figure out what the input distribution to the algorithm is going to be. But this is pretty unreasonable! Can you imagine an undergraduate computer science textbook full of algorithms that would work, but only if their inputs followed such-and-such a distribution? (I’m not claiming that such a guarantee is never useful, only disputing the notion that it should be raised up as the gold standard of algorithms on the same level as P and BPP.) I don’t think anyone would learn much from such a textbook, and computer science would be in a pretty woeful state if this was all we could offer. I therefore contend that the only really reasonable notion of deterministically solving a problem for all practical purposes is that it should always work correctly on all possible inputs, thus leading to the complexity class P. This is all a long-winded way of saying that the P vs. BPP question really is the right way to formally ask whether randomness can help from an algorithmic standpoint.

I’d also like to specially examine what happens if we use a Solomonoff prior (assigning probability $2^{-k}$ to programs of length $k$).
If there is any program of length L that solves the problem correctly, then any deterministic algorithm of length M that is incorrect with probability much less than $2^{-M-L}$ with respect to the Solomonoff prior must always work (since we can take the program that runs the two algorithms against each other until it finds a discrepancy, then uses that as the input). Therefore, “almost always” with respect to Solomonoff is more or less the same as “always”, so we haven’t really done anything different than redefine P in a very roundabout way. Crucially, this means that even if we are fully Bayesian, under a Solomonoff prior the question of whether every randomized polynomial-time algorithm has a deterministic counterpart still boils down to the unresolved and deep P vs. BPP question. [Addendum: it looks like Peter de Blanc and Wei Dai have made this last point already, see here and here.]

## 4. Usefulness in problem-solving

There are important combinatorial properties, such as expansion properties of graphs, which are satisfied by random graphs but which we don’t have any deterministic constructions for at all (or, in some cases, we do, but they came much later and were harder to come by). Many important problems in graph theory, Ramsey theory, etc. were solved by considering random combinatorial objects (this was one of the great triumphs of Paul Erdős), and thinking in purely deterministic terms seems very unlikely to have solved these problems. Taking a random/pseudorandom perspective has even led to some of the modern triumphs of mathematics such as the Green-Tao theorem (there are arbitrarily long arithmetic progressions of primes) and is fundamental to theoretical cryptography, game theory, convex analysis, and combinatorial approximation algorithms.
If we abandon the use of random constructions, then we also abandon some of the most important work in modern computer science, economics, and applied mathematics, which seems unacceptable to me.

On the algorithmic side, even if you are Bayesian, you want to somehow represent your posterior distribution, and one of the simplest ways to do so is by random samples. An alternative way is by a probability density function, but under extremely widely held complexity-theoretic assumptions there are many probability distributions that can be efficiently represented by samples but whose probability density function cannot be efficiently written down. More practically, the view of representing distributions by random samples has led to many important algorithms such as sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC), which form some of the cornerstones of modern statistical inference (MCMC in particular has been called one of the 10 greatest algorithms of the 20th century). Again, abandoning this work on philosophical grounds seems unacceptable.

Of course, it is theoretically possible that all of the above gains will eventually be reproduced without using randomness. However, it seems important to note that:

• Many of these are among the most beautiful and elegant results in their respective fields. I would object to any characterization of them as “hacks” that come down to a poorly-understood, lucky, but useful result.

• The obstacles to “derandomizing” such things appear to be quite deep, as extremely intelligent people have been working on them for decades without success. If one wants to say that a randomization-free, Bayesian approach is “correct even if not practical,” this becomes true for a very expansive definition of “practical,” and one must concede that being a Bayesian is often completely impractical even when a great deal of time and resources are available for reasoning.

• Thinking in terms of randomness is extremely useful in arriving at these deep and important results, and one of the main purposes of a paradigm is to provide modes of thought that lead to such results. I would therefore argue that randomization should be a fundamental lens (among many) with which to approach problems.

## Conclusion

I have given several examples of how randomness is useful in both commonplace and fundamental ways. A possible objection is that yes, randomness might be useful from a pragmatic perspective, but it is still the case that in theory we can always do better without randomness. I don’t think that such an objection answers the above arguments, except perhaps in some very weak sense. We saw how, in rock-paper-scissors, randomness is always going to be useful for one of the two players (unless both players are equally strong); we saw how using randomness is necessary to make certain profitable trades; and we saw how, even if we adopt a Solomonoff prior, the question of whether or not deterministic algorithms can compete with randomized algorithms is still open (and the likely resolution will rely upon constructing sufficiently pseudorandom numbers, rather than reasoning carefully about the distribution of inputs). Even if we discard all pragmatic concerns whatsoever, the above examples at the very least show that randomization is useful in a very deep and fundamental way. And importantly, at least from my perspective, we have also seen how randomization has been essential in resolving important problems across a variety of fields. This, if nothing else, is a compelling case that it should be adopted as a feature in one’s worldview, rather than discarded on philosophical grounds.

## Acknowledgements

Thanks to Holden Karnofsky for reading an early version of this essay and suggesting the randomized controlled trials example. Thanks also to Sindy Li for suggesting the idea of using randomness to improve market liquidity.

• My economics PhD dissertation related to this. I proposed that if asymmetric information is a cause of parties failing to settle a lawsuit, they could both improve their position using lotteries. Consider a simple example: I know that if we went to trial I would win at least $1 million, but trial would cost us both $200k. You don’t accept my settlement offer of $1 million because you think I’m lying about the strength of my case. So I propose the following: we flip a coin, and if the coin comes up heads we settle for $1 million, whereas if it comes up tails we go to trial, and if I fail to win at least $1 million I pay you a big penalty.

But then my dissertation concluded by saying that our failure to observe litigants using such lotteries is evidence against asymmetric information being a cause of parties’ failure to settle lawsuits.

• > But then my dissertation concluded by saying that our failure to observe litigants using such lotteries is evidence against asymmetric information being a cause of parties’ failure to settle lawsuits.

My inner Yudkowsky says “or maybe it just hasn’t occurred to them”.

• But I published my result in a prestigious journal in 1997 and told lots of high-status people about it, and still no lotteries.

• Did you ever ask anyone in a position to use a lottery why they wouldn’t? “People aren’t trying my idea” is evidence that it’s a bad idea, but weak evidence, preferably replaced by “People say they aren’t trying my idea because X” or “People aren’t trying my idea but can’t articulate why not” when possible.

• I asked Judge Richard Posner (one of my dissertation advisers) if he would be willing to use lotteries as a judge and he said no, it would get him impeached.

• The legal system is based on the legal fiction that the judge can infallibly make a decision. If the judge makes a decision in a way which is guaranteed to be fallible in a certain percentage of cases, he violates this assumption, even if the guaranteed fallibility from randomness is less than his normal fallibility when not using randomness.

• Interesting idea. Brazilian law explicitly admits lottery as a form of settling, but I’m not sure if that example with a penalty for not winning a lawsuit would be admissible.

• Replace “adversarial superintelligence” with “adversarial game”, and I think you’ll get more agreement among the participants. There are plenty of cases where a “mixed strategy” is optimal. Note that this is not noise, and true randomness isn’t necessary—it only needs to be unpredictable by the opponent, not necessarily random.

Where you don’t have an opponent (or at least one that makes predictions), I’m with Eliezer: noise never helps at a fundamental level.

I do believe that randomness has a place in thinking about problems, and it’s easier (for humans) to reason about randomness than insanely-complex deterministic calculations. But that’s a problem with the reader of the map, not with the map nor the territory.

• > Where you don’t have an opponent (or at least one that makes predictions), I’m with Eliezer: noise never helps at a fundamental level.

There is another case where noise helps: threshold effects. If you have a signal below a threshold, a bit of noise can push the signal up into the detectable region.
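(A minimal numeric sketch of this effect, sometimes called dithering or stochastic resonance; the numbers are made up for illustration: a 1-bit detector with threshold 1.0 never fires on a constant sub-threshold signal, but once uniform noise is added, its average firing rate recovers the signal level.)

```python
import random

random.seed(0)
THRESHOLD = 1.0

def detector(signal):
    # 1-bit detector: fires only when its (noisy) input crosses the threshold.
    return signal + random.random() >= THRESHOLD

s = 0.4  # sub-threshold: with no noise, the detector NEVER fires on s
trials = 100_000
rate = sum(detector(s) for _ in range(trials)) / trials
print(rate)  # averaging the noisy detector's output recovers s
```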

• > I’m with Eliezer: noise never helps at a fundamental level.

To my thinking, this is essentially equivalent to conjecturing that P = BPP, which is plausible but still might be false.

ETA: Didn’t read the post before replying to the parent (saw it in the sidebar). Now I see that a good quarter of the post is about P = BPP. Egg on my face!

• > Where you don’t have an opponent (or at least one that makes predictions), I’m with Eliezer: noise never helps at a fundamental level.

Several of my examples did not have opponents. To list three: the bargaining example, the randomized controlled trials example, and P vs. BPP.

• Thanks for writing this! This debate was bugging me too; I don’t have the background to dive into it in detail as in this post, but the fact that Eliezer was implicitly taking a pretty strong stance on P vs. BPP bugged me ever since I read the original LW post. This is also a great example of why I think LW needs more discussion of topics like computational complexity.

• > Many important problems in graph theory, Ramsey theory, etc. were solved by considering random combinatorial objects (this was one of the great triumphs of Paul Erdős) and thinking in purely deterministic terms seems very unlikely to have solved these problems.

From a Bayesian perspective, a probability is a cognitive object representing the known evidence about a proposition, flattened into a number. It wouldn’t make sense to draw conclusions about e.g. the existence of certain graphs just because we in particular are uncertain about the structure of some graph.

The “probabilistic method”, IMHO, is properly viewed as the “measure-theoretic method”, which is what mathematicians usually mean by “probabilistic” anyway. That is, constructions involving random objects can usually (always?) be thought of as putting a measure on the space of all objects, and then arguing about sets of measure 0 and 1 and so on. (I would be interested in seeing examples where this transformation is (a) not relatively straightforward or (b) impossible.) Although the math is the same up to a point, these are different conceptual tools. From Jaynes, Probability Theory:

> For example our system of probability could hardly, in style, philosophy, and purpose, be more different from that of Kolmogorov. What we consider to be fully half of probability theory as it is needed in current applications (the principles for assigning probabilities by logical analysis of incomplete information) is not present at all in the Kolmogorov system. Yet when all is said and done we find ourselves, to our own surprise, in agreement with Kolmogorov and in disagreement with his critics, on nearly all technical issues.

Whether thinking in terms of randomness is a useful conceptual tool is a different question; personally, I try to separate the intuitions into Bayesian (for cognition) and measure theory (for everything else, e.g. randomized algorithms, quantum mechanics, etc.). It would be nice if these were one and the same, i.e. if Bayesian probability were just measure-theoretic probability over sets of hypotheses, but I don’t know of a good choice for the hypothesis space. The Cox theorems work from basic desiderata of a probabilistic calculus, independent of any measure theory; that is the basis of Bayesian probability theory (see Jaynes, Chapter 2).

• It seems that in the rock-scissors-paper example the opponent is quite literally an adversarial superintelligence. They are more intelligent than you (at this game), and since they are playing against you, they are adversarial. The RCT example also has a lot of actors with different conflicts of interest, especially money- and career-wise, and some can come pretty close to adversarial.

• “Adversarial superintelligence” sounds like something you don’t have to worry about facing pre-singularity. “Someone who’s better than you at rock-paper-scissors” sounds rather more mundane. Using the former term makes the situation look irrelevant by sneaking in connotations.

• > Requiring that the inputs to a piece of software follow some probability distribution is the opposite of being modular.

What? There is very little software that doesn’t require inputs to follow some probability distribution. When provided with input that doesn’t match that (often very narrow) distribution, programs will throw it away, give up, or have problems.

You seem to have put a lot more thought into your other points; could you expand upon this a little more?

• Let me try to express it more clearly here:

I agree that it is both reasonable and common for programs to require that their inputs satisfy certain properties (or in other words, for the inputs to lie within a certain set). But this is different than requiring that the inputs be drawn from a certain probability distribution (in other words, requiring 5% of the inputs to be 0000, 7% to be 0001, 6% to be 0010, etc.). This latter requirement makes the program very non-modular, because invoking a method in one area of the program alters the ways in which I am allowed to invoke the method in other areas of the program (because I have to make sure that the total fraction of inputs that are 0000 remains 5% across the program as a whole).
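To make the contrast concrete, here is a toy sketch (hypothetical code, not from any real system): a per-input contract can be checked on each call in isolation, while a distributional contract is not even a statement about a single call, so the best a callee can do is track running counts across every call site in the program.

```python
# Per-input contract: verifiable on each call, in isolation (modular).
def f_property(x):
    assert 0 <= x < 16, "contract: x must be a 4-bit value"
    return x * 2

# Distributional contract: "5% of all inputs must be 0" is not a fact
# about any single call -- the opposite of modular.
class FDistributional:
    def __init__(self):
        self.counts = {}
        self.total = 0

    def __call__(self, x):
        self.counts[x] = self.counts.get(x, 0) + 1
        self.total += 1
        # Even with these counts, we can only ESTIMATE whether the
        # input distribution matches the contract, and only after
        # observing many calls from every caller combined.
        return x * 2
```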

Does this make more sense or is it still unclear? Thanks for the feedback.

• > But this is different than requiring that the inputs be drawn from a certain probability distribution (in other words, requiring 5% of the inputs to be 0000, 7% to be 0001, 6% to be 0010, etc.)

Well, I don’t know of a single piece of software which requires that its inputs come from specific probability distributions in addition to satisfying some set of properties.

In fact, wouldn’t that software object have to maintain some running counts to even be able to estimate from which distribution its inputs come?

• > Well, I don’t know of a single piece of software which requires that its inputs come from specific probability distributions in addition to satisfying some set of properties.

This is in some sense my point. If you want to be so hardcore Bayesian that you even look down on randomized algorithms in software, then presumably the alternative is to form a subjective probability distribution over the inputs to your algorithm (or perhaps there is some other obvious alternative that I’m missing). But I don’t know of many pieces of code that require their inputs to conform to such a subjective probability distribution; rather, the code should work for ALL inputs (i.e. do a worst-case rather than average-case analysis, which will in some cases call for the use of randomness). I take this to be evidence that the “never use randomness” view would call for software that most engineers would consider poorly designed, and as such is an unproductive view from the perspective of designing good software.

• Any soft­ware that uses ran­dom­ness re­quires you to meet a prob­a­bil­ity dis­tri­bu­tion over its in­puts, namely that the ran­dom in­put needs to be ran­dom. I as­sume that you’re not claiming that this breaks mod­u­lar­ity, as you ad­vo­cate the use of ran­dom­ness in al­gorithms. Why?

• Based on the discussions with you and Lumifer, I updated the original text of that section substantially. Let me know if it’s clearer now what I mean.

EDIT: Also, thanks for the feed­back so far!

• The prob­a­bil­ity dis­tri­bu­tion part is bet­ter, though I still don’t see how soft­ware that uses ran­dom­ness doesn’t fall un­der that (like­wise: com­pres­sion, image recog­ni­tion, sig­nal pro­cess­ing, and de­ci­sion mak­ing al­gorithms).

• If the software generates its own randomness (for instance, in Java you can do this by creating a new instance of a Random object and calling nextInt(), etc.), then I don’t see how this breaks modularity.

Compression algorithms like bzip don’t promise to make things smaller as part of their contract; they only promise to be lossless. To the extent that a user relies on them becoming smaller consistently, this can lead to trouble: for instance, if I expect my inputs of size 10 kilobytes to compress down to 2 kilobytes, and then many inputs in a row stay at 10 kilobytes, I can get unexpected load that could create issues.

I don’t know of many image recognition algorithms that are integrated into a large system without a human in the loop, except perhaps Google Image Search, which arguably has a human (the end-user) that filters the results at the end. This is precisely due to their non-modularity: they fail in unexpected and difficult-to-characterize ways, such that any attempt to enforce a contract or abstraction would probably be misguided at this point. The best they can currently promise is “we correctly implement the fast Fourier transform” and leave it up to the programmer to decide whether an FFT is what is merited in a given case.

ETA: Another way to put the bzip example, which might fit better with your language, is that yes, the guarantee of bzip is that it is lossless and that it will make your data smaller as long as the data fit the properties that bzip attempts to exploit; and if that is what we want the contract to be, then I would agree that bzip is non-modular.
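To make the bzip point concrete, here is a small sketch using Python’s bz2 bindings (the exact byte counts are illustrative, not part of any contract): structured input compresses dramatically, while incompressible input does not shrink at all — yet both round-trip losslessly.

```python
import bz2
import os

# bzip2's contract is losslessness, not size reduction: structured
# input compresses well, but incompressible input does not shrink.
repetitive = b"abcd" * 2500      # 10 KB of highly structured data
noise = os.urandom(10_000)       # 10 KB of incompressible bytes

compressed_rep = bz2.compress(repetitive)
compressed_noise = bz2.compress(noise)

assert bz2.decompress(compressed_rep) == repetitive  # lossless, always
assert len(compressed_rep) < 1_000                   # shrinks a lot
assert len(compressed_noise) >= 10_000               # doesn't shrink
```

A caller who budgeted bandwidth assuming the first case would be surprised by the second, which is exactly the “relying on the distribution rather than the contract” failure mode described above.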

• for instance, in Java you can do this by creating a new instance of a Random object and calling nextInt()

nitpick: Java’s default PRNG is a linear congruential generator. It’s not as bad as the infamous RANDU, but I wouldn’t suggest using it for anything that needs more than a hundred or so pseudo-random bits.
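To see why this nitpick matters: java.util.Random is the 48-bit linear congruential generator from the Java documentation, reimplemented below as a Python sketch (unsigned outputs; Java’s signed-int conversion is omitted). A handful of consecutive outputs suffices to brute-force the 16 hidden state bits and predict everything that follows.

```python
# Constants from java.util.Random's LCG.
MASK = (1 << 48) - 1
MULT = 0x5DEECE66D
INC = 0xB

def next_state(s):
    return (s * MULT + INC) & MASK

def output(s):
    return s >> 16  # top 32 bits of the state, as in Java's next(32)

def recover_state(first, rest):
    """Brute-force the 16 hidden state bits from observed outputs."""
    for low in range(1 << 16):
        t = (first << 16) | low
        candidate = t
        ok = True
        for o in rest:
            t = next_state(t)
            if output(t) != o:
                ok = False
                break
        if ok:
            return candidate
    return None

# Observe four outputs from a seeded generator...
state = (12345 ^ MULT) & MASK  # Java scrambles the seed like this
outs = []
for _ in range(4):
    state = next_state(state)
    outs.append(output(state))

# ...recover the state from the first three, then predict the fourth.
s = recover_state(outs[0], outs[1:3])
for _ in range(3):
    s = next_state(s)
assert output(s) == outs[3]  # prediction succeeds
```

This is why such generators are fine for simulations but unusable whenever unpredictability is part of the requirement.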

• bzip is a great source of ran­dom bits.

:)

• If you want to be so hard­core Bayesian that you even look down on ran­dom­ized al­gorithms in software

Do such peo­ple ex­ist? I don’t know any­one who fits such a de­scrip­tion.

I take this to be ev­i­dence that the “never use ran­dom­ness” view would call for soft­ware that most en­g­ineers would con­sider poorly-de­signed, and as such is an un­pro­duc­tive view from the per­spec­tive of de­sign­ing good soft­ware.

Well, of course. But again, has anyone been arguing against that? The original dispute was about abstract optimality and whether particular solutions exist. No one suggested that, in real-world practice at the current level of technology, the rand() call is Considered Harmful.

• Do such peo­ple ex­ist? I don’t know any­one who fits such a de­scrip­tion.

What about Eliezer?

• Well, let’s put it this way. If you can im­ple­ment a non-ran­dom­ized solu­tion that works within the speci­fied con­straints (time, etc.), it is prefer­able to a prob­a­bil­is­tic one. The real ques­tion is what hap­pens when you can’t. I doubt that Eliezer would re­fuse to im­ple­ment a prob­a­bil­is­tic solu­tion on the grounds that it’s not pure enough and so no solu­tion at all is bet­ter than a ver­sion tainted by its con­tact with the RNG.

An ex­cep­tion might be when you need to be ab­solutely sure what the soft­ware does or does not do and any ran­dom­ness in­tro­duces an un­ac­cept­able risk. But such a case is very rare.

• I doubt that Eliezer would re­fuse to im­ple­ment a prob­a­bil­is­tic solu­tion on the grounds that it’s not pure enough and so no solu­tion at all is bet­ter than a ver­sion tainted by its con­tact with the RNG.

Sure, I agree with this. But he still seems to think that in prin­ci­ple it is always pos­si­ble to im­prove over a ran­dom­ized al­gorithm. But do­ing so re­quires hav­ing some knowl­edge of the dis­tri­bu­tion over the en­vi­ron­ment, and that would break mod­u­lar­ity.

Whether or not Eliezer him­self is bas­ing this ar­gu­ment on Bayesian grounds, it cer­tainly seems to be the case that many com­menters are, e.g.:

However, if it is an important problem and you think you might be able to find some regularities, the best bet would be to do Bayesian updates on the most likely positions to be ones and preferentially choose those

And yet this is no differ­ent from a de­ter­minis­tic al­gorithm. It can also query O(1) bits, and “with high prob­a­bil­ity” have a cer­tain an­swer.

And some com­ments by Eliezer:

Quan­tum branch­ing is “truly ran­dom” in the sense that branch­ing the uni­verse both ways is an ir­re­ducible source of new in­dex­i­cal ig­no­rance. But the more im­por­tant point is that un­less the en­vi­ron­ment re­ally is out to get you, you might be wiser to ex­ploit the ways that you think the en­vi­ron­ment might de­part from true ran­dom­ness, rather than us­ing a noise source to throw away this knowl­edge.

I cer­tainly don’t say “it’s not hard work”, and the en­vi­ron­men­tal prob­a­bil­ity dis­tri­bu­tion should not look like the prob­a­bil­ity dis­tri­bu­tion you have over your ran­dom num­bers—it should con­tain cor­re­la­tions and struc­ture. But once you know what your prob­a­bil­ity dis­tri­bu­tion is, then you should do your work rel­a­tive to that, rather than as­sum­ing “worst case”. Op­ti­miz­ing for the worst case in en­vi­ron­ments that aren’t ac­tu­ally ad­ver­sar­ial, makes even less sense than as­sum­ing the en­vi­ron­ment is as ran­dom and un­struc­tured as ther­mal noise.

• Another way of swap­ping around the ques­tion is to ask un­der what cir­cum­stances Ja­cob Stein­hardt would re­fuse to use a PRNG rather than an RNG be­cause the PRNG wasn’t ran­dom enough, and whether there’s any in­stance of such that doesn’t in­volve an in­tel­li­gent ad­ver­sary (or that an­cient crude PRNG with bad dis­tri­bu­tional prop­er­ties that ev­ery­one always cites when this topic comes up, i.e., has that hap­pened more re­cently with an OK-ap­pear­ing PRNG).

Ob­vi­ously I don’t in­tend to take a stance on the math-qua-math ques­tion of P vs. BPP. But to the ex­tent that some­one has to as­sert that an al­gorithm’s good BPP-re­lated prop­er­ties only work for an RNG rather than a PRNG, and there’s no in­tel­li­gent ad­ver­sary of any kind in­volved in the sys­tem, I have to ques­tion whether this could rea­son­ably hap­pen in real life. Hav­ing writ­ten that sen­tence it doesn’t feel very clear to me. What I’m try­ing to point at gen­er­ally is that un­less I have an in­tel­li­gent ad­ver­sary I don’t want my un­der­stand­ing of a piece of code to de­pend on whether a par­tic­u­lar zero bit is “de­ter­minis­tic” or “ran­dom”. I want my un­der­stand­ing to say that the code has just the same effect once the zero is gen­er­ated, re­gard­less of what fac­tors gen­er­ated the zero; I want to be able to screen off the “ran­dom­ness” once I’ve looked at the out­put of that ran­dom­ness, and just ask about the effec­tive­ness of us­ing a zero here or a one there. Fur­ther­more I dis­trust any paradigm which doesn’t look like that, and re­ject it as some­thing I could re­ally-truly be­lieve, un­til the busi­ness about “ran­dom­ness” has been screened off and elimi­nated from the anal­y­sis. Un­less I’m try­ing to evade a cryp­to­graphic ad­ver­sary who re­ally can pre­dict me if I choose the wrong PRNG or write down my ran­dom bits some­place that some­one else can see them, so that writ­ing down the out­put of an RNG and then feed­ing it into the com­pu­ta­tion as a de­ter­minis­tic con­stant is gen­uinely worse be­cause my ad­ver­sary might sneak a look at the RNG’s out­put if I left it writ­ten down any­where. Or I’m try­ing to ran­dom­ize a study and pre­vent ac­ci­den­tal cor­re­la­tions with other peo­ple’s study, so I use an RNG just in case some­body else used a similar PRNG.

But oth­er­wise I don’t like my math treat­ing the same bit differ­ently de­pend­ing on whether it’s “ran­dom” or “de­ter­minis­tic” be­cause its ac­tual effect on the al­gorithm is the same and ought to be screened off from its ori­gins once it be­comes a 1 or 0.

(And there’s also a deep Bayesian is­sue here re­gard­ing, e.g., our abil­ity to ac­tu­ally look at the con­tents of an en­velope in the two-en­velope prob­lem and up­date our prior about amounts of money in en­velopes to ar­rive at the pos­te­rior, rather than find­ing it in­tu­itive to think that we picked an en­velope ran­domly and that the ran­dom­ized ver­sion of this al­gorithm will ini­tially pick the en­velope con­tain­ing the larger amount of money half the time, which I think is a very clear illus­tra­tion of the Bad Con­fused Thoughts into which you’re li­able to be led down a gar­den-path, if you op­er­ate in a paradigm that doesn’t find it in­tu­itive to look at the ac­tual value of the ran­dom bit and ask about what we think about that ac­tual value apart from the “ran­dom” pro­cess that sup­pos­edly gen­er­ated it. But this is­sue the mar­gins are too small to con­tain.)

Is that helpful?

• Ob­vi­ously I don’t in­tend to take a stance on the math-qua-math ques­tion of P vs. BPP. But to the ex­tent that some­one has to as­sert that an al­gorithm’s good BPP-re­lated prop­er­ties only work for an RNG rather than a PRNG, and there’s no in­tel­li­gent ad­ver­sary of any kind in­volved in the sys­tem, I have to ques­tion whether this could rea­son­ably hap­pen in real life.

If your ques­tion is “Is there a known prac­ti­cal case not in­volv­ing an in­tel­li­gent ad­ver­sar­ial en­vi­ron­ment where the use of a cryp­to­graphic PRNG or even a true RNG rather than a non-cryp­to­graphic PRNG is war­ranted?” Then the an­swer is no.
In fact, this is the rea­son why it is con­jec­tured that P = BPP.

How­ever, given the rest of your com­ment, it seems that you are refer­ring to how we un­der­stand the the­o­ret­i­cal prop­er­ties of al­gorithms:

What I’m try­ing to point at gen­er­ally is that un­less I have an in­tel­li­gent ad­ver­sary I don’t want my un­der­stand­ing of a piece of code to de­pend on whether a par­tic­u­lar zero bit is “de­ter­minis­tic” or “ran­dom”. I want my un­der­stand­ing to say that the code has just the same effect once the zero is gen­er­ated, re­gard­less of what fac­tors gen­er­ated the zero; I want to be able to screen off the “ran­dom­ness” once I’ve looked at the out­put of that ran­dom­ness, and just ask about the effec­tive­ness of us­ing a zero here or a one there. Fur­ther­more I dis­trust any paradigm which doesn’t look like that, and re­ject it as some­thing I could re­ally-truly be­lieve, un­til the busi­ness about “ran­dom­ness” has been screened off and elimi­nated from the anal­y­sis.

If we are talk­ing about un­der­stand­ing the the­o­ret­i­cal prop­er­ties of many use­ful ran­dom­ized al­gorithms, I’d say that we can’t “screen off” ran­dom­ness.
Even if the al­gorithm is im­ple­mented us­ing a PRNG with a con­stant seed, and is thus fully de­ter­minis­tic at the level of what ac­tu­ally runs on the ma­chine, when we rea­son about its the­o­ret­i­cal prop­er­ties, whether it is a for­mal anal­y­sis or a pre-for­mal in­tu­itive anal­y­sis, we need to ab­stract away the PRNG and as­sume that the al­gorithm has ac­cess to true ran­dom­ness.

As was already pointed out, if we were perfectly rational Bayesian agents, then we would always be able to include the PRNG in our analysis:
For instance, given a machine learning problem, a Bayesian agent would model it as a subjective probability distribution, and it may conclude that for that particular distribution the optimal algorithm is an implementation of the Random Forest algorithm with the Mersenne Twister PRNG initialized with seed 42.
For a slightly different subjective distribution, the optimal algorithm may be an implementation of a Neural Network trained with Error Backpropagation with weights initialized by a Linear Congruential PRNG with seed 12345.
For an­other slightly differ­ent sub­jec­tive dis­tri­bu­tion, the op­ti­mal al­gorithm may be some en­tirely differ­ent de­ter­minis­tic al­gorithm.

In prac­tice, we can’t rea­son this way. There­fore we as­sume true ran­dom­ness in or­der to say mean­ingful things about many prac­ti­cally use­ful al­gorithms.

• A more in­volved post about those Bad Con­fused Thoughts and the deep Bayesian is­sue un­der­ly­ing it would be re­ally in­ter­est­ing, when and if you ever have time for it.

• in prin­ci­ple it is always pos­si­ble to im­prove over a ran­dom­ized algorithm

in prin­ci­ple are the key words.

As the old say­ing goes, in the­ory there is no differ­ence be­tween the­ory and prac­tice but in prac­tice there is.

re­quires hav­ing some knowl­edge of the dis­tri­bu­tion over the en­vi­ron­ment, and that would break mod­u­lar­ity.

You are using the word “modularity” in a sense that seems weird to me. From my perspective, in the software context “modularity” refers to interactions between pieces of software, not inputs or distributions over the environment.

• Based on the discussions with you and trist, I updated the original text of that section substantially. Let me know if it’s clearer now what I mean.

Also, thanks for the feed­back so far!

• I think an example of what jsteinhardt is referring to would be quicksort. It can take an arbitrary list as an argument, but for many perversely ordered inputs it takes Omega(n^2) time. However, it does have an efficient average-case complexity of O(n log n). In other words, if the input is sampled from the uniform distribution over permutations, the algorithm finishes in O(n log n) time in expectation.

Many of the other examples that were considered are similar, in that the algorithm doesn’t give an error when given an input outside of the expected distribution, but rather silently works less effectively.

• I think an ex­am­ple of what jstein­hardt is refer­ring to would be quick­sort.

Quick­sort as a piece of soft­ware does not re­quire any par­tic­u­lar dis­tri­bu­tion. It is perfectly happy to work with “per­versely or­dered in­puts”.

A hu­man run­ning quick­sort with cer­tain ex­pec­ta­tions about its perfor­mance might re­quire a par­tic­u­lar dis­tri­bu­tion, but that’s not a char­ac­ter­is­tic of soft­ware.

Re­call that the origi­nal claim was that ex­pect­ing a par­tic­u­lar dis­tri­bu­tion breaks mod­u­lar­ity.

• If a par­tic­u­lar func­tion was ad­ver­tised as hav­ing O(n log(n)) run­ning time, and it turns out that it ac­tu­ally runs in O(n^2) on a cer­tain class of in­puts, I would not feel it to be be­hav­ing in a very mod­u­lar way. From an ab­strac­tion per­spec­tive, break­ing time or re­source as­sump­tions on an in­put can be as bad or worse than re­ject­ing it.

Granted, it’s pretty easy to get quick­sort to be­have on all but the most patholog­i­cal data sets.

• If a par­tic­u­lar func­tion was ad­ver­tised as hav­ing O(n log(n)) run­ning time, and it turns out that it ac­tu­ally runs in O(n^2) on a cer­tain class of in­puts, I would not feel it to be be­hav­ing in a very mod­u­lar way.

If you have been misled with respect to the behavior of an algorithm on a specific class of inputs, what does it have to do with modularity??

Mo­du­lar­ity refers to the in­de­pen­dence of a piece of soft­ware from the par­tic­u­lars of other pieces of soft­ware. It has noth­ing to do with perfor­mance met­rics.

• When picking which algorithms to use for particular pieces of your task, performance metrics are a significant factor. If you are told that your algorithm for Section A of the task runs in O(n log n) time, and you know that the Section B portion of the program, which requires a stream of inputs from Section A, will run in O(n (log n)^2), you can assume that Section B will be constantly supplied with inputs and the program can busy-wait for the brief portions where it does not have input. If it turns out that for your specific input distribution Section A is taking O(n^2), that’s a critical distinction with far-reaching effects on the design of the rest of your algorithm.

It’s some­what harder to provide an ex­am­ple for non-net­worked, non-par­allelized code, but I can do so on re­quest.

• I un­der­stand all that, but I’m read­ing it as “you should not make mis­takes about the ex­pected perfor­mance of your code and your ap­pli­ca­tion ar­chi­tec­ture should re­flect your use cases.”

There are many ways to screw up a soft­ware sys­tem.

• A hu­man run­ning quick­sort with cer­tain ex­pec­ta­tions about its perfor­mance might re­quire a par­tic­u­lar dis­tri­bu­tion, but that’s not a char­ac­ter­is­tic of soft­ware.

I think this may be a dis­tinc­tion with­out a differ­ence; mod­u­lar­ity can also be defined as hu­man ex­pec­ta­tions about soft­ware, namely that the soft­ware will be rel­a­tively easy to hook into a larger sys­tem.

• mod­u­lar­ity can also be defined as hu­man ex­pec­ta­tions about software

I don’t find this a use­ful defi­ni­tion, but YMMV, de gustibus, and all that...

• So you’re differentiating between properties, where the probability of [0 1 2 3] is 1-ɛ while >3 is ɛ, and probability distributions, where the probability of 0 is 0.01, 1 is 0.003, etc.? Got it. The only algorithms that I can think of that require the latter are those that require uniformly random input. I don’t think those violate modularity though, as any area of the program that interfaces with that module must provide independently random input (which would be the straightforward way to meet that requirement with an arbitrary distribution).

There’s a difference between requiring and being optimized for, though, and there are lots of algorithms that are optimized for particular inputs. Sort algorithms are an excellent example: if most of your lists are almost already sorted, there are algorithms that are cheaper on average but might take a long time on a number of rare orderings.
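Insertion sort is perhaps the cleanest sketch of this “optimized for, but not requiring” distinction: it accepts any ordering, runs in roughly linear time on almost-sorted input, and quietly degrades to quadratic on rare adversarial orderings.

```python
def insertion_sort_comparisons(a):
    """Insertion sort, returning the number of comparisons performed:
    cheap on almost-sorted input, quadratic on reversed input, but it
    never rejects an input."""
    a = list(a)
    comparisons = 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            comparisons += 1
            if a[j - 1] > a[j]:
                a[j - 1], a[j] = a[j], a[j - 1]
                j -= 1
            else:
                break
    return comparisons

n = 300
nearly_sorted = list(range(n))
nearly_sorted[10], nearly_sorted[20] = nearly_sorted[20], nearly_sorted[10]
reversed_input = list(range(n))[::-1]

assert insertion_sort_comparisons(nearly_sorted) < 2 * n           # ~ n
assert insertion_sort_comparisons(reversed_input) == n * (n - 1) // 2  # worst case
```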

• I’d also like to spe­cially ex­am­ine what hap­pens if we use a Solomonoff prior (as­sign­ing prob­a­bil­ity 2^-k to pro­grams of length k). If there is any pro­gram of length L that solves the prob­lem cor­rectly, then any de­ter­minis­tic al­gorithm of length M that is in­cor­rect with prob­a­bil­ity much less than 2^-(M+L) with re­spect to the Solomonoff prior must always work (since we can take the pro­gram that runs the two al­gorithms against each other un­til it finds a dis­crep­ancy, then uses that as the in­put). There­fore, “al­most always” with re­spect to Solomonoff is more or less the same as “always”, so we haven’t re­ally done any­thing differ­ent than re­define P in a very round­about way. Cru­cially, this means that even if we are fully Bayesian, un­der a Solomonoff prior the ques­tion of whether ev­ery ran­dom­ized polyno­mial-time al­gorithm has a de­ter­minis­tic coun­ter­part still boils down to the un­re­solved and deep P vs. BPP ques­tion.

I’ve read this 3 times, and still can’t tell what it’s sup­posed to mean. For starters, what is the “that” in “uses that as the in­put”? Gram­mat­i­cally, it should be “a dis­crep­ancy”, but a dis­crep­ancy is a pair of out­puts that are differ­ent, and I don’t see how to use that as an in­put. I’m also dis­mayed by the phrase “al­most always … is more or less the same as always”, which is a tau­tol­ogy, and so should not be a con­clu­sion.

• The randomized controlled trial is a great example where a superintelligence actually could do better by using a non-random strategy. Ideally, an AI could take its whole prior into account and do a value-of-information calculation. Even if it had no useful prior, that would just mean that any method of choosing is equally “random” under the AI’s knowledge.

• AI != perfectly ra­tio­nal agent

• Bayesian adap­tive clini­cal trial de­signs place sub­jects in treat­ment groups based on a pos­te­rior dis­tri­bu­tion. (Clini­cal tri­als ac­crue pa­tients grad­u­ally, so you don’t have to as­sign the pa­tients us­ing the prior: you as­sign new pa­tients us­ing the pos­te­rior con­di­tioned on ob­ser­va­tions of the cur­rent pa­tients.)

Th­ese adap­tive tri­als are, as you con­jec­ture, much more effi­cient than tra­di­tional ran­dom­ized tri­als.

Ex­am­ple: I-SPY 2. As­signs pa­tients to treat­ments based on their “bio­mark­ers” (biolog­i­cal mea­sure­ments made on the pa­tients) and the pos­te­rior de­rived from pre­vi­ous pa­tients.

When I heard one of the au­thors ex­plain adap­tive tri­als in a talk, he said they were based on multi-armed ban­dit the­ory, with a util­ity func­tion that com­bines ac­cu­racy of re­sults with welfare of the pa­tients in the trial.

How­ever, un­like in con­ven­tional multi-armed ban­dit the­ory, the trial de­sign still makes ran­dom de­ci­sions! The tri­als are still sort of ran­dom­ized: “adap­tively ran­dom­ized,” with pa­tients hav­ing a higher chance of be­ing as­signed to cer­tain groups than oth­ers, based on the cur­rent pos­te­rior dis­tri­bu­tion.
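A toy sketch of the “adaptively randomized” idea, using Thompson sampling for a two-arm Bernoulli bandit (the arm success rates below are made up for illustration): each new patient’s assignment is drawn from the current Beta posterior, so better-looking arms are chosen more often, yet the choice remains random.

```python
import random

def thompson_assign(successes, failures):
    """Adaptively randomized assignment: sample a success rate for each
    arm from its Beta posterior and assign the next patient to the arm
    with the highest sampled rate."""
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

# Simulate a two-arm trial where arm 1 truly works better.
true_rates = [0.3, 0.6]
succ, fail = [0, 0], [0, 0]
random.seed(0)
for _ in range(2000):
    arm = thompson_assign(succ, fail)
    if random.random() < true_rates[arm]:
        succ[arm] += 1
    else:
        fail[arm] += 1

# Most patients end up on the better arm, unlike a 50/50 randomization.
assert succ[1] + fail[1] > succ[0] + fail[0]
```

This captures the utility trade-off mentioned above: the design keeps extracting information about both arms while steering most patients toward the treatment that currently looks best.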

• I don’t un­der­stand this com­ment at all.

• JWW sug­gests that an AI could par­ti­tion trial sub­jects into con­trol and ex­per­i­men­tal groups such that ex­pected num­ber of events in both was equal, and pre­sum­ably also such that cases in­volv­ing as­sump­tions were dis­tributed equally, to min­i­mize the im­pact of as­sump­tions. For in­stance, an AI do­ing a study of re­sponses to an ar­tifi­cial sweet­ener could do some calcu­la­tions to es­ti­mate the im­pact of each gene on sugar metabolism, then par­ti­tion sub­jects so as to bal­ance their allele fre­quen­cies for those genes.

(A more ex­treme in­ter­pre­ta­tion would be that the AI is par­ti­tion­ing sub­jects and perform­ing the ex­per­i­ment not in a way de­signed to test a sin­gle hy­poth­e­sis, but to max­i­mize to­tal in­for­ma­tion ex­tracted from the ex­per­i­ment. This would be op­ti­mal, but a rad­i­cal de­par­ture from how we do sci­ence. Ac­tu­ally, now that I think of it, I wrote a grant pro­posal sug­gest­ing this 7 years ago. My idea was that molec­u­lar biol­ogy must now be done by in­ter­pos­ing a layer of ab­strac­tion via com­pu­ta­tional in­tel­li­gence in be­tween the sci­en­tist and the data, so that the sci­en­tist is fram­ing hy­pothe­ses not about in­di­vi­d­ual genes or pro­teins, but about causes, effects, or sys­tems. It was not well-re­ceived.)

There’s an­other com­ment some­where coun­ter­ing this idea by not­ing that this al­most re­quires om­ni­science; the method one uses to bal­ance out one bias may in­tro­duce an­other.

• This doesn’t re­quire om­ni­science, or AI: peo­ple do this now (based on info they have). If you have more info, we know how to use it (there is the­ory). Why are we talk­ing about AI, this is a math prob­lem.

Slightly harshly worded sug­ges­tion (not to you speci­fi­cally): maybe more read­ing, less in­vo­ca­tion of robo-Je­sus in vain.

• Eliezer claims that ran­dom­ness is always bad; many other peo­ple claim that one way ran­dom­ness is good is that it is un­bi­ased. Par­ti­tion­ing sub­jects into ex­per­i­men­tal con­di­tions must be un­bi­ased. Us­ing an al­gorithm and know­ing that its bi­ases are or­thog­o­nal to the phe­nomenon be­ing in­ves­ti­gated re­quires om­ni­science. Be­sides, if you knew in ad­vance what was rele­vant, you wouldn’t need to do the ex­per­i­ment.

That is what the com­ment means. The use of the term “AI” is just to show that the claim is that no real-world agent can be smart enough to do un­bi­ased par­ti­tion­ing in all cases, not just that we’re not smart enough to do it.

In prac­tice, a pos­si­bly bi­ased but in­tel­li­gent par­ti­tion­ing is bet­ter when the sam­ple size is small.

• We know what prop­erty we want (that ran­dom­iza­tion will give you), good bal­ance in rele­vant co­vari­ates be­tween two groups. I can use a de­ter­minis­tic al­gorithm for this, and in fact peo­ple do, e.g. match­ing al­gorithms. Another thing peo­ple do is try all pos­si­ble as­sign­ments (see: per­mu­ta­tion tests for the null).

Dis­cus­sion of AI and om­ni­science is a com­plete red her­ring, you don’t need that to show that you don’t need ran­dom­ness for this. We aren’t ran­dom­iz­ing for the sake of ran­dom­iz­ing, we are do­ing it be­cause we want some prop­erty that we can di­rectly tar­get de­ter­minis­ti­cally.

I don’t think EY can pos­si­bly know enough math to make his claim go through, I think this is an “in­tel­lec­tual mar­ket­ing” claim. Peo­ple do this a lot, if we are talk­ing about your claim, you won the game.

• If you sort all the subjects on one criterion, it may be correlated in an unexpected way with another criterion you’re unaware of. Suppose you want to study whether licorice causes left-handedness in a population from Tonawanda, NY. So you get a list of addresses from Tonawanda, sort them by address, and go down the list throwing them alternately into control and experimental groups. Then you mail the experimental group free licorice for ten years. Voila, after ten years there are more left-handers in the experimental group.

But even and odd ad­dresses are on op­po­site sides of the street. And it so hap­pens that in Ton­awanda, NY, the screen doors on the front of ev­ery house are hinged on the west side, re­gard­less of which way the house faces, be­cause the west wind is so strong it would rip the door off its hinges oth­er­wise. So peo­ple on the north side of the street, who are mostly in your ex­per­i­men­tal group, open the door with their left hand, get­ting a lot of ex­er­cise from this (the wind is very strong), while peo­ple on the south side open the screen door with their right hand.

It seems un­likely to me that many hid­den cor­re­la­tions would sur­vive al­ter­nat­ing picks from a sorted list like this rigged ex­am­ple, but if the sam­ple size is large enough, you’d still be bet­ter off ran­dom­iz­ing than fol­low­ing any de­ter­minis­tic al­gorithm, be­cause “ev­ery other item from a list sorted on X” has low Kol­mogorov com­plex­ity and can be repli­cated by an un­known cor­re­late of your ob­serv­able vari­able by chance.

• Eliezer claims that ran­dom­ness is always bad; many other peo­ple claim that one way ran­dom­ness is good is that it is un­bi­ased. Par­ti­tion­ing sub­jects into ex­per­i­men­tal con­di­tions must be un­bi­ased.

This is per­haps a use­ful place to illus­trate the “ran­dom­ness hath no power” ar­gu­ment: ran­dom­ness is un­bi­ased in ex­pec­ta­tion but we ac­tu­ally ex­pect the ab­solute amount of bi­ased­ness for a ran­domly se­lected as­sign­ment to be nonzero. When bi­as­ing fac­tors are known ahead of time, we do bet­ter by con­trol­ling for it di­rectly (with, say, a paired as­sign­ment).
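A sketch of that comparison: pure coin-flip assignment versus pairing on a known biasing covariate (the subject ages below are synthetic). Both schemes are unbiased in expectation, but the paired scheme’s typical realized imbalance is far smaller.

```python
import random

def coin_flip_assign(ages):
    """Assign each subject to group 0 or 1 by an independent fair coin."""
    groups = ([], [])
    for age in ages:
        groups[random.randrange(2)].append(age)
    return groups

def paired_assign(ages):
    """Sort on the known covariate, then split each adjacent pair
    between the two groups (randomly within the pair)."""
    ordered = sorted(ages)
    groups = ([], [])
    for a, b in zip(ordered[::2], ordered[1::2]):
        g = random.randrange(2)
        groups[g].append(a)
        groups[1 - g].append(b)
    return groups

def imbalance(groups):
    """Absolute difference in group mean age."""
    g0, g1 = groups
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))

random.seed(1)
ages = [random.uniform(20, 80) for _ in range(100)]

trials = 300
avg_coin = sum(imbalance(coin_flip_assign(ages)) for _ in range(trials)) / trials
avg_paired = sum(imbalance(paired_assign(ages)) for _ in range(trials)) / trials

# Pairing directly targets the balance property we actually want.
assert avg_paired < avg_coin
```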

• Ex­actly! This is a math prob­lem! And it be­comes a very com­pli­cated math prob­lem very quickly as the prior in­for­ma­tion gets in­ter­est­ing.

There’s noth­ing mag­i­cal about an AI; it can’t figure out any­thing a hu­man couldn’t figure out in prin­ci­ple. The differ­ence is the “su­per­in­tel­li­gence” bit: a su­per­in­tel­li­gent AI could effi­ciently use much more com­pli­cated prior in­for­ma­tion for ex­per­i­ment de­sign.

• I don’t understand the improvement you think is possible here. In a lot of cases, the math isn’t the problem; the theory is known. The difficulty is usually finding a large enough sample size, etc.

• There is a lot of statis­ti­cal liter­a­ture on op­ti­mal ex­per­i­men­tal de­sign, and it’s used all the time. Years ago at In­tel, we spent a lot of time on op­ti­mal de­sign of qual­ity con­trol mea­sure­ments, and I have no doubt a lot of in­dus­trial sci­en­tists in other com­pa­nies spend their time think­ing about such things.

The problem is, information is a model-dependent concept (derivatives of log-likelihood depend on the likelihood), so if your prior isn’t fairly strong, there isn’t a lot of improvement to be had. A lot of science is exploratory; trying to optimize the experimental design is premature.

Either way, this isn’t stuff you need an AI for at all, it’s stuff peo­ple talk about and think about now, to­day, us­ing com­puter as­sisted hu­man in­tel­lect.

• perform­ing the ex­per­i­ment not in a way de­signed to test a sin­gle hy­poth­e­sis, but to max­i­mize to­tal in­for­ma­tion ex­tracted from the experiment

De­sign­ing ex­per­i­ments to get more in­for­ma­tion than just ev­i­dence for a sin­gle hy­poth­e­sis is old hat.

• I’m go­ing to com­mit a so­cial faux pas and re­spond to my own com­ment, be­cause mul­ti­ple sub­threads are all say­ing es­sen­tially the same thing: this is just math, the the­ory is known, hu­mans can already do it (of­ten with some help from com­put­ers to get through the math).

As I’ve read it, one of the ma­jor take­aways of less­wrong is that AI is not mag­i­cal. If hu­mans can­not pos­si­bly figure out the the­ory, nei­ther can an AI. If hu­mans can­not pos­si­bly do the math (pos­si­bly with some help from a com­puter), nei­ther can an AI. Any­thing an AI can do, a hu­man can also do in prin­ci­ple. They differ only in the de­gree: AIs will even­tu­ally be able to do much more com­pli­cated math, solve much more com­pli­cated prob­lems, and self-im­prove much faster and more re­li­ably.

So if you look at my origi­nal sug­ges­tion and think “that’s noth­ing spe­cial, a hu­man can do that in the­ory” then you’re com­pletely cor­rect. Things hu­mans can do IN THEORY are EXACTLY the things with which an AI can help.

• Some­thing funny has hap­pened to the font of this post, mak­ing it difficult for me to read.

• Should be fixed now, I think.

• It seems to have to do with the fact that the font is some­times Times, and some­times what­ever the de­fault font is. I don’t see an easy way of fix­ing this that doesn’t in­volve me chang­ing lots of HTML tags by hand. If any­one knows how to fix it eas­ily then please let me know.

• Um, load your post into a text ed­i­tor and do a global search-and-re­place..?

• I’ve had some suc­cess in the past cut-n-past­ing text into Notepad (which au­to­mat­i­cally strips all for­mat­ting) and then back into the edit win­dow.

• Won’t that kill things like bul­lets, links, images, etc.?

• Yeah, I guess so. Still might be worth it rel­a­tive to chang­ing lots of HTML tags by hand. By my (in all like­li­hood, none too ac­cu­rate) count, you have four links, three la­tex ex­pres­sions (only one of which needs to be la­tex), two num­bered lists, one bul­let list, and a bunch of sec­tion ti­tles. Not too bad to do by hand...

• In­stead, I’d like to pre­sent sev­eral ex­am­ples of sce­nar­ios where I will ar­gue that ran­dom­ness clearly is use­ful and nec­es­sary, and use this to ar­gue that, at least in these sce­nar­ios, one should aban­don a fully Bayesian stance.

Sure. But, in order to check the communication channel: for which of those examples do you think I (as one of the people in the previous post arguing for seeing the debate as one over word-interpretation) would disagree with either your content or your presentation? How about Eliezer?

I think I should once again push the point that language and communication difficulties, not mathematical difficulties, are the primary drivers of this discussion. Eliezer’s original post, as I read it*, says that there is a limited but large class of problems where randomness shouldn’t help, that it’s disappointing that academics in that field don’t know this, and that the terminology they use lends itself to this confusion.

*I should point out I misinterpreted him in one summarization, at least, by shortening “kill phrase X and replace it with phrase Y” to “use phrase X as shorthand for phrase Y”.

• there is a limited but large class of problems where randomness shouldn’t help, and it’s disappointing that academics in that field don’t know this, and that the terminology they use lends itself to this confusion.

Which field are we talking about? What people? The weighted majority algorithm (the topic of the post that started this all) is one of the cornerstones of statistical learning theory. I would guess that pretty much everyone who knows statistical learning theory well already knows that pure strategies are optimal given complete (probabilistic) knowledge of the environment.
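For concreteness, here is a minimal sketch of the deterministic weighted majority algorithm mentioned above (the function name and interface are mine, not from the original post). The randomized variant would instead follow a single expert sampled in proportion to its weight, which improves the worst-case mistake bound against an adversary:

```python
import random  # only needed by the randomized variant, not this sketch

def weighted_majority(experts, outcomes, beta=0.5):
    """Deterministic weighted majority (Littlestone & Warmuth).

    experts: list of 0/1 prediction sequences, one per expert.
    outcomes: the true 0/1 labels, same length as each sequence.
    Returns the number of mistakes made by the aggregate predictor.
    """
    weights = [1.0] * len(experts)
    mistakes = 0
    for t, y in enumerate(outcomes):
        # Predict by weighted vote of the experts (ties go to 1).
        vote_1 = sum(w for w, e in zip(weights, experts) if e[t] == 1)
        vote_0 = sum(w for w, e in zip(weights, experts) if e[t] == 0)
        prediction = 1 if vote_1 >= vote_0 else 0
        if prediction != y:
            mistakes += 1
        # Multiplicatively penalize every expert that erred this round.
        weights = [w * (beta if e[t] != y else 1.0)
                   for w, e in zip(weights, experts)]
    return mistakes
```

The key guarantee is that the number of mistakes is within a constant factor (plus a log term) of the best single expert in hindsight, with no distributional assumptions on the input.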

• pure strategies are optimal given complete (probabilistic) knowledge of the environment.

Assuming the existence of closed-form solutions, which is not a given.

If your environment is sufficiently complex, you may not be able to discover the optimal pure strategy in reasonable time.

• I mean yes, I did just write a quarter of my post on this topic :).

• Apparently the easiest way to construct an expander graph is via randomness. Deterministic constructions are very difficult.
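To illustrate, a sketch of one standard random construction (the union-of-random-permutations trick; names and interface are mine): take the union of d random permutations of the vertices, so every vertex gets degree 2d. Such a multigraph is an expander with high probability, whereas explicit deterministic constructions (e.g. via zig-zag products) are far more involved.

```python
import random

def random_regular_multigraph(n, d, seed=None):
    """Build a 2d-regular multigraph on n vertices as the union of d
    random permutations: each permutation pi contributes the edges
    (v, pi(v)) for every vertex v. Self-loops and repeated edges are
    allowed; with high probability the result has good expansion."""
    rng = random.Random(seed)
    edges = []
    for _ in range(d):
        pi = list(range(n))
        rng.shuffle(pi)
        edges.extend((v, pi[v]) for v in range(n))
    return edges
```

Note that verifying expansion of the sampled graph (e.g. by checking the spectral gap) is itself computationally nontrivial, which is part of why derandomizing this is hard.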

• This illustrates how randomness can be used to improve market liquidity.

While an interesting idea, I believe most people just call this “gambling”.

Requiring that the inputs to a piece of software follow some probability distribution is the opposite of being modular.

I don’t understand this. It’s perfectly fine for a module (an object) to declare what kind of inputs it will accept. Modularity in software basically means “write to the declared interface and treat internals as a black box” and I don’t see why requiring a particular set of inputs is a problem.

Overall, though, I agree: randomness is highly useful in the real world, where even if an optimal solution is known to exist, it is often the case that you can’t spend the resources (notably, time) to figure it out.

• While an interesting idea, I believe most people just call this “gambling”.

I’m not sure what you’re driving at here. A gambling system where everybody has a net expected gain is still a good use of randomness.

• A gambling system where everybody has a net expected gain

So what’s the difference between a market where you can buy a chance to get some asset for cash, and a casino where you can buy a chance to get some asset for cash? You have non-coerced humans doing voluntary transactions in both of them.

Don’t forget that you still need a counterparty in the markets.

• I don’t understand this.

Attempted a clearer explanation in this comment. Any feedback you have would be great!