The Power of Noise

Recently Luke Muehlhauser posed the question, “Can noise have power?”, which basically asks whether randomization can ever be useful, or whether for every randomized algorithm there is a better deterministic algorithm. This question was posed in response to a debate between Eliezer Yudkowsky and Scott Aaronson, in which Eliezer contends that randomness (or, as he calls it, noise) can’t ever be helpful, and Scott takes the opposing view. My goal in this essay is to present my own views on this question, as I feel many of them have not yet been brought up in discussion.

I’ll spare the reader some suspense and say that I basically agree with Scott. I also don’t think – as some others have suggested – that this debate can be chalked up to a dispute about the meaning of words. I really do think that Scott is getting at something important in the points he makes, which may be underappreciated by those without a background in a field such as learning theory or game theory.

Before I start, I’d like to point out that this is really a debate about Bayesianism in disguise. Suppose that you’re a Bayesian: you have a posterior distribution over the world and a utility function, and you are contemplating two actions A and B, with expected utilities U(A) and U(B). Then randomly picking between A and B with equal probability has expected utility $\frac{U(A)+U(B)}{2}$, and so in particular at least one of A and B must have expected utility at least as high as randomizing between them. One can extend this argument to show that, for a Bayesian, the best strategy is always deterministic. Scott in fact acknowledges this point, although in slightly different language:

“Randomness provably never helps in average-case complexity (i.e., where you fix the probability distribution over inputs) -- since given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.” - Scott Aaronson
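In symbols, Scott’s convexity point is just that for any mixing probability $p$,

\[
p\,U(A) + (1-p)\,U(B) \;\le\; \max\{U(A),\, U(B)\},
\]

and the same inequality holds for a mixture over any finite set of actions, so some deterministic choice always does at least as well as the mixture.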

I think this point is pretty iron-clad and I certainly don’t wish to argue against it. Instead, I’d like to present several examples of scenarios where I will argue that randomness clearly is useful and necessary, and use this to argue that, at least in these scenarios, one should abandon a fully Bayesian stance. At the meta level, this essay is therefore an argument in favor of maintaining multiple paradigms (in the Kuhnian sense) with which to solve problems.

I will make four separate arguments, paralleling the four main ways in which one might argue for the adoption or dismissal of a paradigm:

  1. Randomization is an appropriate tool in many concrete decision-making problems (game theory and Nash equilibria, indivisible goods, randomized controlled trials).

  2. Worst-case analyses (which typically lead to randomization) are often important from an engineering design perspective (modularity of software).

  3. Random algorithms have important theoretical properties not shared by deterministic algorithms (P vs. BPP).

  4. Thinking in terms of randomized constructions has solved many problems that would have been difficult or impossible without this perspective (probabilistic method, sampling algorithms).

1. Concrete usefulness

Two people are playing rock-paper-scissors. After a while one of them starts to lose on average. What should they do? Clearly randomizing will make them stop losing, so they should probably do that. I want to contrast this with one of Eliezer’s points, which is, roughly: “clearly you should randomize if you’re up against some adversarial superintelligence, but in this case we should label it as ‘adversarial superintelligence’ rather than worst case”. Many have taken this as evidence that Scott and Eliezer are really just arguing about the definition of words. I find it difficult to support this conclusion in light of the above rock-paper-scissors example: you don’t need to be playing an adversarial superintelligence, you just need to be playing anyone who is better than you at rock-paper-scissors, which is a reasonably common occurrence.
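Here is a small simulation (a sketch of my own, not something from the original debate) in which the opponent is merely a slightly better player: it assumes we will repeat our previous move and counters it. The fixed deterministic strategy gets exploited; uniform randomization immediately stops the losses.

```python
import random

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value

def score(ours, theirs):
    """+1 if our move wins, -1 if it loses, 0 on a tie."""
    if ours == theirs:
        return 0
    return 1 if BEATS[ours] == theirs else -1

def opponent(our_history):
    """A slightly better player: assumes we repeat our last move and counters it."""
    if not our_history:
        return random.choice(MOVES)
    return next(m for m in MOVES if BEATS[m] == our_history[-1])

def average_score(strategy, rounds=10000):
    history, total = [], 0
    for _ in range(rounds):
        move = strategy(history)
        total += score(move, opponent(history))
        history.append(move)
    return total / rounds

always_rock = lambda history: "rock"                # deterministic and exploitable
uniform = lambda history: random.choice(MOVES)      # the mixed (randomized) strategy

print(average_score(always_rock))  # close to -1: loses nearly every round
print(average_score(uniform))      # close to 0: randomizing stops the losses
```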

A related situation shows up in many trade and bargaining settings. For instance, suppose that I have a good that I value at $800, you value the same good at $1,000, but you only have $500 available to spend. We can still make a trade by agreeing that I will give you the good with probability 50%, in exchange for you paying me $450, leading us both to profit by $50 in expectation. This illustrates how randomness can be used to improve market liquidity. You might argue that we could have made the same trade without randomization by finding an event that we both agree has 50% subjective probability mass. But even if you are willing to do that, it is easy to see why another agent may not wish to: they may not wish to divulge information about their beliefs, they may not trust you to accurately divulge your own beliefs, they may not even represent their beliefs as probabilities in the first place, etc. Since agents with these properties are fairly natural, it seems necessary to randomize unless one is willing to give up on profitable trading opportunities.
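Spelling out the expectations in that trade:

\[
\text{seller: } 450 - \tfrac{1}{2}\cdot 800 = +50,
\qquad
\text{buyer: } \tfrac{1}{2}\cdot 1000 - 450 = +50,
\]

so both sides gain $50 in expectation, even though the buyer could never afford the good outright.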

As yet another example of concrete usefulness, let us consider randomized controlled trials: in other words, using randomness in an experimental design to ensure that we have a representative subset of a population in order to infer a particular causal mechanism. When it is possible to do so, randomization is an incredibly useful aspect of experimental design, freeing us to make strong inferences about the data that would be much more difficult or impossible to make with the same level of confidence without using randomization. For example, say that we want to know whether a job training program improves participants’ earnings. If we separate our sample into a treatment and control group using randomization, and we find a real difference in earnings between the two, there is a strong case that the job training program is the reason for the difference in earnings. If we use any other method to separate the two groups, we run the risk that our method for separating treatment from control was correlated with some other factor that can explain a difference in earnings.
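As a minimal sketch of the role randomness plays here (my own illustration; the participant IDs are placeholders), random assignment is just a shuffle followed by a split, and the estimated effect is a difference in means:

```python
import random

def random_assignment(participant_ids, seed=None):
    """Randomly split participants into treatment and control groups.

    Because the split is random, the two groups are comparable in expectation
    on every characteristic, including confounders we never thought to measure.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]  # (treatment, control)

def estimated_effect(treatment_earnings, control_earnings):
    """Difference in mean earnings between treatment and control."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(treatment_earnings) - mean(control_earnings)

# Hypothetical usage with placeholder participant IDs:
treatment, control = random_assignment(range(1000), seed=0)
print(len(treatment), len(control))  # 500 500
```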

If we follow the “randomness never helps” mantra here, then presumably we should carefully model all of the possible confounding effects in order to make inferences about the main effect; but often the purpose of experimentation is to better understand phenomena about which we currently know little, so it is unreasonable to think that we can actually accurately model every possible confound in this way. In this situation, “randomness never helps” requires us to pretend to know the answer even in situations where we clearly do not, which seems to me to be problematic. This is despite the fact that there doesn’t seem to be any “superintelligence”, or even intelligent adversary, involved in such situations.

2. Design properties

As noted in the introduction, if we know the probability distribution of the environment, then the optimal strategy is always deterministic (for now, I ignore the issue of whether finding this optimal strategy is feasible, since we are talking about whether randomness helps “in principle”). On the other hand, if we do not know the distribution of the environment, there can be good reasons to randomize, since this can help to “smooth out” potential unexpected structure (examples include randomized quicksort, sketched at the end of this section, as well as certain load-balancing algorithms). In other words, if we want to argue that randomness shouldn’t help, the most obvious alternative to using randomization would be to reason about the probability distribution of the environment, which seems to match what Eliezer has in mind, as far as I can tell. For instance, Eliezer says:

I certainly don’t say “it’s not hard work”, and the environmental probability distribution should not look like the probability distribution you have over your random numbers—it should contain correlations and structure. But once you know what your probability distribution is, then you should do your work relative to that, rather than assuming “worst case”. Optimizing for the worst case in environments that aren’t actually adversarial, makes even less sense than assuming the environment is as random and unstructured as thermal noise.

Note that reasoning about the probability distribution of the environment requires assuming a particular distribution in the first place. I want to argue that this is often bad from a software design perspective. First, software is supposed to be modular. By this, I mean that software should be broken into components that:

1. Have a clearly specified and verifiable input-output contract.

2. Are re-usable in a wide variety of situations (for the most part, if the input-output contract can only be satisfied in one instance ever then it isn’t a very useful contract).

Pretty much everyone who works on software agrees with this. However, requiring that the inputs to a piece of software follow some probability distribution is the opposite of being modular. Why is this? First, it is impossible to verify from a single input whether the input contract is satisfied, since the contract is now over the distribution of inputs, rather than a single input. Second, requiring a certain distribution over inputs limits the software to the cases where the inputs actually follow that distribution (or a sufficiently similar distribution), thus greatly limiting the scope of applicability. Even worse, such a distributional requirement propagates backward through the entire system: if A calls B, which calls C, and C requires its inputs to follow a certain probability distribution, then B’s contract will probably also have to call for a certain probability distribution over inputs (unless all inputs to C from B are generated without reference to B’s inputs). On the other hand, software operating under worst-case analyses will (by definition) work regardless of input, and will not generate assumptions that propagate back to the rest of the system. It is therefore important to be willing to adopt a worst-case perspective (or at the very least, something weaker than assuming a full distribution over inputs), at least in this instance. If one wants to argue that the randomness in this scenario is only useful practically but that something deterministic would “in principle” be better, then one would also have to argue that modularity of software is useful practically but that something non-modular would “in principle” be better.
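To make the earlier “smoothing out unexpected structure” point concrete, here is a minimal sketch of randomized quicksort (my own illustration). Choosing the pivot uniformly at random gives O(n log n) expected comparisons on every input, with no distributional assumption on the caller, whereas simple deterministic pivot rules (first element, middle element, median-of-three) can each be forced into quadratic behavior by a suitably ordered input.

```python
import random

def randomized_quicksort(xs):
    """Quicksort with a uniformly random pivot.

    The random pivot makes the expected running time O(n log n) for every
    input ordering, with no assumption about where the input came from:
    exactly the kind of input-distribution-free contract a modular
    component can promise.
    """
    if len(xs) <= 1:
        return xs
    pivot = random.choice(xs)
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    greater = [x for x in xs if x > pivot]
    return randomized_quicksort(less) + equal + randomized_quicksort(greater)

print(randomized_quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```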

3. Theoretical properties

The P vs. BPP question is the question of whether, whenever there is a polynomial-time randomized algorithm that behaves correctly on every individual input with probability 99.9999999999%, there is also a polynomial-time deterministic algorithm that behaves correctly on every input. If randomness were never useful, we would expect the answer to this question to be straightforwardly yes: after all, whatever the randomized algorithm was doing, we should be able to improve upon it, and since a deterministic algorithm is either right or wrong, the only way to improve upon 99.9999999999% is to go all the way up to 100%. In reality, however, this is a long-standing open problem in complexity theory; the answer is generally believed to be yes, but a proof remains far out of reach.

This point was already raised by Scott in his original thread, but did not seem to gain much traction. Several people were suspicious of the fact that the randomized algorithm only had to work almost all of the time on each input, while the deterministic algorithm had to work all of the time on each input. I have several responses to this. The first, somewhat flippant, response is that we could of course only require that the deterministic algorithm work almost all of the time for each input as well; but a deterministic algorithm can only get the answer right or wrong, so “almost always” and “always” amount to the same thing for a deterministic algorithm.

Secondly, let us consider what it should mean for an algorithm to work “for all practical purposes”. If I have a randomized algorithm, and its probability of failure is so small as to be negligible compared to, say, cosmic rays hitting the computer and corrupting the output, then this seems to meet the bar of “for all practical purposes” (thus motivating the definition of BPP). On the other hand, what should “all practical purposes” mean for a deterministic algorithm? Some might argue that if, on average across inputs to the algorithm, the output is very likely to be correct, then that should satisfy the requirement; but this raises the question, what average should we use? An addition algorithm that thinks that 1+1=3 but is otherwise correct should surely not meet the “all practical purposes” bar, so the uniform distribution over inputs certainly shouldn’t be used. So, perhaps I should be figuring out what I expect my input distribution to look like, and measure according to that? But now, you see, we have added baggage that wasn’t in the original problem definition. All I said is that I had some abstract algorithmic problem that I wanted to solve for (some suitable definition of) all practical purposes. You’ve now told me that in order to do this, I need to figure out what the input distribution to the algorithm is going to be. But this is pretty unreasonable! Can you imagine an undergraduate computer science textbook full of algorithms that would work, but only if their inputs followed such-and-such a distribution? (I’m not claiming that such a guarantee is never useful, only disputing the notion that it should be raised up as the gold standard of algorithms on the same level as P and BPP.) I don’t think anyone would learn much from such a textbook, and computer science would be in a pretty woeful state if this was all we could offer. I therefore contend that the only really reasonable notion of deterministically solving a problem for all practical purposes is that it should always work correctly on all possible inputs, thus leading to the complexity class P. This is all a long-winded way of saying that the P vs. BPP question really is the right way to formally ask whether randomness can help from an algorithmic standpoint.
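To make the kind of guarantee BPP formalizes concrete, here is a standard textbook example of my choosing (not one discussed in the original debate): Freivalds’ algorithm for checking a matrix product. Each trial costs O(n^2) arithmetic operations, the error probability drops to 2^(-k) after k independent trials, and no deterministic method is known that verifies the product significantly faster than recomputing it.

```python
import random

def freivalds_check(A, B, C, trials=30):
    """Randomized check that A @ B == C for n-by-n integer matrices.

    Each trial multiplies by a random 0/1 vector, costing O(n^2) rather than
    a full matrix multiplication. If A @ B != C, a single trial detects the
    mismatch with probability at least 1/2, so the error probability after
    `trials` independent trials is at most 2**(-trials).
    """
    n = len(A)
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False   # definitely A @ B != C
    return True            # correct except with probability <= 2**(-trials)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]   # the true product
print(freivalds_check(A, B, C))  # True
```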

I’d also like to specially examine what happens if we use a Solomonoff prior (assigning probability $2^{-k}$ to programs of length $k$). If there is any program of length L that solves the problem correctly, then any deterministic algorithm of length M that is incorrect with probability much less than $2^{-(L+M)}$ with respect to the Solomonoff prior must always work (since we can take the program that runs the two algorithms against each other until it finds a discrepancy, then uses that as the input). Therefore, “almost always” with respect to Solomonoff is more or less the same as “always”, so we haven’t really done anything different than redefine P in a very roundabout way. Crucially, this means that even if we are fully Bayesian, under a Solomonoff prior the question of whether every randomized polynomial-time algorithm has a deterministic counterpart still boils down to the unresolved and deep P vs. BPP question.
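To spell out the counting behind the parenthetical argument (my paraphrase of it): let D be the deterministic algorithm of length M and P a correct program of length L, and suppose D errs on at least one input. The program “simulate D and P on successive inputs and print the first input on which they disagree” has length at most L + M + c for some constant c, so the input it prints receives Solomonoff weight at least $2^{-(L+M+c)}$, giving

\[
\Pr_{x \sim \text{Solomonoff}}\left[D(x)\ \text{is wrong}\right] \;\ge\; 2^{-(L+M+c)}.
\]

Contrapositively, any D whose error probability under the prior falls below this threshold must be correct on every input, which is why “almost always” under Solomonoff collapses to “always”.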

[Addendum: it looks like Peter de Blanc and Wei Dai have made this last point already, see here and here.]

4. Usefulness in problem-solving

There are important combinatorial properties, such as expansion properties of graphs, which are satisfied by random graphs but for which we don’t have any deterministic constructions at all (or in some cases we do, but they came much later and were harder to come by). Many important problems in graph theory, Ramsey theory, etc. were solved by considering random combinatorial objects (this was one of the great triumphs of Paul Erdős), and thinking in purely deterministic terms seems very unlikely to have solved these problems. Taking a random/pseudorandom perspective has even led to some of the modern triumphs of mathematics such as the Green-Tao theorem (the primes contain arbitrarily long arithmetic progressions) and is fundamental to theoretical cryptography, game theory, convex analysis, and combinatorial approximation algorithms. If we abandon the use of random constructions, then we also abandon some of the most important work in modern computer science, economics, and applied mathematics, which seems unacceptable to me.
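The canonical instance of the probabilistic method, Erdős’s lower bound on Ramsey numbers, shows the flavor of these arguments. Color each edge of the complete graph $K_n$ red or blue independently with probability 1/2; the expected number of monochromatic $k$-cliques is

\[
\binom{n}{k}\, 2^{\,1 - \binom{k}{2}},
\]

so whenever this quantity is less than 1, some coloring has no monochromatic $k$-clique at all, giving $R(k,k) > n$ and in particular $R(k,k) \ge 2^{k/2}$ for $k \ge 3$. No comparably strong explicit (deterministic) construction is known.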

On the algorithmic side, even if you are Bayesian, you want to somehow represent your posterior distribution, and one of the simplest ways to do so is by random samples. An alternative way is by a probability density function, but under extremely widely held complexity-theoretic assumptions there are many probability distributions that can be efficiently represented by samples but whose probability density function cannot be efficiently written down. More practically, the view of representing distributions by random samples has led to many important algorithms such as sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC), which form some of the cornerstones of modern statistical inference (MCMC in particular has been called one of the 10 greatest algorithms of the 20th century).
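As a minimal sketch of the sample-based view (my own illustration, not from the essay), a random-walk Metropolis sampler needs only the ability to evaluate an unnormalized log-density; it never has to write the density, or its normalizing constant, down in closed form:

```python
import math
import random

def metropolis(log_density, x0, n_samples, step=1.0, seed=None):
    """Random-walk Metropolis sampler for a one-dimensional unnormalized density.

    Only pointwise evaluation of log_density (up to an additive constant) is
    needed; the returned samples then represent the distribution directly.
    """
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)                 # symmetric proposal
        log_ratio = log_density(proposal) - log_density(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal                                     # accept with prob min(1, ratio)
        samples.append(x)
    return samples

# Example: sample from a standard normal "posterior" given only its log-density.
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_samples=5000, seed=0)
print(sum(samples) / len(samples))  # close to 0, the true mean
```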

Of course, it is theoretically possible that all of the above gains will eventually be reproduced without using randomness. However, it seems important to note that many of these are among the most beautiful and elegant results in their respective fields. I would object to any characterization of them as “hacks” that come down to poorly-understood, lucky, but useful results.

  • The obstacles to “derandomizing” such things appear to be quite deep, as extremely intelligent people have been working on them for decades without success. If one wants to say that a randomization-free, Bayesian approach is “correct even if not practical,” this becomes true for a very expansive definition of “practical,” and one must concede that being a Bayesian is often completely impractical even when a great deal of time and resources are available for reasoning.

  • Thinking in terms of randomness is extremely useful in arriving at these deep and important results, and one of the main purposes of a paradigm is to provide modes of thought that lead to such results. I would therefore argue that randomization should be a fundamental lens (among many) with which to approach problems.

Conclusion

I have given several examples of how randomness is useful in both commonplace and fundamental ways. A possible objection is that yes, randomness might be useful from a pragmatic perspective, but it is still the case that in theory we can always do better without randomness. I don’t think that such an objection answers the above arguments, except perhaps in some very weak sense. We saw how, in rock-paper-scissors, randomness is always going to be useful for one of the two players (unless both players are equally strong); we saw how using randomness is necessary to make certain profitable trades; and we saw how, even if we adopt a Solomonoff prior, the question of whether or not deterministic algorithms can compete with randomized algorithms is still open (and the likely resolution will rely upon constructing sufficiently pseudorandom numbers, rather than reasoning carefully about the distribution of inputs). Even if we discard all pragmatic concerns whatsoever, the above examples at the very least show that randomization is useful in a very deep and fundamental way. And importantly, at least from my perspective, we have also seen how randomization has been essential in resolving important problems across a variety of fields. This, if nothing else, is a compelling case that it should be adopted as a feature in one’s worldview, rather than discarded on philosophical grounds.

Acknowledgements

Thanks to Holden Karnofsky for reading an early version of this essay and suggesting the randomized controlled trials example. Thanks also to Sindy Li for suggesting the idea of using randomness to improve market liquidity.