The Weighted Majority Algorithm

Followup to: Worse Than Random, Trust In Bayes

In the wider field of Artificial Intelligence, it is not universally agreed and acknowledged that noise hath no power. Indeed, the conventional view in machine learning is that randomized algorithms sometimes perform better than their unrandomized counterparts, and that there is nothing peculiar about this. Thus, reading an ordinary paper in the AI literature, you may suddenly run across a remark: “There is also an improved version of this algorithm, which takes advantage of randomization...”

Now for myself I will be instantly suspicious; I shall inspect the math for reasons why the unrandomized algorithm is being somewhere stupid, or why the randomized algorithm has a hidden disadvantage. I will look for something peculiar enough to explain the peculiar circumstance of a randomized algorithm somehow doing better.

I am not completely alone in this view. E. T. Jaynes, I found, was of the same mind: “It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.” Apparently there’s now a small cottage industry in derandomizing algorithms. But so far as I know, it is not yet the majority, mainstream view that “we can improve this algorithm by randomizing it” is an extremely suspicious thing to say.

Let us now consider a specific example—a mainstream AI algorithm where there is, apparently, a mathematical proof that the randomized version performs better. By showing how subtle the gotcha can be, I hope to convince you that, even if you run across a case where the randomized algorithm is widely believed to perform better, and you can’t find the gotcha yourself, you should nonetheless trust that there’s a gotcha to be found.

For our particular example we shall examine the weighted majority algorithm, first introduced by Littlestone and Warmuth, but as a substantially more readable online reference I will use section 2 of Blum’s On-Line Algorithms in Machine Learning. (Blum has an easier-to-read, more concrete version of the proofs; and the unrandomized and randomized algorithms are presented side-by-side rather than far apart.)

The weighted majority algorithm is an ensemble method—a way to combine the advice from several other algorithms or hypotheses, called “experts”. Experts strive to classify a set of environmental instances into “positive” and “negative” categories; each expert produces a prediction for each instance. For example, each expert might look at a photo of a forest, and try to determine whether or not there is a camouflaged tank in the forest. Expert predictions are binary, either “positive” or “negative”, with no probability attached. We do not assume that any of our experts is perfect, and we do not know which of our experts is the best. We also do not assume anything about the samples (they may not be independent and identically distributed).

The weighted majority algorithm initially assigns an equal weight of 1 to all experts. On each round, we ask all the experts for their predictions, and sum up the weights for each of the two possible predictions, “positive” or “negative”. We output the prediction that has the higher weight. Then, when we see the actual result, we multiply by 1/2 the weight of every expert that got the prediction wrong.

Suppose the total number of experts is n, and the best expert makes no more than m mistakes over some given sampling sequence. Then we can prove that the weighted majority algorithm makes a total number of mistakes M that is bounded by 2.41*(m + log2(n)). In other words, the weighted majority algorithm makes no more mistakes than the best expert, plus a factor of log2(n), all times a constant factor of 2.41.

Proof (taken from Blum 1996; a similar proof appears in Littlestone and Warmuth 1989): The combined weight of all the experts at the start of the problem is W = n. If the weighted majority algorithm makes a mistake, at least half the total weight of experts predicted incorrectly, so the total weight is reduced by a factor of at least 1/4. Thus, after the weighted majority algorithm makes M mistakes, its total weight W is at most n*((3/4)^M). In other words:

W <= n*( (3/4)^M )

But if the best expert has made at most m mistakes, its weight is at least (1/2)^m. And since W includes the weight of all experts,

W >= (1/2)^m

Therefore:

(1/2)^m <= W <= n*( (3/4)^M )
(1/2)^m <= n*( (3/4)^M )
-m <= log2(n) + M*log2(3/4)
-m <= log2(n) - M*log2(4/3)
M*log2(4/3) <= m + log2(n)
M <= (1/log2(4/3))*(m + log2(n))
M <= 2.41*(m + log2(n))
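As a quick sanity check on the arithmetic: the constant really is 1/log2(4/3) ≈ 2.41, and the chained inequality becomes tight exactly at M = (1/log2(4/3))*(m + log2(n)). A few lines of Python confirm this (the values m = 7, n = 16 are arbitrary picks, not from the paper):

```python
import math

# The constant in the bound is 1/log2(4/3).
c = 1 / math.log2(4 / 3)
assert round(c, 2) == 2.41

# At M = c*(m + log2(n)) the two sides of (1/2)^m <= n*(3/4)^M meet,
# confirming the algebra above (m = 7, n = 16 chosen arbitrarily).
m, n = 7, 16
M = c * (m + math.log2(n))
assert abs(0.5 ** m - n * 0.75 ** M) < 1e-9
```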

Blum then says that “we can achieve a better bound than that described above”, by randomizing the algorithm to predict “positive” or “negative” with probability proportional to the weight assigned each prediction. Thus, if 3/4 of the expert weight went to “positive”, we would predict “positive” with probability 75%, and “negative” with probability 25%.
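The only change from the unrandomized version is the output rule; the weight updates are identical. A sketch of just that rule (the names here are illustrative):

```python
import random

def randomized_prediction(weights, preds, rng):
    """Predict 1 ("positive") with probability equal to the fraction
    of expert weight behind prediction 1; otherwise predict 0."""
    weight_for_1 = sum(w for w, p in zip(weights, preds) if p == 1)
    return 1 if rng.random() < weight_for_1 / sum(weights) else 0
```

With the weight split 3/4 to “positive” and 1/4 to “negative”, this outputs “positive” about 75% of the time, matching Blum’s description.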

An essentially similar proof, summing over the expected probability of a mistake on each round, will show that in this case:

M <= 1.39m + 2 ln(n)   (note: M is now an expectation)

Since the penalty applied to particular experts does not depend on the global prediction but only on the actual outcome, most of the proof proceeds as before. We have

W >= (1/2)^m

where again m is the number of mistakes made by the best expert and W is the total weight of all experts.

We also have that W is the starting weight n times the product of (1 − (1/2)F_i), where F_i is the fraction of mistaken expert weight on round i:

W = n * Prod_i (1 − (1/2)F_i)

And since we predict with probability proportional to the expert weight, the expected number of mistakes is just the sum over F_i:

M = Sum_i F_i

So:

(1/2)^m <= n * Prod_i (1 − (1/2)F_i)
-m*ln(2) <= ln(n) + Sum_i ln(1 − (1/2)F_i)
Sum_i -ln(1 − (1/2)F_i) <= ln(n) + m*ln(2)
Sum_i ((1/2)F_i) <= ln(n) + m*ln(2)
    (because x <= -ln(1 − x))
Sum_i F_i <= 2*(ln(n) + m*ln(2))
M <= 1.39m + 2 ln(n)

Behold, we have done better by randomizing!

We should be especially suspicious that the randomized algorithm guesses with probability proportional to the expert weight assigned. This seems strongly reminiscent of betting with 70% probability on blue, when the environment is a random mix of 70% blue and 30% red cards. We know the best bet—and yet we only sometimes make this best bet, at other times betting on a condition we believe to be less probable.

Yet we thereby prove a smaller upper bound on the expected error. Is there an algebraic error in the second proof? Are we extracting useful work from a noise source? Is our knowledge harming us so much that we can do better through ignorance?

Maybe the unrandomized algorithm fails to take into account the Bayesian value of information, and hence fails to exhibit exploratory behavior? Maybe it doesn’t test unusual cases to see if they might have surprising results?

But, examining the assumptions, we see that the feedback we receive is fixed, regardless of the prediction we output. Nor does the updating of the expert weights depend on the predictions we output. It doesn’t matter whether we substitute a completely random string of 1s and 0s as our actual output. We get back exactly the same data from the environment, and the weighted majority algorithm updates the expert weights in exactly the same way. So that can’t be the explanation.

Are the component experts doing worse than random, so that by randomizing our predictions, we can creep back up toward maximum entropy? But we didn’t assume anything about how often the component experts were right, or use that fact in the proofs. Therefore the proofs would carry through even if we specified that all experts were right at least 60% of the time. It’s hard to see how randomizing our predictions could help, in that case—but the proofs still go through, unchanged.

So where’s the gotcha?

Maybe I’m about to tell you that I looked, and I couldn’t find the gotcha either. What would you believe, in that case? Would you think that the whole thesis on the futility of randomness was probably just wrong—a reasonable-sounding philosophical argument that simply wasn’t correct in practice? Would you despair of being able to follow the math, and give up and shrug, unable to decide who might be right?

We don’t always see the flaws right away, and that’s something to always remember.

In any case, I’m about to start explaining the gotcha. If you want to go back, read the paper, and look yourself, you should stop reading now...

There does exist a rare class of occasions where we want a source of “true” randomness, such as a quantum measurement device. For example, you are playing rock-paper-scissors against an opponent who is smarter than you are, and who knows exactly how you will be making your choices. In this situation it is wise to choose randomly, because any method your opponent can predict will do worse than average. Assuming that the enemy knows your source code has strange consequences: the action sequence 1, 2, 1, 3 is good when derived from a quantum noise generator, but bad when derived from any deterministic algorithm, even though it is the same action sequence. Still, it is not a totally unrealistic situation. In real life, it is, in fact, a bad idea to play rock-paper-scissors using an algorithm your opponent can predict. So are we, in this situation, deriving optimization from noise?

The random rock-paper-scissors player does not play cleverly, racking up lots of points. It does not win more than 1/3 of the time, or lose less than 1/3 of the time (on average). The randomized player does better because its alternatives perform poorly, not from being smart itself. Moreover, by assumption, the opponent is an intelligence whom we cannot outsmart and who always knows everything about any method we use. There is no move we can make that does not have a possible countermove. By assumption, our own intelligence is entirely useless. The best we can do is to avoid all regularity that the enemy can exploit. In other words, we do best by minimizing the effectiveness of intelligence within the system-as-a-whole, because only the enemy’s intelligence is allowed to be effective. If we can’t be clever ourselves, we might as well think of ourselves as the environment and the enemy as the sole intelligence within that environment. By becoming the maximum-entropy environment for rock-paper-scissors, we render all intelligence useless, and do (by assumption) the best we can do.
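This asymmetry is easy to exhibit numerically. In the toy model below (my own construction, not from the post), the opponent plays a fixed counter to whatever it predicts: a deterministic player it can read loses every round, while a seeded random player gives it nothing to exploit—but still wins only about 1/3 of the time.

```python
import random

MOVES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def score(player, opponent):
    # +1 win, 0 tie, -1 loss, from the player's perspective.
    if player == opponent:
        return 0
    return 1 if BEATS[player] == opponent else -1

# A predictable player (always rock) against an opponent who reads it
# and answers with the counter-move (paper): a loss every single round.
assert all(score("rock", "paper") == -1 for _ in range(100))

# A random player denies the opponent anything to exploit, but it still
# wins only ~1/3 of rounds: randomness caps the damage, it adds no skill.
rng = random.Random(0)
results = [score(rng.choice(MOVES), "paper") for _ in range(30000)]
win_rate = results.count(1) / len(results)
assert 0.30 < win_rate < 0.37
```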

When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy. (In a practical sense you’re just screwed.) Again, this is not because we’re extracting useful work from a noise source; it’s because the most powerful intelligence in the system is adversarial, and we’re minimizing the power that intelligence can exert in the system-as-a-whole. We don’t do better than average; we merely minimize the extent to which the adversarial intelligence produces an outcome that is worse than average (from our perspective).

Similarly, cryptographers have a legitimate interest in strong randomness generators because cryptographers are trying to minimize the effectiveness of an intelligent adversary. Certainly entropy can act as an antidote to intelligence.

Now back to the weighted majority algorithm. Blum (1996) remarks:

“Intuitively, the advantage of the randomized approach is that it dilutes the worst case. Previously, the worst case was that slightly more than half of the total weight predicted incorrectly, causing the algorithm to make a mistake and yet only reduce the total weight by 1/4. Now there is roughly a 50/50 chance that the [randomized] algorithm will predict correctly in this case, and more generally, the probability that the algorithm makes a mistake is tied to the amount that the weight is reduced.”

From the start, we did our analysis for an upper bound on the number of mistakes made. A global upper bound is no better than the worst individual case; thus, to set a global upper bound we must bound the worst individual case. In the worst case, our environment behaves like an adversarial superintelligence.

Indeed randomness can improve the worst-case scenario, if the worst-case environment is allowed to exploit “deterministic” moves but not “random” ones. It is like an environment that can decide to produce a red card whenever you bet on blue—unless you make the same bet using a “random” number generator instead of your creative intelligence.

Suppose we use a quantum measurement device to produce a string of random ones and zeroes; make two copies of this string; use one copy for the weighted majority algorithm’s random number generator; and give the other copy to an intelligent adversary who picks our samples. In other words, we let the weighted majority algorithm make exactly the same randomized predictions, produced by exactly the same generator, but we also let the “worst case” environment know what these randomized predictions will be. Then the improved upper bound of the randomized version is mathematically invalidated.
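A small simulation (my own construction, under the assumptions of this thought experiment) makes the point concrete: give the environment a copy of the algorithm’s random bits, let it choose each outcome after seeing the randomized prediction, and the expected-mistake bound M <= 1.39m + 2 ln(n) collapses.

```python
import math
import random

rng = random.Random(0)
n, T = 2, 200
weights = [1.0, 1.0]          # two experts who always disagree: 0 vs 1
expert_mistakes = [0, 0]
our_mistakes = 0

for _ in range(T):
    preds = [0, 1]
    # Randomized weighted majority: predict 1 with probability
    # proportional to the weight behind prediction 1.
    p1 = weights[1] / (weights[0] + weights[1])
    guess = 1 if rng.random() < p1 else 0
    # The "worst case" environment holds a copy of our random string,
    # so it can set the true label to whatever we did not predict.
    truth = 1 - guess
    our_mistakes += 1  # by construction, we err on every round
    for i, p in enumerate(preds):
        if p != truth:
            expert_mistakes[i] += 1
            weights[i] *= 0.5

m = min(expert_mistakes)   # the best expert errs on at most half the rounds
assert m <= T / 2
assert our_mistakes == T
assert our_mistakes > 1.39 * m + 2 * math.log(n)   # the bound is violated
```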

This shows that the improved upper bound proven for the randomized algorithm did not come from the randomized algorithm making systematically better predictions—doing superior cognitive work, being more intelligent—but because we arbitrarily declared that an intelligent adversary could read our mind in one case but not the other.

This is not just a minor quibble. It leads to the testable prediction that on real-world problems, where the environment is usually not an adversarial telepath, the unrandomized weighted-majority algorithm should do better than the randomized version. (Provided that some component experts outperform maximum entropy—are right more than 50% of the time on binary problems.)

Analyzing the worst-case scenario is standard practice in computational learning theory, but it makes the math do strange things. Moreover, the assumption of the worst case can be subtle; it is not always explicitly labeled. Consider the upper bound for the unrandomized weighted-majority algorithm. I did not use the term “worst case”—I merely wrote down some inequalities. In fact, Blum (1996), when initially introducing this bound, does not at first use the phrase “worst case”. The proof’s conclusion says only:

“Theorem 1. The number of mistakes M made by the Weighted Majority algorithm described above is never more than 2.41(m + lg n), where m is the number of mistakes made by the best expert so far.”

Key word: Never.

The worst-case assumption for the unrandomized algorithm was implicit in calculating the right-hand side of the inequality by assuming that, on each mistake, the total weight went down by a factor of exactly 1/4, and that the total weight never decreased after any successful prediction. This is the absolute worst possible performance the weighted-majority algorithm can give.

The assumption implicit in those innocent-looking equations is that the environment carefully maximizes the anguish of the predictor: The environment so cleverly chooses its data samples that, on each case where the weighted-majority algorithm is allowed to be successful, it shall receive not a single iota of useful feedback—not a single expert shall be wrong. And the weighted-majority algorithm will be made to err on sensory cases that produce the minimum possible useful feedback, maliciously fine-tuned to the exact current weighting of experts. We assume that the environment can predict every aspect of the predictor and exploit it—unless the predictor uses a “random” number generator which we arbitrarily declare to be unknowable to the adversary.

What strange assumptions are buried in that innocent little inequality,

<=

Moreover, the entire argument in favor of the randomized algorithm was theoretically suspect from the beginning, because it rested on non-transitive inequalities. If I prove an upper bound on the errors of algorithm X, and then prove a smaller upper bound on the errors of algorithm Y, it does not prove that in the real world Y will perform better than X. For example, I prove that X cannot do worse than 1000 errors, and that Y cannot do worse than 100 errors. Is Y a better algorithm? Not necessarily. Tomorrow I could find an improved proof which shows that X cannot do worse than 90 errors. And then in real life, both X and Y could make exactly 8 errors.

4 <= 1,000,000,000
9 <= 10

But that doesn’t mean that 4 > 9.

So the next time you see an author remark, “We can further improve this algorithm using randomization...” you may not know exactly where to find the oddity. If you’d run across the above-referenced example (or any number of others in the machine-learning literature), you might not have known how to deconstruct the randomized weighted-majority algorithm. If such a thing should happen to you, then I hope that I have given you grounds to suspect that an oddity exists somewhere, even if you cannot find it—just as you would suspect a machine that proposed to extract work from a heat bath, even if you couldn’t keep track of all the gears and pipes.

Nominull put it very compactly when he said that, barring an environment which changes based on the form of your algorithm apart from its output, “By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution.”

As I remarked in Perpetual Motion Beliefs:

I once knew a fellow who was convinced that his system of wheels and gears would produce reactionless thrust, and he had an Excel spreadsheet that would prove this—which of course he couldn’t show us because he was still developing the system. In classical mechanics, violating Conservation of Momentum is provably impossible. So any Excel spreadsheet calculated according to the rules of classical mechanics must necessarily show that no reactionless thrust exists—unless your machine is complicated enough that you have made a mistake in the calculations.

If you ever find yourself mathematically proving that you can do better by randomizing, I suggest that you suspect your calculations, or suspect your interpretation of your assumptions, before you celebrate your extraction of work from a heat bath.

• What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.

(you are of course free to suggest and to suspect anything you like; I will, however, point out that suspicion is no substitute for building something that actually works...)

• There are many problems for which Monte Carlo integration is the most efficient method available.

(Emphasis mine.) I know I’m late to the party, but I just noticed this. While what you say is true, “available” in this case means “that we know of”, not “that is possible”. I’m not an expert, but IIRC the point of MCI is that the functions are hard to analyze. You could integrate analogously without randomization (e.g., sample the integration domain on a regular grid), and it very well might work. The problem is that the function you integrate might happen to behave atypically on the regular grid with respect to its global behavior (e.g., if the function has some regular features with a frequency that accidentally matches that of your grid, you might sample only local maxima, and severely overestimate the integral).

But for an individual function there certainly exists (for some values of “certainly”) an ideal sampling that yields an excellent integration, or some even better method that doesn’t need sampling. But you need to understand the function very well, and search for that perfect evaluation strategy, and do this for every individual function you want to integrate. (Not a math expert here, but I suspect that in the general case that grid would be impossible to find, even if it exists, and it’s probably very hard even for those cases where the solution can be found.)

So what MCI does is exactly what Eliezer mentions above: you randomize the sampling as well as you can, to avoid as much as possible tripping an unexpected “resonance” between the grid and the function’s behavior. In effect, you’re working against an environment you can’t overwhelm by being smart, so you brute-force it and do your best to reduce the ways the environment can overwhelm you, i.e., you try to minimize the worst-case consequences.

Note that for functions we really understand, MCI is far from being the best method. If you want to integrate a polynomial, you just compute the antiderivative, evaluate it at two points, subtract, and you’re done. MCI is nice because in general we don’t know how to find the antiderivative. In analogy with tic-tac-toe, integration would be the game, and each function is a different opponent; a polynomial is an adversary we can anticipate as well as Omega can model a human—in fact, we don’t even need to evaluate the function, we can deduce directly what the result of “its actions” will be; but most functions (probably almost all) are such difficult adversaries that we can’t do better than try to limit how badly they can screw us.
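The “resonance” failure mode described above is easy to reproduce. In this sketch (my own example, with a frequency chosen to collide with the grid), f(x) = sin²(50πx) has true integral 1/2 over [0, 1], but a 50-point regular grid samples only its zeros, while plain Monte Carlo sampling lands near the right answer:

```python
import math
import random

# f has period 1/25; its integral over [0, 1] is exactly 1/2.
def f(x):
    return math.sin(50 * math.pi * x) ** 2

# A regular grid whose spacing "resonates" with f: every sample point
# hits a zero of f, so the estimate is wildly wrong.
N = 50
grid_estimate = sum(f(j / N) for j in range(N)) / N
assert grid_estimate < 1e-9          # estimates ~0 instead of 0.5

# Random sampling has no frequency for f to lock onto.
rng = random.Random(0)
samples = 100_000
mc_estimate = sum(f(rng.random()) for _ in range(samples)) / samples
assert abs(mc_estimate - 0.5) < 0.01
```

(Here the grid underestimates rather than overestimates, but the mechanism is the same: a regularity in the sampling that matches a regularity in the function.)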

(I have a vague recollection of an anecdote about MCI: Someone somewhere was using it to do a complicated multiple integral, and got very silly results. Turns out, while the pseudo-random generator looked OK by itself, when its output was distributed throughout the complicated multi-dimensional space that was sampled, a non-obvious regularity in the generated numbers happened to match something the function did. So, in effect, the function beat them. In other words (ignoring that this is apocryphal), you don’t even need a very smart opponent to be screwed; a hard function is enough… That quantum noise generator can be very handy.)

• (I have a vague recollection of an anecdote about MCI: Someone somewhere was using it to do a complicated multiple integral, and got very silly results. Turns out, while the pseudo-random generator looked OK by itself, when its output was distributed throughout the complicated multi-dimensional space that was sampled, a non-obvious regularity in the generated numbers happened to match something the function did. So, in effect, the function beat them. In other words (ignoring that this is apocryphal), you don’t even need a very smart opponent to be screwed; a hard function is enough… That quantum noise generator can be very handy.)

Modern PRNGs tend to be less bad than that.

• That’s not at all my domain of expertise, so I’ll take your word for it (though I guessed as much).

That said, I meant that as half funny anecdote, half “it’s not just theory, it’s real life, and you can get screwed hard by something that guesses your decision algorithm, even if it’s just a non-sentient function”, not as an argument for/against either MC or quantum noise. (Though, now that I think of it, these days there probably are quantum noise generators available somewhere.)

• (Though, now that I think of it, these days there probably are quantum noise generators available somewhere.)

AFAIK, there are, but they’re way too slow for use in MC, so decent-but-fast PRNGs are still preferred.

• Also, sometimes you need to run several tests with the same random sequence for debugging purposes, which with true random numbers would require you to store all of them, whereas with pseudo-random numbers you just need to use the same seed.
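That reproducibility trick is just reuse of the seed. A minimal illustration with Python’s standard-library generator (the function here is a stand-in for whatever randomized test is being debugged):

```python
import random

def simulation_trace(seed, length=5):
    # Stand-in for a randomized test run we want to replay while debugging.
    rng = random.Random(seed)
    return [rng.random() for _ in range(length)]

# Re-running with the same seed replays the exact "random" sequence,
# so nothing needs to be stored; a different seed gives a fresh sequence.
assert simulation_trace(42) == simulation_trace(42)
assert simulation_trace(42) != simulation_trace(43)
```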

• Yes.
However, programming environments oriented towards numerical computing tend to use the Mersenne Twister as their default PRNG.

• Mersenne Twister is the most commonly used algorithm. It is fast, and generates good results. It is not adversarially secure, but in that situation you should be using a cryptographically secure RNG.

Age of the PRNG algorithm has no bearing on this discussion. (If rand() uses something else, it’s for standards-compatibility reasons; nobody writing Monte Carlo algorithms would be using such PRNGs.)

• If rand() uses something else, it’s for standards-compatibility reasons;

Actually the C and POSIX standards impose very few requirements on rand(); it would be entirely possible in principle for it to be based on the Mersenne Twister while still complying with the standards.

• It is getting late, so this may be way off, and I have not had time to read the paper.

This is also assuming finite trials, right? Because over infinite trials, if you have a non-zero probability of siding with the wrong group of classifiers, you will make infinite mistakes, no matter how small the probabilities get.

It seems it is trading off a better expectation for a worse real worst case.

Also, are you thinking of formalising an alternative to the infinite SI worst case you describe?

• Um, maybe I misunderstood what you said or the proofs, but isn’t there a much simpler way to reject the proof of superiority of the randomized WMA, namely that the first proof is, as you say, the worst-case scenario… but the second proof is NOT. It uses expected value rather than the worst case, right? So it’s not so much “worst-case analysis does really weird things” but “not only that, the two proofs aren’t even validly comparable, because one is a worst-case analysis and one involves expected-value analysis”.

Or did I horribly misunderstand either the second proof or your argument against it? (if so, I blame lack of sleep)

• Psy, the “worst case expected value” is a bound directly comparable to the “worst case value”, because the actual trajectory of expert weights is the same between the two algorithms.

Pearson, as an infinite-set atheist I see no reason to drag infinities into this case, and I don’t recall any of the proofs talking about them either.

• Has anyone proved a theorem on the uselessness of randomness?

Nominull’s suggestion:

“By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution.”

might be a starting point for such a proof. However, it does rely on being able to find that “one point” whose expected value is >= the average expected value of the distribution. Perhaps it is easy to prove that good points lie in a certain range, but all obvious specific choices in that range are bad.
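The core of Nominull’s observation is just that the maximum of a finite set is at least its mean: a randomized algorithm’s expected value is a probability-weighted average of its deterministic variants’ expected values, so some deterministic variant does at least as well. A toy numerical illustration (the payoff and mixture numbers are made up; finding that best point, as noted above, is the hard part):

```python
# Hypothetical expected payoffs of four deterministic variants of an
# algorithm (e.g., four fixed choices a randomized version mixes over).
payoffs = [0.2, 0.9, 0.4, 0.7]

# The randomized algorithm draws variant i with probability mix[i];
# its expected payoff is the weighted average of the variants' payoffs.
mix = [0.1, 0.3, 0.4, 0.2]
randomized_value = sum(p * q for p, q in zip(payoffs, mix))

# Max >= mean: derandomizing to the best single variant never loses.
assert max(payoffs) >= randomized_value
```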

• So, the randomized algorithm isn’t really better than the unrandomized one, because getting a bad result from the unrandomized one is only going to happen when your environment maliciously hands you a problem whose features match up just wrong with the non-random choices you make. So all you need to do is to make those choices in a way that’s tremendously unlikely to match up just wrong with anything the environment hands you, because it doesn’t have the same sorts of patterns in it that the environment might inflict on you.

Except that the definition of “random”, in practice, is something very like “generally lacking the sorts of patterns that the environment might inflict on you”. When people implement “randomized” algorithms, they don’t generally do it by introducing some quantum noise source into their system (unless there’s a real adversary, as in cryptography); they do it with a pseudorandom number generator, which precisely is a deterministic thing designed to produce output that lacks the kinds of patterns we find in the environment.

So it doesn’t seem to me that you’ve offered much argument here against “randomizing” algorithms as generally practised; that is, having them make choices in a way that we confidently expect not to match up pessimally with what the environment throws at us.

Or, less verbosely:

Indeed randomness can improve the worst-case scenario, if the worst-case environment is allowed to exploit “deterministic” moves but not “random” ones. What “random” means, in practice, is: the sort of thing that typical environments are not able to exploit. This is not cheating.

• What “random” means, in practice, is: the sort of thing that typical environments are not able to exploit.

No, it means “the sort of thing which is uncorrelated with the environment”. What you want is for your responses to be correlated with the environment in a way that benefits you.

• I think this series of posts should contain a disclaimer at some point stating the usability of randomized algorithms in practice (even excluding situations like distributed computing and cryptography). In a broad sense, though randomness offers us no advantage information-theoretically (this is what you seem to be saying), randomness does offer a lot of advantages computationally. This fact is not at odds with what you are saying, but deserves to be stated more explicitly.

There are many algorithmic tasks which become feasible via randomized algorithms, for which no feasible deterministic alternatives are known—until recently primality testing was one such problem (even now, the deterministic primality test is unusable in practice). Another crucial advantage of randomized algorithms is that in many cases it is much easier to come up with a randomized algorithm and later, if needed, derandomize it. Randomized algorithms are extremely useful in tackling algorithmic problems, and I think any AI that has to design algorithms would have to be able to reason in this paradigm (cf. Randomized Algorithms, Motwani and Raghavan, or Probability and Computing, Mitzenmacher and Upfal).

• To put my point another way...

Say your experts, above, predict instance frequencies in populations. Huge populations, so you can only test expert accuracy by checking subsets of those populations.

So, you write an algorithm to choose which individuals to sample from populations when testing experts’ beliefs. Aren’t there situations where the best algorithm, the one least likely to bias for any particular expert algorithm, would contain some random element?

I fear randomness is occasionally necessary just to fairly check the accuracy of beliefs about the world and overcome bias.

• Eliezer: I recognize that the two algorithms have the exact same trajectory for the weights; the only thing that differs between them is the final output, not the weights. However, maybe I’m being clueless here, but I still don’t see that as justifying considering the two different proof results really comparable.

The real worst-case analysis for the second algorithm would be more like “the random source just happened to (extremely unlikely, but...) always come up with the wrong answer.”

Obviously this sort of worst-case analysis isn’t really fair to randomized algorithms; obviously you’d have to do something like “what’s the worst-case probability of getting it wrong,” or some such. But then, clearly (to me, at least) it’s kind of cheating to then turn around and say “okay then, now let’s compare that to the actual worst-case analysis of the deterministic algorithm and conclude the random algorithm is superior.”

On the other hand, I guess you could say that what’s being done is “let’s compare the expected number of mistakes given the assumption of the most malicious possible data. In the deterministic version, that happens to be the actual worst case, and in the probabilistic version, well, obviously there are other possibilities, so it’s only the expected value for the worst case given the inputs.”

So I guess they are comparable after all.

But that only works if one is creative about what we allow to be considered the “worst case”: that is, if we consider the worst case for the input data, but not the worst case from the random source of bits. Which, I guess, was your point all along: that it’s due to the peculiar assumptions implicit in the notion of “worst-case analysis.”

*chuckles* Well, I guess this is a case in point with regards to the whole “disagreement debate”: it seems that when I’m in disagreement with you, many times I end up, in the process of arguing my position, finding that I was the one in error.

• Sure, randomness is the lazy way out. But if computational resources are not infinite, laziness is sometimes the smart way out. And I said “infinite”: even transhuman quantum computers are finite.

• I agree with Psy-Kosh; I don’t think the proofs are comparable.

The difference between the two derivations is that #1 uses a weak upper bound for W, while #2 uses an equality. This means that the final bound is much weaker in #1 than in #2.

It’s possible to redo the derivation for the nonrandomized version and get the exact same bound as for the randomized case. The trick is to write M as the number of samples for which F_i > 1/2. Under weak assumptions on the performance of the experts you can show that this is less than Sum_i F_i.

Eliezer: thanks for the AI brain-teaser; I hope to see more posts like this in the future.
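
One way to make the key step explicit (a sketch only; the commenter’s exact assumptions aren’t spelled out): the unrandomized algorithm errs only on trials where the weighted fraction F_i of mistaken experts exceeds 1/2, and each such trial contributes at least 1/2 to the sum of the F_i, so

```latex
M \;=\; \bigl|\{\, i : F_i > \tfrac{1}{2} \,\}\bigr|
  \;\le\; \sum_{i \,:\, F_i > 1/2} 2F_i
  \;\le\; 2\sum_i F_i ,
```

which bounds the deterministic mistake count by twice the randomized algorithm’s expected mistake count Sum_i F_i. The commenter’s stronger claim, that M is below Sum_i F_i itself, would require the additional assumptions on the experts alluded to above.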

• I agree with Psy-Kosh too. The key is, as Eliezer originally wrote, never. That word appears in Theorem 1 (about the deterministic algorithm), but it does not appear in Theorem 2 (the bound on the randomized algorithm).

Basically, this is the same insight Eliezer suggests: the environment is allowed to be a superintelligent entity with complete knowledge in the proof for the deterministic bound, but the environment is not allowed the same powers in the proof for the randomized one.

In other words, Eliezer’s conclusion is correct, but I don’t think the puzzle is as difficult as he suggests. I think Psy-Kosh (and Daniel) are right that the mistake is in believing that the two theorems are actually about the same, identical topic. They aren’t.

• The question of whether randomized algorithms can have an advantage over deterministic algorithms is called the P vs. BPP problem. As far as I know, most people believe that P = BPP, that is, randomization doesn’t actually add power. However, nobody knows how to prove this. Furthermore, there exists an oracle relative to which P is not equal to BPP. Therefore any proof that P = BPP has to be nonrelativizing. This rules out a lot of the more obvious methods of proof.

see: complexity zoo

• Nicely done. Quite a number of points raised that escape many people and which are rarely raised in practice.

• Someone please tell me if I understand this post correctly. Here is my attempt to summarize it:

“The two textbook results are results specifically about the worst case. But you only encounter the worst case when the environment can extract the maximum amount of knowledge it can about your ‘experts’, and exploits this knowledge to worsen your results. For this case (and nearby similar ones) only, randomizing your algorithm helps, but only because it destroys the ability of this ‘adversary’ to learn about your experts. If you instead average over all cases, the non-random algorithm works better.”

Is that the argument?

• I am interested in what Scott Aaronson says to this.

I am unconvinced, and I agree with both the commenters g and R above. I would say Eliezer is underestimating the number of problems where the environment gives you correlated data and where the correlation is essentially a distraction. Hash functions, for example, are widely used in programming tasks and not just by cryptographers. Randomized algorithms often are based on non-trivial insights into the problem at hand. For example, the insight in hashing and related approaches is that “two (different) objects are highly unlikely to give the exact same result when (the same) random function (from a certain class of functions) is applied to both of them, and hence the result of this function can be used to distinguish the two objects.”
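
A minimal sketch of that insight, using one standard universal family h(x) = ((ax + b) mod p) mod m; the prime and table size below are illustrative choices, not anything from the comment:

```python
import random

P = 2_147_483_647  # a prime larger than any key we will hash

def random_hash(m):
    """Draw h(x) = ((a*x + b) mod P) mod m at random from a universal family."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

# For any two FIXED distinct keys, a randomly drawn h collides on them
# with probability about 1/m, no matter how adversarially they were chosen.
h = random_hash(1_000_000)
print(h(1234) == h(5678))  # False, except with probability ~1/m
```

The point being illustrated: the adversary may pick the two keys, but since it cannot pick the function, the collision probability is small anyway.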

• @comingstorm: Quasi-Monte Carlo often outperforms Monte Carlo integration for problems of small dimensionality involving “smooth” integrands. This is, however, not yet rigorously understood. (The proven bounds on performance for medium dimensionality seem to be extremely loose.)

Besides, MC doesn’t require randomness in the “Kolmogorov complexity == length” sense, but in the “passes statistical randomness tests” sense. Eliezer has, as far as I can see, not talked about the various definitions of randomness.

• Fascinating post. I especially like the observation that the purpose of randomness in all these examples is to defeat intelligence. It seems that this is indeed the purpose for nearly every use of randomness that I can think of. We randomise presidents’ routes to defeat the intelligence of an assassin; a casino randomises its roulette wheels to defeat the intelligence of the gambler. Even in sampling, we randomise our sample to defeat our own intelligence, which we need to treat as hostile for the purpose of the experiment.

• What about compressed sensing, where we want an incoherent basis? I was under the impression that random linear projections were the best you could do here.

• kyb, see the discussion of quicksort on the other thread. Randomness is used to protect against worst-case behavior, but it’s not because we’re afraid of intelligent adversaries. It’s because worst-case behavior for quicksort happens a lot. If we had a good description of naturally occurring lists, we could design a deterministic pivot algorithm, but we don’t. We only have the observation that simple guess-the-median algorithms perform badly on real data. It’s not terribly surprising that human-built lists resonate with human-designed pivot algorithms; but the opposite scenario, where the simplex method works well in practice, is not surprising either.

• Well, Eliezer seems to be stuck in the mud here: the only solutions to a problem that he can accept are ones that fit into Bayesian statistics and a logical syllogism. But he seems to be blithely unaware of the vast scope of possible approaches to solving any given problem. Not to mention that the post seems to be a straw man, since I would find it hard to imagine that any real scientist or engineer would ever claim that randomness is categorically better than an engineered solution. But I guess that’s because I associate and work with real engineers and scientists, so what would I know. Not to mention that the statement that a randomized algorithm can perform better is true to a limited extent. The statement about the randomized algorithm being better is a specific statement about that algorithm, and I would imagine the people making it would agree that one could engineer a better solution. I can’t imagine any real engineer making the claim that the randomized algorithm will always perform better than the engineered one.

• I agree with Psy; the bounds are not comparable.

In fact, the bound for #2 should be the same as the one for #1. I mean, in the really worst case, the randomized algorithm makes exactly the same predictions as the deterministic one. I advise you to blame and demolish your quantum source of random numbers in this case.

• Tentative, but what about cases where you don’t trust your own judgement and suspect that you need to be shaken out of your habits?

• What about Monte Carlo methods? There are many problems for which Monte Carlo integration is the most efficient method available.

Monte Carlo methods can’t buy you any correctness. They are useful because they allow you to sacrifice an unnecessary bit of correctness in order to get a result in a much shorter time on an otherwise intractable problem. They are also useful to simulate the effects of real-world randomness (or at least behavior you have no idea how to systematically predict).

So, for example, I used a Monte Carlo script to determine expected scale economies for print order flow in my business. Why? Because it’s simple and the behavior I am modeling is effectively random to me. I could get enough information to make a simulation that gives me 95% accuracy with a few hours of research and another few hours of programming time. Of course there is somewhere out there a non-randomized algorithm that could do a more accurate job with a faster run time, but the cost of discovering it and coding it would be far more than a day’s work, and 95% accuracy on a few dozen simulations was good enough for me to estimate more accurately than most of my competition, which is all that mattered. But Eliezer’s point stands. Randomness didn’t buy me any accuracy; it was a way of trading accuracy for development time.
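
The commenter’s actual model isn’t given, so the following is only a hypothetical stand-in showing the shape of such a script: a toy scale-economy cost curve averaged over randomly drawn order sizes.

```python
import random

def unit_cost(order_size):
    # Toy scale-economy curve: a fixed setup cost amortized over the run.
    # (Illustrative only; the real cost model is not in the comment.)
    return 50.0 / order_size + 0.10

def expected_unit_cost(trials=50_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Order sizes treated as "effectively random to me": a lognormal
        # placeholder distribution, not real business data.
        size = max(1, int(rng.lognormvariate(4, 1)))
        total += unit_cost(size)
    return total / trials

print(round(expected_unit_cost(), 3))
```

A few hours of this kind of sampling trades a little accuracy for a lot of development time, which is exactly the trade the comment describes.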

• Congratulations Eliezer, this subject is very interesting.

Nonetheless, useful methods like Monte Carlo with pseudo-random numbers ARE actually deterministic, but we discard the details (very complex and not useful for the application) and only care about some statistical properties. This is a state of subjective partial information, the business of Bayesian probability, only this partial information is deliberate! (Or better, a pragmatic compromise due to limitations in computational power. Otherwise we’d just use classical numerical integration happily in tens of dimensions.)

This Bayesian perspective on “random” algorithms is not something usually found. I think the frequentist interpretation is still permeating (and dragging) most research. Indeed, more Jaynes is needed.

• As stated, the statement “Randomness didn’t buy me any accuracy, it was a way of trading accuracy for development time” is a complete non sequitur, since nobody actually believes that randomness is a means of increased accuracy in the long term; it is used as a performance booster. It merely means you can get the answer faster with less knowledge, which we do all the time, since there are a plethora of systems that we can’t understand.

• Has anyone proved a theorem on the uselessness of randomness?

Clearly you don’t recognize the significance of Eliezer’s work. He cannot be bound by such trivialities as ‘proof’ or ‘demonstration’. They’re not part of the New Rationality.

• Eliezer, can you mathematically characterize those environments where randomization can help? Presumably these are the “much more intelligent than you, and out to get you” environments, but I’d like to see that characterized a bit better.

• A slight tangent here, but Blum loses me even before the considerations that Eliezer has so clearly pointed out. Comparing an upper bound for the worst case of one algorithm with an upper bound for the average case of the other is meaningless. Even were like things being compared, mere upper bounds on the true values are only weak evidence of the quality of the algorithms.

FWIW, I ran a few simulations (for the case of complete independence of all experts from themselves and each other), and unsurprisingly, the randomised algorithm performed on average worse than the unrandomised one. In addition, the weighting algorithm converges to placing all of the weight on the best expert. If the experts’ errors are at all independent, this is suboptimal.

Actual performance of both algorithms was far better than the upper bounds, and rather worse than one using optimal weights. I cannot see these upper bounds as anything more than a theoretical irrelevance.
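
The commenter’s code isn’t shown; a minimal version of the experiment (with arbitrary, illustrative error rates and update factor) might look like this:

```python
import random

def simulate(n_experts=5, n_rounds=1000, beta=0.95, seed=0):
    """Deterministic vs. randomized weighted majority with independent experts.
    Error rates and beta are illustrative choices, not from the comment."""
    rng = random.Random(seed)
    err = [0.10 + 0.08 * i for i in range(n_experts)]  # per-expert error rates
    w = [1.0] * n_experts   # both variants share the identical weight trajectory
    mistakes_det = mistakes_rnd = 0
    for _ in range(n_rounds):
        truth = rng.random() < 0.5
        preds = [truth if rng.random() >= err[i] else (not truth)
                 for i in range(n_experts)]
        # Deterministic: follow the weighted majority vote.
        yes = sum(wi for wi, p in zip(w, preds) if p)
        mistakes_det += ((yes >= sum(w) / 2) != truth)
        # Randomized: follow a single expert drawn with probability ∝ weight.
        pick = rng.choices(range(n_experts), weights=w)[0]
        mistakes_rnd += (preds[pick] != truth)
        # Identical weight update in both variants: penalize wrong experts.
        for i in range(n_experts):
            if preds[i] != truth:
                w[i] *= beta
    return mistakes_det, mistakes_rnd, w
```

Averaged over many seeds, the deterministic variant makes fewer mistakes and the weights concentrate on the best expert, matching both observations in the comment.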

• Anyone who evaluates the performance of an algorithm by testing it with random data (e.g., simulating these expert-combining algorithms with randomly-erring “experts”) is ipso facto executing a randomized algorithm...

• g: Indeed I was. Easier than performing an analytical calculation, and sufficiently rigorous for the present purpose. (The optimal weights, on the other hand, are log(p/(1-p)), by an argument that is both easier and more rigorous than a simulation.)

• Richard, I wasn’t suggesting that there’s anything wrong with your running a simulation; I just thought it was amusing in this particular context.

• Silas: yeah, that’s about the size of it.

Eliezer, when you come to edit this for popular publication, lose the maths, or at least put it in an endnote. You’re good enough at explaining concepts that if someone’s going to get it, they’re going to get it without the equations. However, a number of those people will switch off there and then. I skipped it and did fine, but algebra is a stop sign for a number of very intelligent people I know.

• I am interested in what Scott Aaronson says to this.

I fear Eliezer will get annoyed with me again :), but R and Stephen basically nailed it. Randomness provably never helps in average-case complexity (i.e., where you fix the probability distribution over inputs), since given any ensemble of strategies, by convexity there must be at least one deterministic strategy in the ensemble that does at least as well as the average.

On the other hand, if you care about the worst-case running time, then there are settings (such as query complexity) where randomness provably does help. For example, suppose you’re given n bits, you’re promised that either n/3 or 2n/3 of the bits are 1s, and your task is to decide which. Any deterministic strategy to solve this problem clearly requires looking at 2n/3 + 1 of the bits. On the other hand, a randomized sampling strategy only has to look at O(1) bits to succeed with high probability.

Whether randomness ever helps in worst-case polynomial-time computation is the P versus BPP question, which is in the same league as P versus NP. It’s conjectured that P = BPP (i.e., randomness never saves more than a polynomial). This is known to be true if really good pseudorandom generators exist, and such PRGs can be constructed if certain problems that seem to require exponentially large circuits really do require them (see this paper by Impagliazzo and Wigderson). But we don’t seem close to proving P = BPP unconditionally.
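
The sampling strategy for the promise problem above can be sketched as follows; the sample size of 100 is an arbitrary constant (any fixed size independent of n gives only high probability, not certainty):

```python
import random

def decide_fraction(bits, samples=100, rng=random):
    """Promised: exactly n/3 or 2n/3 of `bits` are 1. Sample a constant
    number of random positions and answer by majority of the sample."""
    ones = sum(bits[rng.randrange(len(bits))] for _ in range(samples))
    return '2n/3' if ones > samples // 2 else 'n/3'

n = 3000
instance = [1] * (n // 3) + [0] * (2 * n // 3)
random.shuffle(instance)
print(decide_fraction(instance))  # 'n/3', except with small probability
```

The number of samples needed for a given error probability does not grow with n, which is the whole point of the example.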

• Just because you refuse to research probabilistic/random algorithms doesn’t mean they don’t work.

Do some background work for once.

• @Scott Aaronson. Re: your n-bits problem. You’re moving the goalposts. Your deterministic algorithm determines with 100% accuracy which situation is true. Your randomized algorithm only determines with “high probability” which situation is true. These are not the same outputs.

You need to establish a goal with a fixed level of probability for the answer, and then compare a randomized algorithm to a deterministic algorithm that only answers to that same level of confidence.

That’s the same mistake that everyone always makes when they say that “randomness provably does help.” It’s a cheaper way to solve a different goal. Hence, not comparable.

• Scott, I don’t dispute what you say. I just suggest that the confusing term “in the worst case” be replaced by the more accurate phrase “supposing that the environment is an adversarial superintelligence who can perfectly read all of your mind except bits designated ‘random’”.

• Don: When you fix the goalposts, make sure someone can’t kick the ball straight in! :-) Suppose you’re given an n-bit string, and you’re promised that exactly n/4 of the bits are 1, and they’re either all in the left half of the string or all in the right half. The problem is to decide which. It’s clear that any deterministic algorithm needs to examine at least n/4 + 1 of the bits to solve this problem. On the other hand, a randomized sampling algorithm can solve the problem with certainty after looking at only O(1) bits on average.

Eliezer: I often tell people that theoretical computer science is basically mathematicized paranoia, and that this is the reason why Israelis so dominate the field. You’re absolutely right: we do typically assume the environment is an adversarial superintelligence. But that’s not because we literally think it is one; it’s because we don’t presume to know which distribution over inputs the environment is going to throw at us. (That is, we lack the self-confidence to impose any particular prior on the inputs.) We do often assume that, if we generate random bits ourselves, then the environment isn’t going to magically take those bits into account when deciding which input to throw at us. (Indeed, if we like, we can easily generate the random bits after seeing the input; not that it should make a difference.)

Average-case analysis is also well-established and used a great deal. But in those cases where you can solve a problem without having to assume a particular distribution over inputs, why complicate things unnecessarily by making such an assumption? Who needs the risk?

• EDIT: OK, I found the explanation further down; it seems I was on the right path.

It’s clear that any deterministic algorithm needs to examine at least n/4 + 1 of the bits to solve this problem. [...] On the other hand, a randomized sampling algorithm can solve the problem with certainty after looking at only O(1) bits on average.

Sorry Scott, I know I’m late to the party, but this got me curious. Can you explain this, or point me to an explanation? I’m not sure I got the correct numbers.

As far as I can tell, a reasonable deterministic algorithm is to examine, one by one, at most (n/4 + 1) of the bits (e.g., the first ones); but if you find a 1 early, you stop. The nondeterministic one would be to just test bits in random order until you see a 1.

For the deterministic version, in half the cases you’re going to evaluate all those bits, when they’re in the “end” half.

If the bits are in the “front” half, or if the algorithm is random, I can’t figure out exactly how to solve for the average number of bits to test, but it seems like the deterministic algorithm should do on average about as well* as the nondeterministic one on a problem half the size, and in the worst case it still does (n/4 + 1) bits, while the nondeterministic algorithm has to do (n/2 + 1) tests in the worst-case scenario, unless I’m mistaken.

I mean, it does seem like the probability of hitting a bad case is low on average, but the worst case seems twice as bad, even though harder to hit.

I’m not sure how you got O(1); I got to a sum of reciprocals of squares up to k for the probability of the first 1 bit being the k-th tested, which sounds about O(1) on average, but my probabilities are rusty; is that the way?

(* of course, it’s easy to construct the bad case for the deterministic algorithm if you know which bits it tests first.)

• the environment is an adversarial superintelligence

Oh, y’all are trying to outwit Murphy? No wonder you’re screwed.

• Could Scott_Aaronson, or anyone who knows what he’s talking about, please tell me the name of the n/4 left/right bits problem he’s referring to, or otherwise give me a reference for it? His explanation doesn’t seem to make sense: the deterministic algorithm needs to examine 1 + n/4 bits only in the worst case, so you can’t compare that to the average output of the random algorithm. (The average case for the deterministic one would, it seems, be n/8 + 1.) Furthermore, I don’t understand how the random method could average out to a size-independent constant.

Is the randomized algorithm one that uses a quantum computer or something?

• Silas: “Solve” = for a worst-case string (in both the deterministic and randomized cases). In the randomized case, just keep picking random bits and querying them. After O(1) queries, with high probability you’ll have queried either a 1 in the left half or a 1 in the right half, at which point you’re done.

As far as I know this problem doesn’t have a name. But it’s how (for example) you construct an oracle separating P from ZPP.

• @Scott_Aaronson: Previously, you had said the problem is solved with certainty after O(1) queries (which you had to, to satisfy the objection). Now, you’re saying that after O(1) queries, it’s merely a “high probability”. Did you not change which claim you were defending?

Second, how can the required number of queries not depend on the problem size?

Finally, isn’t your example a special case of exactly the situation Eliezer_Yudkowsky describes in this post? In it, he pointed out that the “worst case” corresponds to an adversary who knows your algorithm. But if you specifically exclude that possibility, then a deterministic algorithm is just as good as the random one, because it would have the same correlation with a randomly chosen string. (It’s just like the case in the lockpicking problem: guessing all the sequences in order has no advantage over randomly picking and crossing off your list.) The apparent success of randomness is again due to “acting so crazy that a superintelligent opponent can’t predict you”.

Which is why I summarize Eliezer_Yudkowsky’s position as: “Randomness is like poison. Yes, it can benefit you, but only if you use it on others.”

• Silas: Look, as soon as you find a 1 bit, you’ve solved the problem with certainty. You have no remaining doubt about the answer. And the expected number of queries until you find a 1 bit is O(1) (or 2, to be exact). Why? Because with each query, your probability of finding a 1 bit is 1/2. Therefore, the probability that you need t or more queries is (1/2)^(t-1). So you get a geometric series that sums to 2.

(Note: I was careful to say you succeed with certainty after an expected number of queries independent of n, not that there’s a constant c such that you succeed with certainty after c queries, which would be a different claim!)

And yes, I absolutely assume that the adversary knows the algorithm, because your algorithm is a fixed piece of information! Once again, the point is not that the universe is out to get us; it’s that we’d like to succeed even if it is out to get us. In particular, we don’t want to assume that we know the probability distribution over inputs, and whether it might happen to be especially bad for our algorithm. Randomness is useful precisely for “smearing out” over the set of possible environments, when you’re completely ignorant about the environment.

If you’re not willing to think this way, the price you pay for Bayesian purity is to give up the whole theory of randomized algorithms, with all the advances it’s led to even in deterministic algorithms; and to lack the conceptual tools even to ask basic questions like P versus BPP.

It feels strange to be explaining CS101 as if I were defending some embattled ideological claim, but it’s good to be reminded why it took people a long time to get these things right.

• Quick check: take the algorithm that tests bits 1, n/2+1, 3, n/2+3, 5, n/2+5, ... and then continues alternating halves on even indices. Of course, it stops at the first 1. And it’s obviously deterministic.

But unless I’m mistaken, averaged over all possible instances of the problem, this algorithm has exactly the same average, best case, and worst case as the nondeterministic one. (In fact, I think this holds for any fixed sequence of test indices, as long as it alternates between halves, such as 1, n, 2, n-1, 3, n-2... Basically, my reasoning is that an arbitrary index sequence should have (on average) the same probability of finding the answer in k tests on a random problem that a random index sequence does.)

I know you said it’s adversarial, and it’s clear that (assuming you know the sequence of tested indices) you can construct problem instances that take n/2+1 (worst-case) tests, in which case the problem isn’t random. I just want to test if I reasoned correctly.

If I’m right, then I think what you’re saying is exactly what Eliezer said: if you know the adversary can predict you, you’ve got no better option than to act randomly. (You just don’t do as badly in this case as in Eliezer’s example.) Conversely, if you can predict the adversary, then the random algorithm completely wastes this knowledge. (I.e., if you could guess more about the sequence, you could probably do better than n/4+1 worst case, and if you can predict the adversary then you don’t even need to test the bits, which is basically O(0).)

• Isn’t the probability of finding a bit 1/4 in each sample? Because you have to sample both sides. If you randomly sample without repetitions you can do even better than that. You can even bound the worst case for the random algorithm at n/4+1 as well, by seeing if you ever pick n/4+1 0s from any side. So there is no downside to doing it randomly as such.

However, if it is an important problem and you think you might be able to find some regularities, the best bet would be to do Bayesian updates on the most likely positions to be ones and preferentially choose those. You might be bitten by a superintelligence predicting what you think most probable and doing the opposite, but as you have made the system more complex, the world has to be more complex to outsmart you, which may be less likely. You can even check this by seeing whether you are having to do n/4+1 samples, or close to it, on average.

• Will: Yeah, it’s 1/4, thanks. I somehow have a blind spot when it comes to constants. ;-)
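
With the corrected constant, the expectation is 1/(1/4) = 4 queries; a quick check of that number (sampling with replacement, n = 64 chosen arbitrarily):

```python
import random

rng = random.Random(0)
n, trials = 64, 50_000
total = 0
for _ in range(trials):
    # Place the n/4 ones at random positions in, say, the left half.
    ones = set(rng.sample(range(n // 2), n // 4))
    queries = 0
    while True:
        queries += 1
        if rng.randrange(n) in ones:  # each query succeeds with probability 1/4
            break
    total += queries
print(total / trials)  # ≈ 4, the mean of a geometric distribution with p = 1/4
```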

• Let’s say the solver has two input tapes: a “problem instance” input and a “noise source” input.

When Eli says “worst-case time”, he means the maximum time, over all problem instances and all noise sources, taken to solve the problem.

But when Scott says “worst-case time,” he means taking the maximum (over all problem instances X) of the average time (over all noise sources) to solve X.

Scott’s criterion seems relevant to real-world situations where one’s input is being generated by an adversarial human who knows which open-source software you’re using but doesn’t know your random seed (which was perhaps generated by moving your mouse around). Here I don’t think Eli and Scott have any real disagreement, since Eli agrees that randomness can be useful in defeating an adversarial intelligence.

BTW, I think I have a situation where randomness is useful without postulating an adversary (you could argue that the adversary is still present, though).

Say we’re quicksorting a very long list of length N, your worst case is pretty bad, and your best case is not much better than average. You have a Kolmogorov distribution for your input, and you have a long string which you believe to be maximally compressed (i.e., it’s really, really random). If your quicksort code is short enough compared to N, then the worst-case input will have complexity << N, since it can be concisely described as the worst-case input for your sorting algorithm. If you use your random string to permute the list, then it will (probably) no longer have an easy-to-specify worst-case input, so your average case may improve.
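
The suggestion above, sketched in code (the fixed-pivot quicksort is a stand-in for “your sorting algorithm”; an already-sorted list is its concisely describable worst case):

```python
import random

def quicksort_first_pivot(xs):
    """Naive quicksort with the first element as pivot; already-sorted
    input drives it to its quadratic worst case."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    return (quicksort_first_pivot([x for x in rest if x < pivot])
            + [pivot]
            + quicksort_first_pivot([x for x in rest if x >= pivot]))

def shuffled_sort(xs, rng=random):
    """Permute the input with random bits first, as the comment suggests;
    'the worst-case input for this code' then no longer pins down any
    particular concrete input."""
    xs = list(xs)
    rng.shuffle(xs)
    return quicksort_first_pivot(xs)
```

On an already-sorted list the first version does a quadratic number of comparisons, while the shuffled version is back to the O(N log N) average case with overwhelming probability.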

• It sounds to me like Peter is suggesting randomly permuting all inputs to the sorting algorithm, in hopes of improving the average-case runtime. The justification appears to be that the worst case has low complexity because it’s easy to describe, because you can describe it by saying “it’s the worst case for <this algorithm>”.

For example, the worst-case input for Quicksort is an already-sorted list. You could, indeed, get better average performance out of quicksort if you could randomly permute its inputs in zero time, because code is more likely to produce already-sorted lists than lists sorted, say, by { x mod 17 }.

My initial reaction was that the implicit description “that input which is worst case for this algorithm” might specify an input which was not unusually likely to occur in practice. Peter is saying that it is more likely to occur in practice, even if it’s an implicit description (even if we don’t know what the worst-case input is), because it has lower complexity in an absolute sense, and so more processes in the real world will generate that input than one with a higher complexity.

Is that right?

And no fast deterministic permutation can be used instead, because the deterministic permutation could be specified concisely.

In practice, our random number generators use a seed of 16 to 32 bits, so they introduce almost no complexity at all by this criterion. No RNG can be useful for this purpose, for the same reason no deterministic permutation can be useful.

Still, I think this is a conclusive proof that randomness can be useful.

• That last paragraph is pretty unsettling—you get penalized for not throwing a messy obfuscatory step into your algorithm yourself? A Kolmogorov distribution indeed does seem to be a “present adversary”!

• @michael e sullivan: re “Monte Carlo methods can’t buy you any correctness”—actually, they can. If you have an exact closed-form solution (or a rapidly-converging series, or whatever) for your numbers, you really want to use it. However, not all problems have such a thing; generally, you either simplify (giving a precise, incorrect number that is readily computable and hopefully close), or you can do a numerical evaluation, which might approach the correct solution arbitrarily closely based on how much computation power you devote to it.

Quadra­ture (the straight­for­ward way to do nu­mer­i­cal in­te­gra­tion us­ing reg­u­larly-spaced sam­ples) is a nu­meric eval­u­a­tion method which is effi­cient for smooth, low-di­men­sional prob­lems. How­ever, for higher-di­men­sional prob­lems, the num­ber of sam­ples be­comes im­prac­ti­cal. For such difficult prob­lems, Monte Carlo in­te­gra­tion ac­tu­ally con­verges faster, and can some­times be the only fea­si­ble method.
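A quick sketch of the dimension argument (my own illustration; the integrand is an arbitrary example). Monte Carlo’s error shrinks as 1/√(samples) regardless of dimension, whereas a grid with even 10 points per axis would need 10^d evaluations:

```python
import random

def f(x):
    # Example integrand on the unit cube: sum of squared coordinates.
    return sum(xi * xi for xi in x)

def monte_carlo(f, dim, n_samples, rng):
    # Average f over uniform samples from [0,1]^dim. The error scales as
    # 1/sqrt(n_samples) independent of dim, which is why this beats grid
    # quadrature once dim is large (a 10-points-per-axis grid needs
    # 10**dim evaluations).
    total = 0.0
    for _ in range(n_samples):
        total += f([rng.random() for _ in range(dim)])
    return total / n_samples

rng = random.Random(1)
dim = 10
estimate = monte_carlo(f, dim, 20_000, rng)
exact = dim / 3  # E[x^2] = 1/3 per coordinate, summed over dim of them
```

Here 20,000 samples suffice in 10 dimensions; a comparable grid would need 10^10 evaluations.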

Some­what iron­i­cally, one field where Monte Carlo buys you cor­rect­ness is nu­meric eval­u­a­tion of Bayesian statis­ti­cal mod­els!

• Silas is right; Scott keeps chang­ing the defi­ni­tion in the mid­dle, which was ex­actly my origi­nal com­plaint.

For ex­am­ple, Scott says: In the ran­dom­ized case, just keep pick­ing ran­dom bits and query­ing them. After O(1) queries, with high prob­a­bil­ity you’ll have queried ei­ther a 1 in the left half or a 1 in the right half, at which point you’re done.

And yet this is no differ­ent from a de­ter­minis­tic al­gorithm. It can also query O(1) bits, and “with high prob­a­bil­ity” have a cer­tain an­swer.

• I’m really astonished that Scott can’t see the sleight-of-hand in his own statement. Here’s how he expresses the challenge: It’s clear that any deterministic algorithm needs to examine at least n/4 + 1 of the bits to solve this problem. On the other hand, a randomized sampling algorithm can solve the problem with certainty after looking at only O(1) bits on average.

No­tice that tricky fi­nal phrase, on av­er­age? That’s vastly weaker than what he is forc­ing the de­ter­minis­tic al­gorithm to do. The “proof” that a de­ter­minis­tic al­gorithm re­quires n/​4+1 queries re­quires a goal much more difficult than get­ting a quick an­swer “on av­er­age”.

If you’re will­ing to con­sider a goal of an­swer­ing quickly only “on av­er­age”, then a de­ter­minis­tic al­gorithm can also do it just as fast (on av­er­age) as your ran­dom­ized al­gorithm.

• Don, how well a deterministic algorithm does depends upon what the deterministic algorithm is and what the input is; it cannot always answer in O(1) queries.

E.g. in a 20 bit case

If the deterministic algorithm starts its queries from the beginning, querying each bit in turn, and the pattern is always 00000111110000000000

It will always take n/​4+1 queries. Only when the in­put is ran­dom will it on av­er­age take O(1) queries.

The ran­dom one will take O(1) on av­er­age on ev­ery type of in­put.

• Don—the “sleight of hand”, as you put it, is that (as I think has been said before) we’re examining worst-case. We explicitly are allowing the string of 1s and 0s to be picked by an adversarial superintelligence who knows our strategy. Scott’s algorithm only needs to sample 4 bits on average to answer the question even when the universe is out to get him—the deterministic version will need to sample exactly n/4+1.
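The “4 bits on average” claim is easy to check empirically (a sketch of my own; the function name is made up). Each uniformly random query hits a 1 with probability 1/4, so the query count is geometric with mean 4, whatever the input looks like:

```python
import random

def which_half_randomized(bits, rng):
    # Promise: exactly n/4 ones, all in one half of the string. Query
    # uniformly random positions until a 1 turns up; each query hits a 1
    # with probability 1/4, so the expected number of queries is 4,
    # independent of n and of how the input was chosen.
    n = len(bits)
    queries = 0
    while True:
        i = rng.randrange(n)
        queries += 1
        if bits[i] == 1:
            return ("left" if i < n // 2 else "right"), queries

n = 10_000
# Adversarial-looking input: all the ones tucked into the right half.
bits = [0] * (n // 2) + [1] * (n // 4) + [0] * (n // 4)

rng = random.Random(42)
results = [which_half_randomized(bits, rng) for _ in range(1000)]
answers = {a for a, _ in results}
mean_queries = sum(q for _, q in results) / len(results)
```

The mean query count hovers near 4 for n = 10,000 just as it would for any n; the adversary’s choice of where to put the ones doesn’t matter.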

Ba­si­cally, Eliezer seems to be claiming that BPP = P (or pos­si­bly some­thing stronger), which most peo­ple think is prob­a­bly true, but, as has been said, no-one can prove. I for one ac­cept his in­tu­itive ar­gu­ments as, I think, do most peo­ple who’ve thought about it, but prov­ing this in­tu­ition rigor­ously is a ma­jor out­stand­ing prob­lem in com­pu­ta­tional com­plex­ity.

My supervisor’s intuitive argument for why you can never get anything from randomness is that in any particular case where randomness appears to help you can just pick a pseudo-random source which is “random enough” (presumably a formalisation of this intuition is what Scott’s talking about when he says that BPP = P if “really good pseudorandom generators exist”).

• To generalize Peter’s example, a typical deterministic algorithm has low Kolmogorov complexity, and therefore its worst-case input also has low Kolmogorov complexity, and therefore a non-negligible probability under complexity-based priors. The only possible solutions to this problem I can see are:

1. randomize the algorithm, so that it has no single fixed worst-case input
2. redesign the deterministic algorithm so that it has no worst-case input
3. do a cost-benefit analysis to show that the cost of doing either 1 or 2 is not justified by the expected utility of avoiding the worst-case performance of the original algorithm, then continue to use the original deterministic algorithm

The main ar­gu­ment in fa­vor of 1 is that its cost is typ­i­cally very low, so why bother with 2 or 3? I think Eliezer’s coun­ter­ar­gu­ment is that 1 only works if we as­sume that in ad­di­tion to the in­put string, the al­gorithm has ac­cess to a truly ran­dom string with a uniform dis­tri­bu­tion, but in re­al­ity we only have ac­cess to one in­put, i.e., sen­sory in­put from the en­vi­ron­ment, and the so called ran­dom bits are just bits from the en­vi­ron­ment that seem to be ran­dom.

My counter-coun­ter­ar­gu­ment is to con­sider ran­dom­iza­tion as a form of di­vi­sion of la­bor. We use one very com­plex and so­phis­ti­cated al­gorithm to put a lower bound on the Kol­mogorov com­plex­ity of a source of ran­dom­ness in the en­vi­ron­ment, then af­ter that, this source of ran­dom­ness can be used by many other sim­pler al­gorithms to let them cheaply and dra­mat­i­cally re­duce the prob­a­bil­ity of hit­ting a worst-case in­put.

Or to put it an­other way, be­fore ran­dom­iza­tion, the en­vi­ron­ment does not need to be a mal­i­cious su­per­in­tel­li­gence for our al­gorithms to hit worst-case in­puts. After ran­dom­iza­tion, it does.

• Or to put it an­other way, be­fore ran­dom­iza­tion, the en­vi­ron­ment does not need to be a mal­i­cious su­per­in­tel­li­gence for our al­gorithms to hit worst-case in­puts. After ran­dom­iza­tion, it does.

(I agree that this is one among many valid cases of when we might want to just throw in ran­dom­iza­tion to save think­ing time, as op­posed to do­ing a de­tailed de­ran­dom­ized anal­y­sis.)

• Quan­tum branch­ing is “truly ran­dom” in the sense that branch­ing the uni­verse both ways is an ir­re­ducible source of new in­dex­i­cal ig­no­rance. But the more im­por­tant point is that un­less the en­vi­ron­ment re­ally is out to get you, you might be wiser to ex­ploit the ways that you think the en­vi­ron­ment might de­part from true ran­dom­ness, rather than us­ing a noise source to throw away this knowl­edge.

• I’m not making any nonstandard claims about computational complexity, only siding with the minority “derandomize things” crowd.

Note that I also en­thu­si­as­ti­cally be­long to a “de­ran­dom­ize things” crowd! The differ­ence is, I think de­ran­dom­iz­ing is hard work (some­times pos­si­ble and some­times not), since I’m un­will­ing to treat the ran­dom­ness of the prob­lems the world throws at me on the same foot­ing as ran­dom­ness I gen­er­ate my­self in the course of solv­ing those prob­lems. (For those watch­ing at home tonight, I hope the differ­ences are now rea­son­ably clear...)

• I cer­tainly don’t say “it’s not hard work”, and the en­vi­ron­men­tal prob­a­bil­ity dis­tri­bu­tion should not look like the prob­a­bil­ity dis­tri­bu­tion you have over your ran­dom num­bers—it should con­tain cor­re­la­tions and struc­ture. But once you know what your prob­a­bil­ity dis­tri­bu­tion is, then you should do your work rel­a­tive to that, rather than as­sum­ing “worst case”. Op­ti­miz­ing for the worst case in en­vi­ron­ments that aren’t ac­tu­ally ad­ver­sar­ial, makes even less sense than as­sum­ing the en­vi­ron­ment is as ran­dom and un­struc­tured as ther­mal noise.

I would defend the fol­low­ing sort of state­ment: While of­ten it’s not worth the com­put­ing power to take ad­van­tage of all the be­lieved-in reg­u­lar­ity of your prob­a­bil­ity dis­tri­bu­tion over the en­vi­ron­ment, any en­vi­ron­ment that you can’t get away with treat­ing as effec­tively ran­dom, prob­a­bly has enough struc­ture to be worth ex­ploit­ing in­stead of ran­dom­iz­ing.

(This isn’t based on ca­reer ex­pe­rience, it’s how I would state my ex­pec­ta­tion given my prior the­ory.)

• once you know what your prob­a­bil­ity dis­tri­bu­tion is...

I’d merely stress that that’s an enor­mous “once.” When you’re writ­ing a pro­gram (which, yes, I used to do), nor­mally you have only the fog­giest idea of what a typ­i­cal in­put is go­ing to be, yet you want the pro­gram to work any­way. This is not just a hy­po­thet­i­cal worry, or some­thing limited to cryp­tog­ra­phy: peo­ple have ac­tu­ally run into strange prob­lems us­ing pseu­do­ran­dom gen­er­a­tors for Monte Carlo simu­la­tions and hash­ing (see here for ex­am­ple, or Knuth vol 2).

Even so, in­tu­ition sug­gests it should be pos­si­ble to de­sign PRG’s that defeat any­thing the world is likely to throw at them. I share that in­tu­ition; it’s the ba­sis for the (yet-un­proved) P=BPP con­jec­ture.

“Any one who con­sid­ers ar­ith­meti­cal meth­ods of pro­duc­ing ran­dom digits is, of course, in a state of sin.”—von Neumann

• Even if P=BPP, that just means that giv­ing up ran­dom­ness causes “only” a polyno­mial slow­down in­stead of an ex­po­nen­tial one, and in prac­tice we’ll still need to use pseu­do­ran­dom gen­er­a­tors to simu­late ran­dom­ness.

It seems clear to me that noise (in the sense of ran­dom­ized al­gorithms) does have power, but per­haps we need to de­velop bet­ter in­tu­itions as to why that is the case.

• Eliezer: There are a few classes of situ­a­tions I can think of in which it seems like the cor­rect solu­tion (given cer­tain re­stric­tions) re­quires a source of ran­dom­ness:

One ex­am­ple would be Hofs­tadter’s Lur­ing Lot­tery. As­sum­ing all the agents re­ally did have com­mon knowl­edge of each other’s ra­tio­nal­ity, and as­sum­ing no form of com­mu­ni­ca­tion is al­lowed be­tween them in any way, isn’t it true that no de­ter­minis­tic al­gorithm to de­cide how many en­tries to send in, if any at all, has a higher ex­pected re­turn than a ran­dom­ized one? (that is, the solu­tions which are bet­ter are ran­dom­ized ones, rather than ran­dom­ized ones au­to­mat­i­cally be­ing bet­ter solu­tions)

The rea­son be­ing that you’ve got a pris­oner’s dilemma type situ­a­tion there with all the agents us­ing the same de­ci­sion al­gorithm. So any de­ter­minis­tic al­gorithm means they all do the same thing, which means many en­tries are sent in, which means the pot shrinks, while not at all boost­ing any in­di­vi­d­ual’s agent’s ex­pected win­nings.

• The leaders in the Netflix Prize competition for the last couple years have utilized ensembles of large numbers of models with a fairly straightforward integration procedure. You can only get so far with a given model, but if you randomly scramble its hyperparameters or training procedure, and then average multiple runs together, you will improve your performance. The logical path forward is to derandomize this procedure, and figure out how to predict, a priori, which model probabilities become more accurate and which don’t. But of course until you figure out how to do that, random is better than nothing.

As a pro­cess method­ol­ogy, it seems use­ful to try ran­dom vari­a­tions, find the ones that out­perform and THEN seek to ex­plain it.
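A toy sketch of why averaging randomly-scrambled runs helps (my own illustration, not the actual Netflix Prize models; `noisy_model` is a made-up stand-in for one training run). Independent runs share the signal but not the noise, so averaging is pure variance reduction:

```python
import random

rng = random.Random(3)
truth = [rng.random() for _ in range(200)]  # the signal to be predicted

def noisy_model(truth, rng, noise=0.3):
    # Stand-in for one model trained with a scrambled seed or
    # hyperparameters: the true signal plus independent noise.
    return [t + rng.gauss(0, noise) for t in truth]

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

single = noisy_model(truth, rng)
runs = [noisy_model(truth, rng) for _ in range(25)]
averaged = [sum(r[i] for r in runs) / len(runs) for i in range(len(truth))]
```

Averaging 25 runs cuts the independent-noise variance by a factor of 25; any error the runs share (a bias common to the whole model family) is untouched, which is exactly what the derandomized analysis would have to predict.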

• @ Will: You hap­pen to have named a par­tic­u­lar de­ter­minis­tic al­gorithm; that doesn’t say much about ev­ery de­ter­minis­tic al­gorithm. More­over, none of you folks seem to no­tice much that pseudo-ran­dom al­gorithms are ac­tu­ally de­ter­minis­tic too...

Fi­nally, you say: “Only when the in­put is ran­dom will [the de­ter­minis­tic al­gorithm] on av­er­age take O(1) queries. The ran­dom one will take O(1) on av­er­age on ev­ery type of in­put.”

I can’t tell you how frus­trat­ing it is to see you just con­tinue to toss in the “on av­er­age” phrase, as though it doesn’t mat­ter, when in fact it’s the crit­i­cal part.

To be blunt: you have no proof that all deterministic algorithms require n/4+1 steps on average. You only have a proof that deterministic algorithms require n/4+1 steps in the worst case, full stop. Not “in the worst case, on average”.

• DonGeddis is missing the point. Randomness does buy power. The simplest example is sampling in statistics: to estimate the fraction of red-headed people in the world to within 5% precision and with confidence 99%, it is enough to draw a certain number of random samples and compute the fraction of red-headed people in the sample. The number of samples required is independent of the population size. No deterministic algorithm can do this, not even if you try to simulate samples by using good PRNGs, because there are “orderings” of the world’s population that would fool your algorithm. But if you use random bits that are independent of the data, you cannot be fooled. (By the way, whether ‘true randomness’ is really possible in the real world is a totally different issue.)
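The population-independence claim can be made concrete with a Hoeffding bound (a sketch of my own; the helper name is made up). For precision ε and confidence 1−δ, about ln(2/δ)/(2ε²) samples suffice, with no population term anywhere:

```python
import math
import random

def required_samples(eps, delta):
    # Hoeffding bound: P(|estimate - truth| > eps) <= 2*exp(-2*n*eps^2),
    # so n >= ln(2/delta) / (2*eps^2) samples suffice. Note there is no
    # dependence on the population size anywhere in the formula.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

n = required_samples(0.05, 0.01)  # 5% precision, 99% confidence

# Simulate drawing n random people from an arbitrarily large population
# in which 10% have the trait.
rng = random.Random(7)
true_frac = 0.10
hits = sum(1 for _ in range(n) if rng.random() < true_frac)
estimate = hits / n
```

Roughly a thousand random samples give 5%/99% whether the population is a city or a planet.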

Note that, for de­ci­sion prob­lems, ran­dom­ness is not be­lieved to give you a more-than-polyno­mial speedup (this is the P vs BPP ques­tion). But nei­ther sam­pling nor the promise prob­lem Scott men­tioned are de­ci­sion prob­lems (defined on all in­puts).

Regarding your claim that you cannot compare ‘worst-case time’ for deterministic algorithms with ‘worst-case expected time’ for randomized ones: this is totally pointless and shows a total lack of understanding of randomization. No one claims that you can have a randomized algorithm with success probability one that outperforms the deterministic algorithm. So either we use the expected running time of the randomized algorithm, or we allow a small error probability.

We can choose either one and use the same definition of the problem for both deterministic and randomized. Take Scott’s example and suppose we want algorithms that succeed with probability at least 2/3, on every input. The goal is the same in both cases, so they are comparable. In the case of deterministic algorithms, this implies always getting the correct answer. DonGeddis seems not to realize this, and contrary to what he says, any deterministic algorithm, no matter how clever, needs to look at n/4+1 elements in the worst case. (By the way, maybe DonGeddis’s problem is that he doesn’t see the proof of this fact, but it is easily provable with standard techniques.) However, a randomized algorithm that makes a worst-case constant number of queries will succeed with probability 2/3. If we want higher success probability, we can just repeat a constant number of times. Note that I’m not talking about the expected complexity but about worst-case complexity, if the former bothers you; this comes at the expense of a small error probability.

As for Blum’s pa­per, it shows that the best bounds we can prove for ran­dom­ized al­gorithms in that par­tic­u­lar prob­lem out­perform the bounds we can prove for de­ter­minis­tic al­gorithms. Sure, that does not im­ply the former are bet­ter, but it’s the best bound we can prove. To prove that the ran­dom­ized al­gorithm in this case has a bet­ter con­stant, one needs to find a lower bound for the de­ter­minis­tic al­gorithm.

The number of samples required is independent of the population size. No deterministic algorithm can do this, not even if you try to simulate samples by using good PRNGs, because there are “orderings” of the world’s population that would fool your algorithm.

Wait a minute. There are also or­der­ings of the world pop­u­la­tion that will ran­domly fool the ran­dom al­gorithm. (That’s where the 99% comes from.) Or, more pre­cisely, there are sam­plings gen­er­ated by your ran­dom al­gorithm that get fooled by the ac­tual or­der­ing of the world pop­u­la­tion.

If you write a de­ter­minis­tic al­gorithm that sam­ples the ex­act num­ber of sam­ples as the ran­dom one does, but in­stead of sam­pling ran­domly you sam­ple de­ter­minis­ti­cally, spread­ing the sam­ples in a way you be­lieve to be ap­pro­pri­ately uniform, then, as­sum­ing the hu­man pop­u­la­tion isn’t con­spiring to trick you, you should get ex­actly the same 99% con­fi­dence and 5% pre­ci­sion.

(Of course, if you screw up spreading the samples, it means you did worse than random, so there’s something wrong with your model of the world. And if the entire world conspires to mess up your estimate of red hair frequency, and succeeds in doing so without you noticing and altering your sampling, you’ve got bigger problems.)

For example, once n is found (the number of tests needed), you list all humans in order of some suitable tuple of properties (e.g., full name, date and time of birth, population of home city/town/etc.), and then pick n people equally distanced in this list. You’re still going to find the answer, with the same precision and confidence, unless the world population conspired to distribute itself over the world in exactly the right way to mess up your selection. There are all sorts of other selection procedures that are deterministic (they depend only on the population), but are ridiculously unlikely to have any correlation with hair color.
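A minimal sketch of this deterministic “equally distanced” selection (my own illustration; the hash-based trait assignment just stands in for an ordering uncorrelated with the trait). No random bits are used, yet the estimate lands on the true fraction:

```python
def evenly_spaced_sample(population, n):
    # Deterministic "spread" sample: n items taken at equal strides
    # through the list. If the list ordering is uncorrelated with the
    # trait being measured, this behaves much like a random sample,
    # with no random bits needed.
    step = len(population) / n
    return [population[int(i * step)] for i in range(n)]

# A large population whose ~10% trait is assigned by a multiplicative
# hash of the index, i.e. scrambled relative to the list ordering
# (standing in for "ordered by name, birth date, etc.").
population = [1 if (i * 61) % 100 < 10 else 0 for i in range(1_000_000)]

n = 1060  # same sample size the random-sampling bound would give
estimate = sum(evenly_spaced_sample(population, n)) / n
true_frac = 0.10  # exact by construction: 61 is coprime to 100
```

The failure mode is exactly the one flagged above: if the list ordering happens to resonate with the trait (a stride that aliases against some periodicity), the spread sample is worse than random, and only your model of the world tells you whether that can happen.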

Other example: order all cities/towns/etc. by GDP; group them in groups of ~N/n; pick from each group the youngest person (I don’t think n is high enough, but if it happens that a group contains a single city with more than N/n population, pick the two, three, etc. youngest persons). If you’re concerned that people all over the world recently started planning their babies just to mess with you, then pick the median-by-age person instead.

• @ John, @ Scott: You’re still do­ing some­thing odd here. As has been men­tioned ear­lier in the com­ments, you’ve imag­ined a mind-read­ing su­per­in­tel­li­gence … ex­cept that it doesn’t get to see the in­ter­nal ran­dom string.

Look, this should be pretty sim­ple. The phrase “worst case” has a pretty clear lay­man’s mean­ing, and there’s no rea­son we need to de­part from it.

You’re go­ing to get your string of N bits. You need to write an al­gorithm to find the 1s. If your al­gorithm ever gives the wrong an­swer, we’re go­ing to shoot you in the head with a gun and you die. I can write a de­ter­minis­tic al­gorithm that will do this in at most n/​4+1 steps. So we’ll run it on a com­puter that will ex­e­cute at most n/​4+1 queries of the in­put string, and oth­er­wise just halt (with some fixed an­swer). We can run this trillions of times, and I’m never get­ting shot in the head.

Now, you have a pro­posal. You need one ad­di­tional thing: a source of ran­dom bits, as an ad­di­tional in­put to your new al­gorithm. Fine, granted. Now we’re go­ing to point the gun at your head, and run your al­gorithm trillions of times (against ran­dom in­puts). I was only able to write a de­ter­minis­tic al­gorithm; you have the abil­ity to write a ran­dom­ized al­gorithm. Ap­par­ently you think this gives you more power.

Now then, the im­por­tant ques­tion: are you will­ing to run your new al­gorithm on a spe­cial com­puter that halts af­ter fewer than n/​4+1 queries of the in­put string? Do you have con­fi­dence that, in the worst case, your al­gorithm will never need more than, say, n/​4 queries?

No? Then stop mak­ing false com­par­i­sons be­tween the de­ter­minis­tic and the ran­dom­ized ver­sions.
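The deterministic guarantee Don describes can be sketched directly (my own illustration; the function name is made up). Scan the left half: a 1 means “left”; since the left half can hold at most n/4 zeros when the ones live there, the (n/4+1)-th zero proves “right”. At most n/4+1 queries, and the answer is never wrong:

```python
def which_half_deterministic(bits):
    # Promise: exactly n/4 ones, all in the left half or all in the
    # right half. The left half holds n/2 bits, so if the ones live
    # there, it contains only n/4 zeros. Therefore seeing n/4 + 1 zeros
    # in the left half proves the ones are on the right. Worst case:
    # exactly n/4 + 1 queries, and the answer is always correct.
    n = len(bits)
    zeros_seen = 0
    for i in range(n // 2):
        if bits[i] == 1:
            return "left"
        zeros_seen += 1
        if zeros_seen == n // 4 + 1:
            return "right"
    return "right"  # unreachable under the promise

n = 40
easy_left = [1] * (n // 4) + [0] * (n // 4) + [0] * (n // 2)
hard_left = [0] * (n // 4) + [1] * (n // 4) + [0] * (n // 2)
right_case = [0] * (n // 2) + [1] * (n // 4) + [0] * (n // 4)
```

This is the algorithm that survives the gun-to-the-head game: run it on a computer that halts after n/4+1 queries and it still never errs.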

• To look at it one more time … Scott origi­nally said Sup­pose you’re given an n-bit string, and you’re promised that ex­actly n/​4 of the bits are 1, and they’re ei­ther all in the left half of the string or all in the right half.

So we have a whole set of de­ter­minis­tic al­gorithms for solv­ing the prob­lem over here, and a whole set of ran­dom­ized al­gorithms for solv­ing the same prob­lem. Take the best de­ter­minis­tic al­gorithm, and the best ran­dom­ized al­gorithm.

Some peo­ple want to claim that the best ran­dom­ized al­gorithm is “prov­ably bet­ter”. Really? Bet­ter in what way?

Is it better in the worst case? No, in the very worst case, any algorithm (randomized or not) is going to need to look at n/4+1 bits to get the correct answer. Even worse! The randomized algorithms people were suggesting, with low average complexity, will—in the very worst case—need to look at more than n/4+1 bits, just because there are 3n/4 zero bits, and the algorithm might get very, very unlucky.

OK, so randomized algorithms are clearly not better in the worst case. What about the average case? To begin with, nobody here has done any average-case analysis. But I challenge any of you to prove that every deterministic algorithm on this problem is necessarily worse, on average, than some (one?) randomized algorithm. I don’t believe that is the case.

So what do we have left? You had to invent a bizarre scenario, supposing that the environment is an adversarial superintelligence who can perfectly read all of your mind except bits designated ‘random’, as Eliezer says, in order to find a situation where the randomized algorithm is provably better. OK, that proof works, but why is that scenario at all interesting to any real-world application? The real world is never actually in that situation, so it’s highly misleading to use it as a basis for concluding that randomized algorithms are “provably better”.

No, what you need to do is ar­gue that the pat­tern of queries used by a (or any?) de­ter­minis­tic al­gorithm, is more likely to be anti-cor­re­lated with where the 1 bits are in the en­vi­ron­ment’s in­puts, than the pat­tern used by the ran­dom­ized al­gorithm. In other words, it seems you have some pri­ors on the en­vi­ron­ment, that the in­puts are not uniformly dis­tributed, nor cho­sen with any rea­son­able dis­tri­bu­tion, but are in fact nega­tively cor­re­lated with the de­ter­minis­tic al­gorithm’s choices. And the con­clu­sion, then, would be that the av­er­age case perfor­mance of the de­ter­minis­tic al­gorithm is ac­tu­ally worse than the av­er­age com­puted as­sum­ing a uniform dis­tri­bu­tion of in­puts.

Now, this does hap­pen some­times. If you im­ple­ment a bub­ble sort, it’s not un­rea­son­able to guess that the al­gorithm might be given a list sorted in re­verse or­der, much more of­ten than pick­ing from a ran­dom dis­tri­bu­tion would sug­gest.

And similarly, if the n-bits al­gorithm starts look­ing at bit #1, then #2, etc. … well, it isn’t at all un­rea­son­able to sup­pose that around half the in­puts will have all the 1 bits in the right half of the string, so the naive al­gorithm will be forced to ex­hibit worst-case perfor­mance (n/​4+1 bits ex­am­ined) far more of­ten than per­haps nec­es­sary.

But this is an ar­gu­ment about av­er­age case perfor­mance of a par­tic­u­lar de­ter­minis­tic al­gorithm. Espe­cially given some in­sight into what in­puts the en­vi­ron­ment is likely to provide.

This has not been an ar­gu­ment about worst case perfor­mance of all de­ter­minis­tic al­gorithms vs. (all? any? one?) ran­dom­ized al­gorithm.

Which is what other com­menters have been in­cor­rectly as­sert­ing from the be­gin­ning, that you can “add power” to an al­gorithm by “adding ran­dom­ness”.

Maybe you can, maybe you can’t. (I hap­pen to think it highly un­likely.) But they sure haven’t shown it with these ex­am­ples.

• Don, you missed my comment saying you could bound the randomized algorithm in the same way to n/4+1, by keeping track of where you pick; if you find n/4+1 zeros in one half, you conclude the ones are in the other half.

I wouldn’t say that a randomized algorithm is better per se, just more generally useful than a particular deterministic one. You don’t have to worry about the types of inputs given to it.

This case mat­ters be­cause I am a soft­ware cre­ator and I want to reuse my soft­ware com­po­nents. In most cases I don’t care about perfor­mance of ev­ery lit­tle sub com­po­nent too much. Sure it is not “op­ti­mal”, but me spend­ing time on op­ti­miz­ing a pro­cess that only hap­pens once is not worth it in the real world!

Ob­sess­ing about op­ti­miza­tion is not always the way to win.

• Don, if n is big enough then sure, I’m more than will­ing to play your game (as­sum­ing I get some­thing out of it if I win).

My ran­domised al­gorithm will get the right an­swer within 4 tries in ex­pec­ta­tion. If I’m al­lowed to sam­ple a thou­sand bits then the prob­a­bil­ity that it won’t have found the an­swer by then is .75^1000 which I think is some­thing like 10^-120. And this ap­plies even if you’re al­lowed to choose the string af­ter look­ing at my al­gorithm.

If you play the game this way with your de­ter­minis­tic al­gorithm you will always need to sam­ple n/​4 + 1 bits—I can put the ones where you’ll never find them. If you don’t like this game, fine don’t do com­pu­ta­tional com­plex­ity, but there’s noth­ing wrong with Scott’s ar­gu­ment.

Yes, we as­sume that you can’t pre­dict what ran­dom bits I will gen­er­ate be­fore I gen­er­ate them, but that’s pretty much what ran­dom means, so I don’t see why it’s such a big deal.
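Checking the arithmetic here (my own sketch): each random query misses the ones with probability 3/4, so a thousand queries fail with probability 0.75^1000, whose exponent comes out around −125, consistent with the “something like 10^-120” order-of-magnitude claim:

```python
import math

# One random query misses the ones with probability 3/4 (under the
# promise, 3/4 of the bits are 0). After 1000 independent queries the
# chance of never having hit a 1 is 0.75**1000; compute its base-10
# exponent rather than the underflowing value itself.
queries = 1000
log10_fail = queries * math.log10(0.75)

expected_tries = 1 / 0.25  # geometric distribution: 4 tries on average
```

And this failure bound holds even against an input string chosen by someone who has read the algorithm, as long as they haven’t seen the random bits.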

• @ Will: Yes, you’re right. You can make a ran­dom­ized al­gorithm that has the same worst-case perfor­mance as the de­ter­minis­tic one. (It may have slightly im­paired av­er­age case perfor­mance com­pared to the al­gorithm you folks had been dis­cussing pre­vi­ously, but that’s a trade­off one can make.) My only point is that con­clud­ing that the ran­dom­ized one is nec­es­sar­ily bet­ter, is far too strong a con­clu­sion (given the ev­i­dence that has been pre­sented in this thread so far).

But sure, you are cor­rect that adding a ran­dom search is a cheap way to have good con­fi­dence that your al­gorithm isn’t ac­ci­den­tally nega­tively cor­re­lated with the in­puts. So if you’re go­ing to reuse the al­gorithm in a lot of con­texts, with lots of differ­ent in­put dis­tri­bu­tions, then ran­dom­iza­tion can help you achieve av­er­age perfor­mance more of­ten than (some kinds of) de­ter­minism, which might oc­ca­sion­ally have the bad luck to set­tle into worst-case perfor­mance (in­stead of av­er­age) for some of those dis­tri­bu­tions.

But that’s not the same as say­ing that it has bet­ter worst-case com­plex­ity. (It’s ac­tu­ally say­ing that the ran­dom­ized one has bet­ter av­er­age case com­plex­ity, for the dis­tri­bu­tions you’re con­cerned about.)

• @ John: can you re­ally not see the differ­ence be­tween “this is guaran­teed to suc­ceed”, vs. “this has only a tiny like­li­hood of failure”? Those aren’t the same state­ments.

“If you play the game this way”—but why would any­one want to play a game the way you’re de­scribing? Why is that an in­ter­est­ing game to play, an in­ter­est­ing way to com­pare al­gorithms? It’s not about worst case in the real world, it’s not about av­er­age case in the real world. It’s about perfor­mance on a sce­nario that never oc­curs. Why judge al­gorithms on that ba­sis?

As for predicting the random bits … Look, you can do whatever you want inside your algorithm. Your queries on the input bits are like sensors into an environment. Why can’t I place the bits after you ask for them? And then just move the 1 bits away from wherever you happened to decide to ask?

The point is, that you de­cided on a sce­nario that has zero rele­vance to the real world, and then did some math about that sce­nario, and thought that you learned some­thing about the al­gorithms which is use­ful when ap­ply­ing them in the real world.

But you didn’t. Your math is ir­rele­vant to how these things will perform in the real world. Be­cause your sce­nario has noth­ing to do with any ac­tual sce­nario that we see in de­ploy­ment.

(Just as an ex­am­ple: you still haven’t ac­knowl­edged the differ­ence be­tween real ran­dom sources—like a quan­tum counter—vs. PRNGs—which are ac­tu­ally de­ter­minis­tic! Yet if I pre­sented you with a “ran­dom­ized al­gorithm” for the n-bit prob­lem, which ac­tu­ally used a PRNG, I sus­pect you’d say “great! good job! good com­plex­ity”. Even though the ac­tual real al­gorithm is de­ter­minis­tic, and goes against ev­ery­thing you’ve been os­ten­si­bly ar­gu­ing dur­ing this whole thread. You need to un­der­stand that the real key is: ex­pected (anti-)cor­re­la­tions be­tween the de­ter­minis­tic choices of the al­gorithm, and the in­put data. PRNGs are suffi­cient to drop the ex­pected (anti-)cor­re­la­tions low enough for us to be happy.)

• @Scott: would it be fair to say that the P=BPP ques­tion boils down to whether one can make a “se­cure” ran­dom source such that the ad­ver­sar­ial en­vi­ron­ment can’t pre­dict it? Is that why the prob­lem is equiv­a­lent to hav­ing good PRNG? (I’m a physi­cist, but who thinks he knows some of the ter­minol­ogy of com­plex­ity the­ory—apolo­gies if I’ve mi­s­un­der­stood the terms.)

• I can’t help but think that the reason why randomization appears to add information in the context of machine-learning classifiers for the dichotomous case is that the actual evaluation of the algorithms is empirical and, as such, uses finite data. Two classes of non-demonic error always exist in the finite case. These are measurement error and sampling error.

Mul­ti­at­tribute clas­sifiers use in­di­vi­d­ual mea­sure­ments from many at­tributes. No em­piri­cal mea­sure­ment is made with­out er­ror; the more at­tributes that are mea­sured, the more mea­sure­ment er­ror in­trudes.

No finite sample is an exact sample of the entire population. To the measurement error for each attribute, we must add sampling error. Random selection of individuals for training-set samples, and for test-set samples, leads to data sets that are approximations of the sample. The sample is an approximation of the population.

Indeed, the entire finite population is but a sample of what the population could be; given time, populations change.

So, the generalizable performance (say, accuracy) of AI algorithms is hard to nail down precisely, unless it is a perfect classifier (fusion = 0 → planets vs. fusion = 1 → suns, for example, at a specific time).

I con­tend, there­fore, that some of the ‘im­prove­ment’ ob­served due to ran­dom­iza­tion is ac­tu­ally overfit of the pre­dic­tion model to the finite sam­ple.

I’m in the field of bioin­for­mat­ics, and in biol­ogy, ge­net­ics, ge­nomics, pro­teomics, medicine, we always ad­mit to mea­sure­ment er­ror.

• DonGed­dis is miss­ing the point. Ran­dom­ness does buy power. The sim­plest ex­am­ple is sam­pling in statis­tics: to es­ti­mate the frac­tion of read-headed peo­ple in the world to within 5% pre­ci­sion and with con­fi­dence 99%, it is enough to draw a cer­tain num­ber of ran­dom sam­ples and com­pute the frac­tion of read-headed peo­ple in the sam­ple. The num­ber of sam­ples re­quired is in­de­pen­dent of the pop­u­la­tion size. No de­ter­minis­tic al­gorithm can do this, not even if you try to simu­late sam­ples by us­ing good PRNGs, be­cause there are ``or­der­ings″ of the world’s pop­u­la­tion that would fool your al­gorithm. But if you use ran­dom bits that are in­de­pen­dent of the data, you can­not be fooled. (By the way, whether ‘true ran­dom­ness’ is re­ally pos­si­ble in the real world is a to­tally differ­ent is­sue.)

Note that, for decision problems, randomness is not believed to give you a more-than-polynomial speedup (this is the P vs. BPP question). But neither sampling nor the promise problem Scott mentioned are decision problems (defined on all inputs).

Regarding your claim that you cannot compare ‘worst-case time’ for deterministic algorithms with ‘worst-case expected time’ for randomized ones: this is totally pointless and shows a total lack of understanding of randomization. No one claims that you can have a randomized algorithm with success probability one that outperforms the deterministic algorithm. So either we use the expected running time of the randomized algorithm, or we allow a small error probability.

We can choose either one and use the same definition of the problem for both deterministic and randomized algorithms. Take Scott’s example and suppose we want algorithms that succeed with probability at least 2/3 on every input. The goal is the same in both cases, so they are comparable. In the case of deterministic algorithms, this implies always getting the correct answer. DonGeddis seems not to realize this, and contrary to what he says, any deterministic algorithm, no matter how clever, needs to look at n/4+1 elements in the worst case. (By the way, maybe DonGeddis’s problem is that he doesn’t see the proof of this fact, but it is easily provable with standard techniques.) However, a randomized algorithm that makes a worst-case constant number of queries will succeed with probability 2/3. If we want a higher success probability, we can just repeat a constant number of times. Note that I’m not talking about the expected complexity but about worst-case complexity, if the former bothers you; this comes at the expense of a small error probability.
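Scott’s original promise problem isn’t quoted here, so as a stand-in, here is a sketch on a similar promise problem (my own choice: distinguish an all-zeros string from one that is exactly half ones), where a constant number of random queries succeeds with probability at least 3/4 on every input, of every length:

```python
import random

def classify(bits, k=2):
    """Randomized tester making k uniformly random queries.
    Promise: bits is either all zeros, or exactly half ones.
    If all zeros: always correct. If half ones: each query misses
    the ones with probability 1/2, so the error probability is
    (1/2)**k -- constant queries, success >= 3/4 for k=2,
    independent of len(bits)."""
    if any(bits[random.randrange(len(bits))] == 1 for _ in range(k)):
        return 'half-ones'
    return 'all-zeros'
```

A deterministic algorithm for this variant must, in the worst case, query a constant *fraction* of the string before it can rule out the half-ones case.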

As for Blum’s paper, it shows that the best bounds we can prove for randomized algorithms on that particular problem outperform the bounds we can prove for deterministic algorithms. Sure, that does not imply the former are better, but it’s the best bound we can prove. To prove that the randomized algorithm in this case has a better constant, one needs to find a lower bound for the deterministic algorithm.
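For reference, the bounds in question (from Blum’s survey, if I recall them correctly) are roughly 2.41(m + lg n) mistakes for the deterministic weighted majority algorithm versus 1.39m + 2 ln n expected mistakes for the randomized version, where m is the number of mistakes of the best expert and n the number of experts. A minimal sketch of the deterministic version (variable names are mine):

```python
def weighted_majority(expert_preds, outcomes, beta=0.5):
    """Deterministic Weighted Majority (Littlestone & Warmuth):
    predict with the weighted vote, then multiply the weight of
    every expert that erred by beta. Returns the number of
    mistakes the combined predictor made."""
    n = len(expert_preds[0])        # number of experts
    w = [1.0] * n                   # one weight per expert
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        vote1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        vote0 = sum(wi for wi, p in zip(w, preds) if p == 0)
        guess = 1 if vote1 >= vote0 else 0
        if guess != y:
            mistakes += 1
        w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return mistakes
```

The randomized version differs only in the prediction step: instead of taking the weighted vote, it follows expert i with probability proportional to w[i].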

• because there are “orderings” of the world’s population that would fool your algorithm

So don’t use those orderings. That randomness isn’t buying you power. It’s acting as though you know nothing about the distribution of red-headed people. If you do, in fact, know something about the distribution of red-headed people (e.g., there are more red-headed people than average in Ireland and fewer in Africa, therefore make sure each sample draws proportionally from Ireland, Africa, and non-Ireland-non-Africa), then you can exploit that to make your prediction more reliable.

• When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy.

It can be handy even if it’s not adversarial, and even if it is exactly as smart as you are but no smarter. Suppose you’re playing a game with payoffs (A,A) = (0,0), (A,B) = (1,1), (B,A) = (1,1), (B,B) = (0,0) against a clone of yourself (and you can’t talk to each other); the mixed strategy where you pick A half of the time and B half of the time wins half of the time, but either pure strategy always loses. This even though the other player isn’t adversarial here—in fact, his utility function is exactly the same as yours.
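The arithmetic of that example can be checked directly; a small sketch (assuming, as stated, that both clones must play the same strategy):

```python
def payoff(a, b):
    """Both players score 1 iff they choose DIFFERENT actions."""
    return 1 if a != b else 0

def expected_payoff(p):
    """Expected payoff when both clones independently play A with
    probability p (being clones, they must use the same p).
    P(different actions) = p*(1-p) + (1-p)*p."""
    return 2 * p * (1 - p)
```

The pure strategies are p = 0 and p = 1, both of which yield payoff 0 against a clone; p = 1/2 maximizes the expected payoff at 1/2.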

• Sometimes the environment really is adversarial, though.

Regular expressions as implemented in many programming languages work fine for almost all inputs, but with terrible worst-case bounds. This is why libraries like RE2 are needed if you want to let anyone on the Internet use regular expressions in your search engine.

Another example that comes to mind: quicksort runs in n·log n expected time on any input if the input is randomly shuffled beforehand (or, equivalently, if pivots are chosen at random).
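A minimal sketch of the random-pivot variant (not in-place, for clarity):

```python
import random

def randomized_quicksort(a):
    """Quicksort with a uniformly random pivot: expected O(n log n)
    comparisons on EVERY input, including adversarially ordered
    ones. Equivalent to shuffling the input first and then
    pivoting deterministically."""
    if len(a) <= 1:
        return list(a)
    pivot = random.choice(a)
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return randomized_quicksort(less) + equal + randomized_quicksort(greater)
```

With a fixed pivot rule (say, always the first element), an adversary can feed the worst-case ordering and force O(n²); the random pivot makes the expected time independent of the input ordering.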

• The Internet can be pretty scary, but it is still very much short of an adversarial superintelligence. Sure, sometimes people will want to bring your website down, and sometimes it could be an organized effort, but even that is nothing like a genuine worst case.

• OK, granted, in the case of general regexes we’re talking about upper bounds that are exponential in the worst case, and it’s not very difficult to come up with really bad regex cases, so in that case a single query could cause you major issues.

That doesn’t necessarily mean you should switch to a library that avoids those worst cases, though. You could simply reject the kinds of regexes that could potentially cause those issues, or run your regex matcher with a timeout and have the search fail whenever the timer expires.
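A rejection filter along those lines might look like the following sketch; the nested-quantifier heuristic here is my own crude assumption, not a complete ReDoS test (a production system would use a timeout or a linear-time engine like RE2 instead):

```python
import re

# Crude heuristic: reject patterns where a quantifier applies to a
# group that itself contains a quantifier, e.g. (a+)+ or (a*)* --
# the classic catastrophic-backtracking shape. This is only a
# sketch; it has both false positives and false negatives.
NESTED_QUANTIFIER = re.compile(r'\([^)]*[+*][^)]*\)\s*[+*]')

def safe_search(pattern, text):
    """Run re.search, but refuse patterns that match the
    nested-quantifier heuristic above."""
    if NESTED_QUANTIFIER.search(pattern):
        raise ValueError("pattern rejected: nested quantifiers")
    return re.search(pattern, text)
```

The timeout approach is the more robust of the two, since no simple syntactic filter catches every pathological pattern.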

• I wouldn’t mind knowing the reason I was downvoted here.

• Seems like noise to me. Only one person cared enough about what you wrote to vote, and they disliked your post. Upvoted back to zero, since the comments seem pretty reasonable to me.

• Thanks; it didn’t occur to me to think of it statistically, although it seems obvious now.

That said, individual downvotes / upvotes still have reasons behind them; it’s just that you can’t generally expect to find out what they are except when they come in significant clusters.

• If you have a search engine, you don’t want to let people search your whole corpus anyway; you usually want to search an index.

• So let me see if I’ve got this straight.

Computer Scientists: For some problems, there are randomized algorithms with the property that they succeed with high probability for any possible input. No deterministic algorithm for these problems has this property. Therefore, randomized algorithms are superior.

Eliezer: But if we knew the probability distribution over possible inputs, we could create a deterministic algorithm with the property.

Computer Scientists: But we do not know the probability distribution over possible inputs.

Eliezer: Never say “I don’t know”! If you are in a state of ignorance, use an ignorance prior.

Now of course the key question is what sort of ignorance prior we should use. In Jaynes’s book, usually ignorance priors with some sort of nice invariance properties are used, which makes calculations simpler. For example, if the input is a bit stream then we could assume that the bits are independent coinflips. However, in real life this does not correspond to a state of ignorance, but rather to a state of knowledge where we know that the bit stream does not contain predictable correlations. For example, the probability of 1000 zeros in a row according to this ignorance prior is about 10^{-301}, which is not even remotely close to the intuitive probability.
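The figure is easy to check: under the iid fair-coin prior, a run of 1000 zeros has probability 2^-1000, i.e. about 10^-301:

```python
import math

# P(1000 independent fair coinflips all come up zero) = 2**-1000.
# In log10 terms: -1000 * log10(2) ~= -301.03, i.e. ~10**-301 --
# astronomically smaller than the intuitive probability that a
# real-world bit stream repeats.
log10_p = -1000 * math.log10(2)
```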

The next step is to try to create an ignorance prior which somehow formalizes Occam’s razor. As anyone familiar with MIRI’s work on the problem of logical priors should know, this is more difficult than it sounds. Essentially, the best solution so far (see “Logical Induction”, Garrabrant et al. 2016) is to create a list of all of the ways the environment could contain predictable correlations (or in the language of this post, resemble an “adversarial telepath”), and then trade them off against each other to get a final probability. One of the main downsides of the algorithm is that it is not possible to list all of the possible ways the environment could be correlated (since there are infinitely many), so you have to limit yourself to taking a feasible sample.

Now, it is worth noting that the above paper is concerned with an “environment” which is really just the truth-values of mathematical statements! It is hard to see how this environment resembles any sort of “adversarial telepath”. But if we want to maintain the ethos of this post, it seems that we are forced to this conclusion. Otherwise, an environment with logical uncertainty could constitute a counterexample to the claim that randomness never helps.

To be precise, let f be a mathematical formula with one free variable representing an integer, and suppose we are given access to an oracle which can tell us the truth-values of the statements f(1),...,f(N). The problem is to compute (up to a fixed accuracy) the proportion of statements which are true, with the restriction that we can only make n queries, where n << N. Monte Carlo methods succeed with failure probability exponentially small in n, regardless of what f is.
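A sketch of the Monte Carlo estimator described here, with the Hoeffding-style failure bound in a comment (the oracle interface is my own framing):

```python
import random

def monte_carlo_fraction(oracle, N, n):
    """Estimate the fraction of true statements among
    f(1), ..., f(N) using n random oracle queries. By the Hoeffding
    bound, the estimate is off by more than eps with probability at
    most 2*exp(-2*n*eps**2): the failure probability falls
    exponentially in n, no matter what f is."""
    hits = sum(oracle(random.randint(1, N)) for _ in range(n))
    return hits / n
```

For example, with `oracle = lambda i: i % 2 == 0` (a stand-in for a formula true on the even integers), a few thousand queries pin the true fraction 1/2 down tightly even when N is enormous.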

Now suppose that f is determined by choosing m bits randomly, where m << n, and interpreting them as a mathematical formula (throwing out the results and trying again if it is impossible to do so). Then if the minimum failure probability is nonzero, it is exponentially small in m, not n, and therefore bigger than the failure probability for Monte Carlo methods. However, any algorithm which can be encoded in less than ~m bits fails with nonzero probability, for diagonalization reasons.

In fact, the diagonalization failure is one of the least important ones; the main point is that you just don’t know enough about the environment to justify writing any particular algorithm. Any deterministically chosen sequence has a good chance of being correlated with the environment, just because the environment is math and math things are often correlated. Now, we can call this an “adversarial telepath” if we want, but it seems to occur often enough in real life that this designation hardly seems to “carve reality at its joints”.

TL;DR: If even math can be called an “adversarial telepath”, the term seems to have lost all its meaning.