The Weighted Majority Algorithm

Followup to: Worse Than Random, Trust In Bayes

In the wider field of Artificial Intelligence, it is not universally agreed and acknowledged that noise hath no power. Indeed, the conventional view in machine learning is that randomized algorithms sometimes perform better than unrandomized counterparts and there is nothing peculiar about this. Thus, reading an ordinary paper in the AI literature, you may suddenly run across a remark: “There is also an improved version of this algorithm, which takes advantage of randomization...”

Now for myself I will be instantly suspicious; I shall inspect the math for reasons why the unrandomized algorithm is being somewhere stupid, or why the randomized algorithm has a hidden disadvantage. I will look for something peculiar enough to explain the peculiar circumstance of a randomized algorithm somehow doing better.

I am not completely alone in this view. E. T. Jaynes, I found, was of the same mind: “It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.” Apparently there’s now a small cottage industry in derandomizing algorithms. But so far as I know, it is not yet the majority, mainstream view that “we can improve this algorithm by randomizing it” is an extremely suspicious thing to say.

Let us now consider a specific example—a mainstream AI algorithm where there is, apparently, a mathematical proof that the randomized version performs better. By showing how subtle the gotcha can be, I hope to convince you that, even if you run across a case where the randomized algorithm is widely believed to perform better, and you can’t find the gotcha yourself, you should nonetheless trust that there’s a gotcha to be found.

For our particular example we shall examine the weighted majority algorithm, first introduced by Littlestone and Warmuth, but as a substantially more readable online reference I will use section 2 of Blum’s On-Line Algorithms in Machine Learning. (Blum has an easier-to-read, more concrete version of the proofs; and the unrandomized and randomized algorithms are presented side-by-side rather than far apart.)

The weighted majority algorithm is an ensemble method—a way to combine the advice from several other algorithms or hypotheses, called “experts”. Experts strive to classify a set of environmental instances into “positive” and “negative” categories; each expert produces a prediction for each instance. For example, each expert might look at a photo of a forest, and try to determine whether or not there is a camouflaged tank in the forest. Expert predictions are binary, either “positive” or “negative”, with no probability attached. We do not assume that any of our experts is perfect, and we do not know which of our experts is the best. We also do not assume anything about the samples (they may not be independent and identically distributed).

The weighted majority algorithm initially assigns an equal weight of 1 to all experts. On each round, we ask all the experts for their predictions, and sum up the weights for each of the two possible predictions, “positive” or “negative”. We output the prediction that has the higher weight. Then, when we see the actual result, we multiply by 1/2 the weight of every expert that got the prediction wrong.
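
Here is a minimal sketch of that procedure in Python, just to make the bookkeeping concrete. The representation of experts as functions returning +1 or -1, and all the names, are my own illustrative choices rather than anything taken from Blum’s paper:

def weighted_majority(experts, stream):
    """Unrandomized weighted majority (Littlestone & Warmuth).

    `experts` is a list of functions mapping an instance to +1 or -1;
    `stream` yields (instance, true_label) pairs with true_label in {+1, -1}.
    Returns the number of mistakes made and the final expert weights.
    """
    weights = [1.0] * len(experts)
    mistakes = 0
    for instance, label in stream:
        preds = [expert(instance) for expert in experts]
        w_pos = sum(w for w, p in zip(weights, preds) if p == +1)
        w_neg = sum(w for w, p in zip(weights, preds) if p == -1)
        # Output whichever prediction carries more expert weight.
        guess = +1 if w_pos >= w_neg else -1
        if guess != label:
            mistakes += 1
        # Halve the weight of every expert that guessed wrong.  Note that
        # this update depends only on the true label, never on our own guess.
        weights = [w * 0.5 if p != label else w
                   for w, p in zip(weights, preds)]
    return mistakes, weights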

Suppose the total number of experts is n, and the best expert makes no more than m mistakes over some given sampling sequence. Then we can prove that the weighted majority algorithm makes a total number of mistakes M that is bounded by 2.41*(m + log2(n)). In other words, the weighted majority algorithm makes no more mistakes than the best expert, plus log2(n), times a constant factor of 2.41.

Proof (taken from Blum 1996; a similar proof appears in Littlestone and Warmuth 1989): The combined weight of all the experts at the start of the problem is W = n. If the weighted majority algorithm makes a mistake, at least half the total weight of experts predicted incorrectly, so the total weight is reduced by a factor of at least 1/4. Thus, after the weighted majority algorithm makes M mistakes, its total weight W has been reduced to at most n*( (3/4)^M ). In other words:

W <= n*( (3/4)^M )

But if the best expert has made at most m mistakes, its weight is at least (1/2)^m. And since W includes the weight of all experts,

W >= (1/2)^m

Therefore:

(1/2)^m <= W <= n*( (3/4)^M )
(1/2)^m <= n*( (3/4)^M )
-m <= log2(n) + M*log2(3/4)
-m <= log2(n) - M*log2(4/3)
M*log2(4/3) <= m + log2(n)
M <= (1/log2(4/3))*(m + log2(n))
M <= 2.41*(m + log2(n))

Blum then says that “we can achieve a better bound than that described above”, by randomizing the algorithm to predict “positive” or “negative” with probability proportional to the weight assigned each prediction. Thus, if 3/4 of the expert weight went to “positive”, we would predict “positive” with probability 75%, and “negative” with probability 25%.
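
In code, the only change is the prediction rule; the halving of mistaken experts’ weights stays exactly the same. A sketch, again with my own illustrative names:

import random

def randomized_prediction(w_pos, w_neg, rng=random):
    """Predict +1 with probability proportional to the weight voting +1,
    and -1 otherwise.  This replaces the deterministic majority vote in the
    sketch above; the weight update is unchanged."""
    return +1 if rng.random() < w_pos / (w_pos + w_neg) else -1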

An essentially similar proof, summing over the expected probability of a mistake on each round, will show that in this case:

M <= 1.39m + 2 ln(n)    (note: M is now an expectation)

Since the penalty applied to particular experts does not depend on the global prediction but only on the actual outcome, most of the proof proceeds as before. We have

W >= (1/2)^m

where again m is the number of mistakes made by the best expert and W is the total weight of all experts.

We also have that W is the starting weight n times the product of (1 − 1/2 F_i), where F_i is the fraction of mistaken expert weight on round i:

W = n * Prod_i (1 − 1/2 F_i)

And since we predict with probability proportional to the expert weight, the probability that we make a mistake on round i is exactly F_i, so the expected number of mistakes is just the sum over F_i:

M = Sum_i F_i

So:

(1/2)^m <= n * Prod_i (1 − 1/2 F_i)
-m*ln(2) <= ln(n) + Sum_i ln(1 − 1/2 F_i)
Sum_i -ln(1 − 1/2 F_i) <= ln(n) + m*ln(2)
Sum_i (1/2 F_i) <= ln(n) + m*ln(2)
(because x <= -ln(1 − x))
Sum_i F_i <= 2*(ln(n) + m*ln(2))
M <= 1.39m + 2 ln(n)

Behold, we have done better by randomizing!
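
To see the size of the gap, plug in some illustrative numbers of my own choosing (they are not from Blum’s paper): n = 100 experts, with the best expert making m = 20 mistakes. The two bounds then work out to

M \le 2.41\,(20 + \log_2 100) \approx 2.41 \times 26.6 \approx 64 \quad \text{(unrandomized)}
\mathbb{E}[M] \le 1.39 \times 20 + 2 \ln 100 \approx 27.8 + 9.2 \approx 37 \quad \text{(randomized)}

so the proven guarantee for the randomized version is roughly forty percent smaller.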

We should be especially suspicious that the randomized algorithm guesses with probability proportional to the expert weight assigned. This seems strongly reminiscent of betting with 70% probability on blue, when the environment is a random mix of 70% blue and 30% red cards. We know the best bet—and yet we only sometimes make this best bet, at other times betting on a condition we believe to be less probable.

Yet we thereby prove a smaller upper bound on the expected error. Is there an algebraic error in the second proof? Are we extracting useful work from a noise source? Is our knowledge harming us so much that we can do better through ignorance?

Maybe the unrandomized algorithm fails to take into account the Bayesian value of information, and hence fails to exhibit exploratory behavior? Maybe it doesn’t test unusual cases to see if they might have surprising results?

But, examining the assumptions, we see that the feedback we receive is fixed, regardless of which prediction we output. Nor does the updating of the expert weights depend on the predictions we output. It doesn’t matter whether we substitute a completely random string of 1s and 0s as our actual output. We get back exactly the same data from the environment, and the weighted majority algorithm updates the expert weights in exactly the same way. So that can’t be the explanation.

Are the component experts doing worse than random, so that by randomizing our predictions, we can creep back up toward maximum entropy? But we didn’t assume anything about how often the component experts were right, or use that fact in the proofs. Therefore the proofs would carry through even if we specified that all experts were right at least 60% of the time. It’s hard to see how randomizing our predictions could help, in that case—but the proofs still go through, unchanged.

So where’s the gotcha?

Maybe I’m about to tell you that I looked, and I couldn’t find the gotcha either. What would you believe, in that case? Would you think that the whole thesis on the futility of randomness was probably just wrong—a reasonable-sounding philosophical argument that simply wasn’t correct in practice? Would you despair of being able to follow the math, and give up and shrug, unable to decide who might be right?

We don’t always see the flaws right away, and that’s something to always remember.

In any case, I’m about to start explaining the gotcha. If you want to go back, read the paper, and look yourself, you should stop reading now...

There does exist a rare class of occasions where we want a source of “true” randomness, such as a quantum measurement device. For example, you are playing rock-paper-scissors against an opponent who is smarter than you are, and who knows exactly how you will be making your choices. In this condition it is wise to choose randomly, because any method your opponent can predict will do worse-than-average. Assuming that the enemy knows your source code has strange consequences: the action sequence 1, 2, 1, 3 is good when derived from a quantum noise generator, but bad when derived from any deterministic algorithm, even though it is the same action sequence. Still it is not a totally unrealistic situation. In real life, it is, in fact, a bad idea to play rock-paper-scissors using an algorithm your opponent can predict. So are we, in this situation, deriving optimization from noise?

The random rock-paper-scissors player does not play cleverly, racking up lots of points. It does not win more than 1/3 of the time, or lose less than 1/3 of the time (on average). The randomized player does better because its alternatives perform poorly, not from being smart itself. Moreover, by assumption, the opponent is an intelligence whom we cannot outsmart and who always knows everything about any method we use. There is no move we can make that does not have a possible countermove. By assumption, our own intelligence is entirely useless. The best we can do is to avoid all regularity that the enemy can exploit. In other words, we do best by minimizing the effectiveness of intelligence within the system-as-a-whole, because only the enemy’s intelligence is allowed to be effective. If we can’t be clever ourselves, we might as well think of ourselves as the environment and the enemy as the sole intelligence within that environment. By becoming the maximum-entropy environment for rock-paper-scissors, we render all intelligence useless, and do (by assumption) the best we can do.
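
A quick check of that claim: if the opponent plays Rock, Paper, and Scissors with arbitrary probabilities p_R, p_P, p_S (summing to 1), the uniformly random player wins exactly when its Paper meets Rock, its Scissors meets Paper, or its Rock meets Scissors, so

\Pr[\text{win}] = \tfrac{1}{3}\,p_R + \tfrac{1}{3}\,p_P + \tfrac{1}{3}\,p_S = \tfrac{1}{3}(p_R + p_P + p_S) = \tfrac{1}{3}

and the same computation gives a tie probability and a loss probability of exactly 1/3 each, no matter what the opponent does.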

When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy. (In a practical sense you’re just screwed.) Again, this is not because we’re extracting useful work from a noise source; it’s because the most powerful intelligence in the system is adversarial, and we’re minimizing the power that intelligence can exert in the system-as-a-whole. We don’t do better-than-average, we merely minimize the extent to which the adversarial intelligence produces an outcome that is worse-than-average (from our perspective).

Similarly, cryptographers have a legitimate interest in strong randomness generators because cryptographers are trying to minimize the effectiveness of an intelligent adversary. Certainly entropy can act as an antidote to intelligence.

Now back to the weighted majority algorithm. Blum (1996) remarks:

“Intuitively, the advantage of the randomized approach is that it dilutes the worst case. Previously, the worst case was that slightly more than half of the total weight predicted incorrectly, causing the algorithm to make a mistake and yet only reduce the total weight by 1/4. Now there is roughly a 50/50 chance that the [randomized] algorithm will predict correctly in this case, and more generally, the probability that the algorithm makes a mistake is tied to the amount that the weight is reduced.”

From the start, we did our analysis for an upper bound on the number of mistakes made. A global upper bound is no better than the worst individual case; thus, to set a global upper bound we must bound the worst individual case. In the worst case, our environment behaves like an adversarial superintelligence.

Indeed randomness can improve the worst-case scenario, if the worst-case environment is allowed to exploit “deterministic” moves but not “random” ones. It is like an environment that can decide to produce a red card whenever you bet on blue—unless you make the same bet using a “random” number generator instead of your creative intelligence.

Suppose we use a quantum measurement device to produce a string of random ones and zeroes; make two copies of this string; use one copy for the weighted majority algorithm’s random number generator; and give the other copy to an intelligent adversary who picks our samples. In other words, we let the weighted majority algorithm make exactly the same randomized predictions, produced by exactly the same generator, but we also let the “worst case” environment know what these randomized predictions will be. Then the improved upper bound of the randomized version is mathematically invalidated.
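
To make that concrete, here is a toy simulation of my own construction (it is not a setup taken from Blum’s paper): two constant experts, and an adversary that labels each instance in opposition to the predictor, either seeing the randomized prediction itself or seeing only the weights.

import math, random

def adversarial_run(T=1000, adversary_sees_coin=True, seed=0):
    """Toy setup: expert 0 always predicts +1, expert 1 always predicts -1.
    Each round the adversary picks the true label to oppose the predictor:
    either opposing the actual randomized prediction (it can see the coin)
    or merely opposing the weighted majority (it cannot)."""
    rng = random.Random(seed)
    weights = [1.0, 1.0]
    expert_preds = [+1, -1]
    alg_mistakes = 0
    expert_mistakes = [0, 0]
    for _ in range(T):
        w_pos, w_neg = weights
        # Randomized weighted majority: predict +1 in proportion to its weight.
        guess = +1 if rng.random() < w_pos / (w_pos + w_neg) else -1
        if adversary_sees_coin:
            label = -guess                           # mind-reading environment
        else:
            label = -1 if w_pos >= w_neg else +1     # only opposes the majority
        if guess != label:
            alg_mistakes += 1
        for i in (0, 1):
            if expert_preds[i] != label:
                expert_mistakes[i] += 1
                weights[i] *= 0.5
    m = min(expert_mistakes)
    bound = 1.39 * m + 2 * math.log(2)               # n = 2, so 2 ln(n) = 2 ln(2)
    print(f"adversary sees coin: {adversary_sees_coin};  M = {alg_mistakes};  "
          f"best expert m = {m};  1.39m + 2 ln(n) = {bound:.1f}")

adversarial_run(adversary_sees_coin=True)    # M blows through the 'improved' bound
adversarial_run(adversary_sees_coin=False)   # M stays under the bound

In the mind-reading condition the randomized algorithm errs on every single round, so M comes out near twice m and the 1.39m + 2 ln(n) bound fails; the deterministic version fares no better against such an adversary, but its looser 2.41*(m + log2(n)) bound is not violated.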

This shows that the improved upper bound proven for the randomized algorithm did not come from the randomized algorithm making systematically better predictions—doing superior cognitive work, being more intelligent—but from the fact that we arbitrarily declared that an intelligent adversary could read our mind in one case but not the other.

This is not just a minor quibble. It leads to the testable prediction that on real-world problems, where the environment is usually not an adversarial telepath, the unrandomized weighted-majority algorithm should do better than the randomized version. (Provided that some component experts outperform maximum entropy—are right more than 50% of the time on binary problems.)
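
That prediction is easy to spot-check in simulation. Another sketch with parameters of my own choosing: twenty-five experts, each independently right 70% of the time, on an i.i.d. stream of labels, which is about as far from an adversarial telepath as an environment gets.

import random

def run(T=500, n_experts=25, p_correct=0.7, randomized=False, seed=1):
    """Weighted majority on a non-adversarial, i.i.d. stream: every expert
    is independently correct with probability p_correct on each round."""
    rng = random.Random(seed)
    weights = [1.0] * n_experts
    mistakes = 0
    for _ in range(T):
        label = rng.choice([+1, -1])
        preds = [label if rng.random() < p_correct else -label
                 for _ in range(n_experts)]
        w_pos = sum(w for w, p in zip(weights, preds) if p == +1)
        w_neg = sum(w for w, p in zip(weights, preds) if p == -1)
        if randomized:
            guess = +1 if rng.random() < w_pos / (w_pos + w_neg) else -1
        else:
            guess = +1 if w_pos >= w_neg else -1
        if guess != label:
            mistakes += 1
        weights = [w * 0.5 if p != label else w for w, p in zip(weights, preds)]
    return mistakes

print("unrandomized mistakes:", run(randomized=False))
print("randomized mistakes:  ", run(randomized=True))

On typical runs of this toy setup the unrandomized version makes noticeably fewer mistakes: the randomized version’s chance of erring on any round equals the fraction of expert weight that is wrong (about 0.3 here), while the weighted majority vote is wrong far less often, at least until the weights concentrate on a single leading expert.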

Analyzing the worst-case scenario is standard practice in computational learning theory, but it makes the math do strange things. Moreover, the assumption of the worst case can be subtle; it is not always explicitly labeled. Consider the upper bound for the unrandomized weighted-majority algorithm. I did not use the term “worst case”—I merely wrote down some inequalities. In fact, Blum (1996), when initially introducing this bound, does not at first use the phrase “worst case”. The proof’s conclusion says only:

“Theorem 1. The number of mistakes M made by the Weighted Majority algorithm described above is never more than 2.41(m+lg n), where m is the number of mistakes made by the best expert so far.”

Key word: Never.

The worst-case assumption for the unrandomized algorithm was implicit in calculating the right-hand side of the inequality by assuming that, on each mistake, the total weight went down by a factor of only 1/4, and that the total weight never decreased after any successful prediction. This is the absolute worst possible performance the weighted-majority algorithm can give.

The assumption implicit in those innocent-looking equations is that the environment carefully maximizes the anguish of the predictor: The environment so cleverly chooses its data samples, that on each case where the weighted-majority algorithm is allowed to be successful, it shall receive not a single iota of useful feedback—not a single expert shall be wrong. And the weighted-majority algorithm will be made to err on sensory cases that produce the minimum possible useful feedback, maliciously fine-tuned to the exact current weighting of experts. We assume that the environment can predict every aspect of the predictor and exploit it—unless the predictor uses a “random” number generator which we arbitrarily declare to be unknowable to the adversary.

What strange assumptions are buried in that innocent little inequality,

<=

Moreover, the entire argument in favor of the randomized algorithm was theoretically suspect from the beginning, because it rested on non-transitive inequalities. If I prove an upper bound on the errors of algorithm X, and then prove a smaller upper bound on the errors of algorithm Y, it does not prove that in the real world Y will perform better than X. For example, I prove that X cannot do worse than 1000 errors, and that Y cannot do worse than 100 errors. Is Y a better algorithm? Not necessarily. Tomorrow I could find an improved proof which shows that X cannot do worse than 90 errors. And then in real life, both X and Y could make exactly 8 errors.

4 <= 1,000,000,000
9 <= 10

But that doesn’t mean that 4 > 9.

So the next time you see an author remark, “We can further improve this algorithm using randomization...” you may not know exactly where to find the oddity. If you’d run across the above-referenced example (or any number of others in the machine-learning literature), you might not have known how to deconstruct the randomized weighted-majority algorithm. If such a thing should happen to you, then I hope that I have given you grounds to suspect that an oddity exists somewhere, even if you cannot find it—just as you would suspect a machine that proposed to extract work from a heat bath, even if you couldn’t keep track of all the gears and pipes.

Nominull put it very compactly when he said that, barring an environment which changes based on the form of your algorithm apart from its output, “By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution.”
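
In symbols (my own restatement, under Nominull’s stated proviso that the environment reacts only to the algorithm’s outputs): a randomized algorithm A_rho is a probability mixture, with weights rho, over deterministic algorithms A_r, so its expected performance is the rho-weighted average of theirs, and some deterministic member of the mixture must do at least as well:

\mathbb{E}_{r \sim \rho}\big[\mathrm{Score}(A_r)\big] = \mathrm{Score}(A_\rho)
\quad\Longrightarrow\quad
\max_r \mathrm{Score}(A_r) \ge \mathrm{Score}(A_\rho)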

As I remarked in Perpetual Motion Beliefs:

I once knew a fellow who was convinced that his system of wheels and gears would produce reactionless thrust, and he had an Excel spreadsheet that would prove this—which of course he couldn’t show us because he was still developing the system. In classical mechanics, violating Conservation of Momentum is provably impossible. So any Excel spreadsheet calculated according to the rules of classical mechanics must necessarily show that no reactionless thrust exists—unless your machine is complicated enough that you have made a mistake in the calculations.

If you ever find yourself mathematically proving that you can do better by randomizing, I suggest that you suspect your calculations, or suspect your interpretation of your assumptions, before you celebrate your extraction of work from a heat bath.