The Weighted Majority Algorithm
In the wider field of Artificial Intelligence, it is not universally agreed and acknowledged that noise hath no power. Indeed, the conventional view in machine learning is that randomized algorithms sometimes perform better than unrandomized counterparts and there is nothing peculiar about this. Thus, reading an ordinary paper in the AI literature, you may suddenly run across a remark: “There is also an improved version of this algorithm, which takes advantage of randomization...”
Now for myself I will be instantly suspicious; I shall inspect the math for reasons why the unrandomized algorithm is being somewhere stupid, or why the randomized algorithm has a hidden disadvantage. I will look for something peculiar enough to explain the peculiar circumstance of a randomized algorithm somehow doing better.
I am not completely alone in this view. E. T. Jaynes, I found, was of the same mind: “It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.” Apparently there’s now a small cottage industry in derandomizing algorithms. But so far as I know, it is not yet the majority, mainstream view that “we can improve this algorithm by randomizing it” is an extremely suspicious thing to say.
Let us now consider a specific example—a mainstream AI algorithm where there is, apparently, a mathematical proof that the randomized version performs better. By showing how subtle the gotcha can be, I hope to convince you that, even if you run across a case where the randomized algorithm is widely believed to perform better, and you can’t find the gotcha yourself, you should nonetheless trust that there’s a gotcha to be found.
For our particular example we shall examine the weighted majority algorithm, first introduced by Littlestone and Warmuth, but as a substantially more readable online reference I will use section 2 of Blum’s On-Line Algorithms in Machine Learning. (Blum has an easier-to-read, more concrete version of the proofs; and the unrandomized and randomized algorithms are presented side-by-side rather than far part.)
The weighted majority algorithm is an ensemble method—a way to combine the advice from several other algorithms or hypotheses, called “experts”. Experts strive to classify a set of environmental instances into “positive” and “negative” categories; each expert produces a prediction for each instance. For example, each expert might look at a photo of a forest, and try to determine whether or not there is a camouflaged tank in the forest. Expert predictions are binary, either “positive” or “negative”, with no probability attached. We do not assume that any of our experts is perfect, and we do not know which of our experts is the best. We also do not assume anything about the samples (they may not be independent and identically distributed).
The weighted majority algorithm initially assigns an equal weight of 1 to all experts. On each round, we ask all the experts for their predictions, and sum up the weights for each of the two possible predictions, “positive” or “negative”. We output the prediction that has the higher weight. Then, when we see the actual result, we multiply by 1⁄2 the weight of every expert that got the prediction wrong.
Suppose the total number of experts is n, and the best expert makes no more than m mistakes over some given sampling sequence. Then we can prove that the weighted majority algorithm makes a total number of mistakes M that is bounded by 2.41*(m + log2(n)). In other words, the weighted majority algorithm makes no more mistakes than the best expert, plus a factor of log n, times a constant factor of 2.41.
Proof (taken from Blum 1996; a similar proof appears in Littlestone and Warmuth 1989): The combined weight of all the experts at the start of the problem is W = n. If the weighted majority algorithm makes a mistake, at least half the total weight of experts predicted incorrectly, so the total weight is reduced by a factor of at least 1⁄4. Thus, after the weighted majority algorithm makes M mistakes, its total weight W has been reduced by at least (3/4)^M. In other words:
W ⇐ n*( (3/4)^M )
But if the best expert has made at most m mistakes, its weight is at least (1/2)^m. And since W includes the weight of all experts,
W >= (1/2)^m
(1/2)^m ⇐ W ⇐ n*( (3/4)^M )
(1/2)^m ⇐ n*( (3/4)^M )
-m ⇐ log2(n) + M*log2(3/4)
-m ⇐ log2(n) + M*-log2(4/3)
M*log2(4/3) ⇐ m + log2(n)
M ⇐ (1/log2(4/3))*(m + log2(n))
M ⇐ 2.41*(m + log2(n))
Blum then says that “we can achieve a better bound than that described above”, by randomizing the algorithm to predict “positive” or “negative” with probability proportional to the weight assigned each prediction. Thus, if 3⁄4 of the expert weight went to “positive”, we would predict “positive” with probability 75%, and “negative” with probability 25%.
An essentially similar proof, summing over the expected probability of a mistake on each round, will show that in this case:
M ⇐ 1.39m + 2 ln(n) (note: M is now an expectation)
Since the penalty applied to particular experts does not depend on the global prediction but only on the actual outcome, most of the proof proceeds as before. We have
W >= (1/2)^m
where again m is the best expert and W is the total weight of all experts.
We also have that W is the starting weight n times the product of (1 − 1⁄2 F_i), where F_i is the fraction of mistaken expert weight on round i:
W = n * Prod_i (1 − 1⁄2 F_i)
And since we predict with probability proportional to the expert weight, the expected number of mistakes is just the sum over F_i:
M = Sum_i F_i
(1/2)^m ⇐ n * Prod_i (1 − 1⁄2 F_i)
-m*ln(2) ⇐ ln(n) + Sum_i ln(1 − 1⁄2 F_i)
Sum_i -ln(1 − 1⁄2 F_i) ⇐ ln(n) + m*ln(2)
Sum_i (1/2 F_i) ⇐ ln(n) + m*ln(2) (because x < -ln(1 - x) )
Sum_i F_i ⇐ 2*(ln(n) + m*ln(2))
M ⇐ 1.39m + 2 ln(n)
Behold, we have done better by randomizing!
We should be especially suspicious that the randomized algorithm guesses with probability proportional to the expert weight assigned. This seems strongly reminiscent of betting with 70% probability on blue, when the environment is a random mix of 70% blue and 30% red cards. We know the best bet—and yet we only sometimes make this best bet, at other times betting on a condition we believe to be less probable.
Yet we thereby prove a smaller upper bound on the expected error. Is there an algebraic error in the second proof? Are we extracting useful work from a noise source? Is our knowledge harming us so much that we can do better through ignorance?
Maybe the unrandomized algorithm fails to take into account the Bayesian value of information, and hence fails to exhibit exploratory behavior? Maybe it doesn’t test unusual cases to see if they might have surprising results?
But, examining the assumptions, we see that the feedback we receive is fixed, regardless of the prediction’s output. Nor does the updating of the expert weights depend on the predictions we output. It doesn’t matter whether we substitute a completely random string of 1s and 0s as our actual output. We get back exactly the same data from the environment, and the weighted majority algorithm updates the expert weights in exactly the same way. So that can’t be the explanation.
Are the component experts doing worse than random, so that by randomizing our predictions, we can creep back up toward maximum entropy? But we didn’t assume anything about how often the component experts were right, or use the fact in the proofs. Therefore the proofs would carry even if we specified that all experts were right at least 60% of the time. It’s hard to see how randomizing our predictions could help, in that case—but the proofs still go through, unchanged.
So where’s the gotcha?
Maybe I’m about to tell you that I looked, and I couldn’t find the gotcha either. What would you believe, in that case? Would you think that the whole thesis on the futility of randomness was probably just wrong—a reasonable-sounding philosophical argument that simply wasn’t correct in practice? Would you despair of being able to follow the math, and give up and shrug, unable to decide who might be right?
We don’t always see the flaws right away, and that’s something to always remember.
In any case, I’m about to start explaining the gotcha. If you want to go back, read the paper, and look yourself, you should stop reading now...
There does exist a rare class of occasions where we want a source of “true” randomness, such as a quantum measurement device. For example, you are playing rock-paper-scissors against an opponent who is smarter than you are, and who knows exactly how you will be making your choices. In this condition it is wise to choose randomly, because any method your opponent can predict will do worse-than-average. Assuming that the enemy knows your source code has strange consequences: the action sequence 1, 2, 1, 3 is good when derived from a quantum noise generator, but bad when derived from any deterministic algorithm, even though it is the same action sequence. Still it is not a totally unrealistic situation. In real life, it is, in fact, a bad idea to play rock-paper-scissors using an algorithm your opponent can predict. So are we, in this situation, deriving optimization from noise?
The random rock-paper-scissors player does not play cleverly, racking up lots of points. It does not win more than 1⁄3 of the time, or lose less than 1⁄3 of the time (on average). The randomized player does better because its alternatives perform poorly, not from being smart itself. Moreover, by assumption, the opponent is an intelligence whom we cannot outsmart and who always knows everything about any method we use. There is no move we can make that does not have a possible countermove. By assumption, our own intelligence is entirely useless. The best we can do is to avoid all regularity that the enemy can exploit. In other words, we do best by minimizing the effectiveness of intelligence within the system-as-a-whole, because only the enemy’s intelligence is allowed to be effective. If we can’t be clever ourselves, we might as well think of ourselves as the environment and the enemy as the sole intelligence within that environment. By becoming the maximum-entropy environment for rock-paper-scissors, we render all intelligence useless, and do (by assumption) the best we can do.
When the environment is adversarial, smarter than you are, and informed about your methods, then in a theoretical sense it may be wise to have a quantum noise source handy. (In a practical sense you’re just screwed.) Again, this is not because we’re extracting useful work from a noise source; it’s because the most powerful intelligence in the system is adversarial, and we’re minimizing the power that intelligence can exert in the system-as-a-whole. We don’t do better-than-average, we merely minimize the extent to which the adversarial intelligence produces an outcome that is worse-than-average (from our perspective).
Similarly, cryptographers have a legitimate interest in strong randomness generators because cryptographers are trying to minimize the effectiveness of an intelligent adversary. Certainly entropy can act as an antidote to intelligence.
Now back to the weighted majority algorithm. Blum (1996) remarks:
“Intuitively, the advantage of the randomized approach is that it dilutes the worst case. Previously, the worst case was that slightly more than half of the total weight predicted incorrectly, causing the algorithm to make a mistake and yet only reduce the total weight by 1⁄4. Now there is roughly a 50⁄50 chance that the [randomized] algorithm will predict correctly in this case, and more generally, the probability that the algorithm makes a mistake is tied to the amount that the weight is reduced.”
From the start, we did our analysis for an upper bound on the number of mistakes made. A global upper bound is no better than the worst individual case; thus, to set a global upper bound we must bound the worst individual case. In the worst case, our environment behaves like an adversarial superintelligence.
Indeed randomness can improve the worst-case scenario, if the worst-case environment is allowed to exploit “deterministic” moves but not “random” ones. It is like an environment that can decide to produce a red card whenever you bet on blue—unless you make the same bet using a “random” number generator instead of your creative intelligence.
Suppose we use a quantum measurement device to produce a string of random ones and zeroes; make two copies of this string; use one copy for the weighted majority algorithm’s random number generator; and give another copy to an intelligent adversary who picks our samples. In other words, we let the weighted majority algorithm make exactly the same randomized predictions, produced by exactly the same generator, but we also let the “worst case” environment know what these randomized predictions will be. Then the improved upper bound of the randomized version is mathematically invalidated.
This shows that the improved upper bound proven for the randomized algorithm did not come from the randomized algorithm making systematically better predictions—doing superior cognitive work, being more intelligent—but because we arbitrarily declared that an intelligent adversary could read our mind in one case but not the other.
This is not just a minor quibble. It leads to the testable prediction that on real-world problems, where the environment is usually not an adversarial telepath, the unrandomized weighted-majority algorithm should do better than the randomized version. (Provided that some component experts outperform maximum entropy—are right more than 50% of the time on binary problems.)
Analyzing the worst-case scenario is standard practice in computational learning theory, but it makes the math do strange things. Moreover, the assumption of the worst case can be subtle; it is not always explicitly labeled. Consider the upper bound for the unrandomized weighted-majority algorithm. I did not use the term “worst case”—I merely wrote down some inequalities. In fact, Blum (1996), when initially introducing this bound, does not at first use the phrase “worst case”. The proof’s conclusion says only:
“Theorem 1. The number of mistakes M made by the Weighted Majority algorithm described above is never more than 2.41(m+lg n), where m is the number of mistakes made by the best expert so far.”
Key word: Never.
The worst-case assumption for the unrandomized algorithm was implicit in calculating the right-hand-side of the inequality by assuming that, on each mistake, the total weight went down by a factor of 1⁄4, and the total weight never decreased after any successful prediction. This is the absolute worst possible performance the weighted-majority algorithm can give.
The assumption implicit in those innocent-looking equations is that the environment carefully maximizes the anguish of the predictor: The environment so cleverly chooses its data samples, that on each case where the weighted-majority algorithm is allowed to be successful, it shall receive not a single iota of useful feedback—not a single expert shall be wrong. And the weighted-majority algorithm will be made to err on sensory cases that produce the minimum possible useful feedback, maliciously fine-tuned to the exact current weighting of experts. We assume that the environment can predict every aspect of the predictor and exploit it—unless the predictor uses a “random” number generator which we arbitrarily declare to be unknowable to the adversary.
What strange assumptions are buried in that innocent little inequality,
Moreover, the entire argument in favor of the randomized algorithm was theoretically suspect from the beginning, because it rested on non-transitive inequalities. If I prove an upper bound on the errors of algorithm X, and then prove a smaller upper bound on the errors of algorithm Y, it does not prove that in the real world Y will perform better than X. For example, I prove that X cannot do worse than 1000 errors, and that Y cannot do worse than 100 errors. Is Y a better algorithm? Not necessarily. Tomorrow I could find an improved proof which shows that X cannot do worse than 90 errors. And then in real life, both X and Y could make exactly 8 errors.
4 ⇐ 1,000,000,000
9 ⇐ 10
But that doesn’t mean that 4 > 9.
So the next time you see an author remark, “We can further improve this algorithm using randomization...” you may not know exactly where to find the oddity. If you’d run across the above-referenced example (or any number of others in the machine-learning literature), you might not have known how to deconstruct the randomized weighted-majority algorithm. If such a thing should happen to you, then I hope that I have given you grounds to suspect that an oddity exists somewhere, even if you cannot find it—just as you would suspect a machine that proposed to extract work from a heat bath, even if you couldn’t keep track of all the gears and pipes.
Nominull put it very compactly when he said that, barring an environment which changes based on the form of your algorithm apart from its output, “By adding randomness to your algorithm, you spread its behaviors out over a particular distribution, and there must be at least one point in that distribution whose expected value is at least as high as the average expected value of the distribution.”
As I remarked in Perpetual Motion Beliefs:
I once knew a fellow who was convinced that his system of wheels and gears would produce reactionless thrust, and he had an Excel spreadsheet that would prove this—which of course he couldn’t show us because he was still developing the system. In classical mechanics, violating Conservation of Momentum is provably impossible. So any Excel spreadsheet calculated according to the rules of classical mechanics must necessarily show that no reactionless thrust exists—unless your machine is complicated enough that you have made a mistake in the calculations.
If you ever find yourself mathematically proving that you can do better by randomizing, I suggest that you suspect your calculations, or suspect your interpretation of your assumptions, before you celebrate your extraction of work from a heat bath.