Worse Than Random

Previously in series: Lawful Uncertainty

You may have noticed a certain trend in recent posts: I’ve been arguing that randomness hath no power, that there is no beauty in entropy, nor yet strength from noise.

If one were to formalize the argument, it would probably run something like this: that if you define optimization as previously suggested, then sheer randomness will generate something that seems to have 12 bits of optimization, only by trying 4096 times; or 100 bits of optimization, only by trying 10^30 times.
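
To make the arithmetic concrete, here is a quick sketch (the function name is mine, purely illustrative): treating "k bits of optimization from blind sampling" as landing in the top 2^-k fraction of the space, the expected number of uniform random tries is 2^k.

    def expected_random_tries(bits):
        # Hitting the top 2**-bits fraction of the space by blind uniform sampling
        # is a geometric trial with success probability 2**-bits, so the expected
        # number of tries is 2**bits.
        return 2.0 ** bits

    print(expected_random_tries(12))             # 4096.0
    print(f"{expected_random_tries(100):.3g}")   # 1.27e+30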

This may not sound like a profound insight, since it is true by definition. But consider—how many comic books talk about “mutation” as if it were a source of power? Mutation is random. It’s the selection part, not the mutation part, that explains the trends of evolution.

Or you may have heard people talking about “emergence” as if it could explain complex, functional orders. People will say that the function of an ant colony emerges—as if, starting from ants that had been selected only to function as solitary individuals, the ants got together in a group for the first time and the ant colony popped right out. But ant colonies have been selected on as colonies by evolution. Optimization didn’t just magically happen when the ants came together.

And you may have heard that certain algorithms in Artificial Intelligence work better when we inject randomness into them.

Is that even possible? How can you extract useful work from entropy?

But it is possible in theory, since you can have things that are anti-optimized. Say, the average state has utility −10, but the current state has an unusually low utility of −100. So in this case, a random jump has an expected benefit. If you happen to be standing in the middle of a lava pit, running around at random is better than staying in the same place. (Not best, but better.)

A given AI algorithm can do better when randomness is injected, provided that some step of the unrandomized algorithm is doing worse than random.

Imagine that we’re trying to solve a pushbutton combination lock with 20 numbers and four steps: 160,000 possible combinations. And we try the following algorithm for opening it:

  1. Enter 0-0-0-0 into the lock.

  2. If the lock opens, return with SUCCESS.

  3. If the lock remains closed, go to step 1.

Obviously we can improve this algorithm by substituting “Enter a random combination” on the first step.
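
A minimal Python sketch of both procedures, with a hypothetical lock_opens interface and names of my own, just to make the comparison concrete:

    import random

    ALL_COMBOS = [(a, b, c, d) for a in range(20) for b in range(20)
                  for c in range(20) for d in range(20)]   # 160,000 combinations

    def stupid_lockpicker(lock_opens):
        # Steps 1-3 above: keep entering 0-0-0-0 forever.
        while not lock_opens((0, 0, 0, 0)):
            pass                     # never succeeds unless the combination is 0-0-0-0
        return (0, 0, 0, 0)

    def random_lockpicker(lock_opens, rng=None):
        # The improved version: step 1 enters a random combination instead.
        rng = rng or random.Random()
        while True:
            guess = rng.choice(ALL_COMBOS)
            if lock_opens(guess):    # ~160,000 tries expected, sampling with replacement
                return guess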

If we were to try and explain in words why this works, a description might go something like this: “When we first try 0-0-0-0 it has the same chance of working (so far as we know) as any other combination. But if it doesn’t work, it would be stupid to try it again, because now we know that 0-0-0-0 doesn’t work.”

The first key idea is that, after trying 0-0-0-0, we learn something—we acquire new knowledge, which should then affect how we plan to continue from there. This is knowledge, quite a different thing from randomness...

What exactly have we learned? We’ve learned that 0-0-0-0 doesn’t work; or to put it another way, given that 0-0-0-0 failed on the first try, the conditional probability of it working on the second try is negligible.

Consider your probability distribution over all the possible combinations: Your probability distribution starts out in a state of maximum entropy, with all 160,000 combinations having a 1/160,000 probability of working. After you try 0-0-0-0, you have a new probability distribution, which has slightly less entropy; 0-0-0-0 has an infinitesimal probability of working, and the remaining 159,999 possibilities each have a 1/159,999 probability of working. To try 0-0-0-0 again would now be stupid—the expected utility of trying 0-0-0-0 is less than average; the vast majority of potential actions now have higher expected utility than does 0-0-0-0. An algorithm that tries 0-0-0-0 again would do worse than random, and we can improve the algorithm by randomizing it.
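
The entropy bookkeeping is easy to check (a minimal sketch, treating the failed combination as ruled out entirely):

    import math

    entropy_before = math.log2(160_000)    # ≈ 17.28771 bits: uniform over all combinations
    entropy_after = math.log2(159_999)     # ≈ 17.28770 bits: uniform over the survivors

    print(entropy_before - entropy_after)  # ≈ 9e-06 bits learned from one failed try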

One may also consider an algorithm as a sequence of tries: The “unrandomized algorithm” describes the sequence of tries 0-0-0-0, 0-0-0-0, 0-0-0-0… and this sequence of tries is a special sequence that has far-below-average expected utility in the space of all possible sequences. Thus we can improve on this sequence by selecting a random sequence instead.

Or imagine that the combination changes every second. In this case, 0-0-0-0, 0-0-0-0 is just as good as the randomized algorithm—no better and no worse. What this shows you is that the supposedly “random” algorithm is “better” relative to a known regularity of the lock—that the combination is constant on each try. Or to be precise, the reason the random algorithm does predictably better than the stupid one is that the stupid algorithm is “stupid” relative to a known regularity of the lock.

In other words, in order to say that the random algorithm is an “improvement”, we must have used specific knowledge about the lock to realize that the unrandomized algorithm is worse-than-average. Having realized this, can we reflect further on our information, and take full advantage of our knowledge to do better-than-average?

The random lockpicker is still not optimal—it does not take full advantage of the knowledge we have acquired. A random algorithm might randomly try 0-0-0-0 again; it’s not likely, but it could happen. The longer the random algorithm runs, the more likely it is to try the same combination twice; and if the random algorithm is sufficiently unlucky, it might still fail to solve the lock after millions of tries. We can take full advantage of all our knowledge by using an algorithm that systematically tries 0-0-0-0, 0-0-0-1, 0-0-0-2… This algorithm is guaranteed not to repeat itself, and will find the solution in bounded time. Considering the algorithm as a sequence of tries, no other sequence in sequence-space is expected to do better, given our initial knowledge. (Any other nonrepeating sequence is equally good; but nonrepeating sequences are rare in the space of all possible sequences.)
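
A sketch of the systematic version, using the same hypothetical lock_opens interface as above:

    from itertools import product

    def systematic_lockpicker(lock_opens):
        # Enumerate 0-0-0-0, 0-0-0-1, ..., 19-19-19-19: no repeats, at most
        # 160,000 tries, and 80,000.5 tries expected if the true combination
        # was set uniformly at random.
        for guess in product(range(20), repeat=4):
            if lock_opens(guess):
                return guess
        raise RuntimeError("no combination opened the lock")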

A combination dial often has a tolerance of 2 in either direction. 20-45-35 will open a lock set to 22-44-33. In this case, the algorithm that tries 0-1-0, 0-2-0, et cetera, ends up being stupid again; a randomized algorithm will (usually) work better. But an algorithm that tries 0-5-0, 0-10-0, 0-10-5, will work better still.
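
A sketch of the interval-skipping idea, assuming for concreteness a three-number dial with 50 positions that wraps around, and the tolerance of 2 mentioned above:

    from itertools import product

    def skipping_lockpicker(lock_opens, positions=50, step=5):
        # With a tolerance of 2 in either direction, each tried value covers a
        # window of 5 positions, so stepping by 5 on a wrap-around dial covers
        # everything: 10 * 10 * 10 = 1,000 tries instead of 50**3 = 125,000.
        for guess in product(range(0, positions, step), repeat=3):
            if lock_opens(guess):
                return guess
        raise RuntimeError("no combination opened the lock")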

Sometimes it is too expensive to take advantage of all the knowledge that we could, in theory, acquire from previous tests. Moreover, a complete enumeration or interval-skipping algorithm would still end up being stupid. In this case, computer scientists often use a cheap pseudo-random algorithm, because the computational cost of using our knowledge exceeds the benefit to be gained from using it. This does not show the power of randomness, but, rather, the predictable stupidity of certain specific deterministic algorithms on that particular problem. Remember, the pseudo-random algorithm is also deterministic! But the deterministic pseudo-random algorithm doesn’t belong to the class of algorithms that are predictably stupid (do much worse than average).

There are also subtler ways in which adding noise can improve algorithms. For example, there are neural network training algorithms that work better if you simulate noise in the neurons. On this occasion it is especially tempting to say something like:

“Lo! When we make our artificial neurons noisy, just like biological neurons, they work better! Behold the healing life-force of entropy!”

What might actually be happening—for example—is that the network training algorithm, operating on noiseless neurons, would vastly overfit the data.

If you expose the noiseless network to the series of coinflips “HTTTHHTTH”… the training algorithm will say the equivalent of, “I bet this coin was specially designed to produce HTTTHHTTH every time it’s flipped!” instead of “This coin probably alternates randomly between heads and tails.” A hypothesis overfitted to the data does not generalize. On the other hand, when we add noise to the neurons and then try training them again, they can no longer fit the data precisely, so instead they settle into a simpler hypothesis like “This coin alternates randomly between heads and tails.” Note that this requires us—the programmers—to know in advance that probabilistic hypotheses are more likely to be true.

Or here’s another way of looking at it: A neural network algorithm typically looks at a set of training instances, tweaks the units and their connections based on the training instances, and in this way tries to “stamp” the experience onto itself. But the algorithms which do the stamping are often poorly understood, and it is possible to stamp too hard. If this mistake has already been made, then blurring the sensory information, or blurring the training algorithm, or blurring the units, can partially cancel out the “overlearning”.

Here’s a simplified example of a similar (but not identical) case: Imagine that the environment deals us a random mix of cards, 70% blue and 30% red. But instead of just predicting “blue” or “red”, we have to assign a quantitative probability to seeing blue—and the scoring rule for our performance is one that elicits an honest estimate; if the actual frequency is 70% blue cards, we do best by replying “70%”, not 60% or 80%. (“Proper scoring rule.”)

If you don’t know in advance the frequency of blue and red cards, one way to handle the problem would be to have a blue unit and a red unit, both wired to an output unit. The blue unit sends excitatory signals that make the output unit more likely to fire; the red unit inhibits its target—just like the excitatory and inhibitory synapses in the human brain! (Or an earthworm’s brain, for that matter...)

Each time we see a blue card in the training data, the training algorithm increases the strength of the (excitatory) synapse from the blue unit to the output unit; and each time we see a red card, the training algorithm strengthens the (inhibitory) synapse from the red unit to the output unit.

But wait! We haven’t said why the blue or red units might fire in the first place. So we’ll add two more excitatory units that spike randomly, one connected to the blue unit and one connected to the red unit. This simulates the natural background noise present in the human brain (or an earthworm’s brain).

Finally, the spiking frequency of the output unit becomes the predicted probability that the next card is blue.

As you can see—assuming you haven’t lost track of all the complexity—this neural network learns to predict whether blue cards or red cards are more common in the mix. Just like a human brain!

At least that’s the theory. However, when we boot up the neural network and give it a hundred training examples with 70 blue and 30 red cards, it ends up predicting a 90% probability that each card will be blue. Now, there are several things that could go wrong with a system this complex; but my own first impulse would be to guess that the training algorithm is too strongly adjusting the synaptic weight from the blue or red unit to the output unit on each training instance. The training algorithm needs to shift a little less far—alter the synaptic weights a little less.

But the programmer of the neural network comes up with a different hypothesis: the problem is that there’s no noise in the input. This is biologically unrealistic; real organisms do not have perfect vision or perfect information about the environment. So the programmer shuffles a few randomly generated blue and red cards (50% probability of each) into the training sequences. Then the programmer adjusts the noise level until the network finally ends up predicting blue with 70% probability. And it turns out that using almost the same noise level (with just a bit of further tweaking), the improved training algorithm can learn to assign the right probability to sequences of 80% blue or 60% blue cards.

Success! We have found the Right Amount of Noise.

Of course this success comes with certain disadvantages. For example, maybe the blue and red cards are predictable, in the sense of coming in a learnable sequence. Maybe the sequence is 7 blue cards followed by 3 red cards. If we mix noise into the sensory data, we may never notice this important regularity, or learn it only imperfectly… but that’s the price you pay for biological realism.

What’s really happening is that the training algorithm moves too far given its data, and adulterating the data with noise diminishes the impact of the data. The two errors partially cancel out, but at the cost of a nontrivial loss in what we could, in principle, have learned from the data. It would be better to adjust the training algorithm and keep the data clean.
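
Here is a back-of-the-envelope model of that diagnosis (a toy sketch with numbers and a sigmoid output of my own choosing, not the actual network above): each blue card adds a fixed step to the excitatory weight, each red card adds the same step to the inhibitory weight, and replacing a fraction of the training cards with 50/50 noise cards shrinks the expected weight difference, which is what drags the overconfident prediction back toward 70%.

    import math

    def predicted_blue(n_blue, n_red, step, noise_fraction=0.0):
        # Expected behavior of the toy learner: 50/50 noise cards cancel out on
        # average, so they simply scale the weight difference down.
        expected_weight_diff = step * (1 - noise_fraction) * (n_blue - n_red)
        return 1 / (1 + math.exp(-expected_weight_diff))    # sigmoid output unit

    print(predicted_blue(70, 30, step=0.055))                        # ≈ 0.90: steps too large
    print(predicted_blue(70, 30, step=0.055, noise_fraction=0.62))   # ≈ 0.70: noise cancels the overshoot
    print(predicted_blue(70, 30, step=0.021))                        # ≈ 0.70: smaller steps, clean data

The noisy version only matches the clean, smaller-step version on average; on any actual shuffled deck it is higher-variance, which is the nontrivial loss just described.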

This is an extremely oversimplified example, but it is not a total strawman. The scenario seems silly only because it is simplified to the point where you can clearly see what is going wrong. Make the neural network a lot more complicated, and the solution of adding noise to the inputs might sound a lot more plausible. While some neural network algorithms are well-understood mathematically, others are not. In particular, systems crafted with the aim of biological realism are often not well-understood.

But it is an inherently odd proposition that you can get a better picture of the environment by adding noise to your sensory information—by deliberately throwing away your sensory acuity. This can only degrade the mutual information between yourself and the environment. It can only diminish what in principle can be extracted from the data. And this is as true for every step of the computation as it is for the eyes themselves. The only way that adding random noise will help is if some step of the sensory processing is doing worse than random.

Now it is certainly possible to design an imperfect reasoner that only works in the presence of an accustomed noise level. Biological systems are unable to avoid noise, and therefore adapt to overcome noise. Subtract the noise, and mechanisms adapted to the presence of noise may do strange things.

Biological systems are often fragile under conditions that have no evolutionary precedent in the ancestral environment. If somehow the Earth’s gravity decreased, then birds might become unstable, lurching up in the air as their wings overcompensated for the now-decreased gravity. But this doesn’t mean that stronger gravity helps birds fly better. Gravity is still the difficult challenge that a bird’s wings work to overcome—even though birds are now adapted to gravity as an invariant.

What about hill-climbing, simulated annealing, or genetic algorithms? These AI algorithms are local search techniques that randomly investigate some of their nearest neighbors. If an investigated neighbor is superior to the current position, the algorithm jumps there. (Or sometimes probabilistically jumps to a neighbor with probability determined by the difference between neighbor goodness and current goodness.) Are these techniques drawing on the power of noise?

Local search algorithms take advantage of the regularity of the search space—that if you find a good point in the search space, its neighborhood of closely similar points is a likely place to search for a slightly better neighbor. And then this neighbor, in turn, is a likely place to search for a still better neighbor; and so on. To the extent this regularity of the search space breaks down, hill-climbing algorithms will perform poorly. If the neighbors of a good point are no more likely to be good than randomly selected points, then a hill-climbing algorithm simply won’t work.

But still, doesn’t a local search algorithm need to make random changes to the current point in order to generate neighbors for evaluation? Not necessarily; some local search algorithms systematically generate all possible neighbors, and select the best one. These greedy algorithms work fine for some problems, but on other problems it has been found that greedy local algorithms get stuck in local maxima. The next step up from greedy local algorithms, in terms of added randomness, is random-restart hill-climbing—as soon as we find a local maximum, we restart someplace random, and repeat this process a number of times. For our final solution, we return the best local maximum found when time runs out. Random-restart hill-climbing is surprisingly useful; it can easily solve some problem classes where any individual starting point is unlikely to lead to a global maximum or acceptable solution, but it is likely that at least one of a thousand individual starting points will lead to the global maximum or acceptable solution.
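
A compact sketch of the two search strategies, with a generic fitness function and neighbor generator supplied by the caller (names are mine):

    def hill_climb(start, neighbors, fitness):
        # Greedy local search: move to the best neighbor until nothing improves.
        current = start
        while True:
            best = max(neighbors(current), key=fitness, default=current)
            if fitness(best) <= fitness(current):
                return current                    # stuck at a local maximum
            current = best

    def random_restart_hill_climb(random_start, neighbors, fitness, restarts=1000):
        # Restart from a fresh random point each time a local maximum is found,
        # and return the best local maximum seen when the restart budget runs out.
        results = (hill_climb(random_start(), neighbors, fitness)
                   for _ in range(restarts))
        return max(results, key=fitness)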

The non-randomly-restarting, greedy, local-maximum-grabbing algorithm is “stupid” at the stage where it gets stuck in a local maximum. Once you find a local maximum, you know you’re not going to do better by greedy local search—so you may as well try something else with your time. Picking a random point and starting again is drastic, but it’s not as stupid as searching the neighbors of a particular local maximum over and over again. (Biological species often do get stuck in local optima. Evolution, being unintelligent, has no mind with which to “notice” when it is testing the same gene complexes over and over.)

Even more stupid is picking a particular starting point, and then evaluating its fitness over and over again, without even searching its neighbors. This is the lockpicker who goes on trying 0-0-0-0 forever.

Hill-climbing search is not so much a little bit randomized compared to the completely stupid lockpicker, as almost entirely nonrandomized compared to a completely ignorant searcher. We search only the local neighborhood, rather than selecting a random point from the entire state space. That probability distribution has been narrowed enormously, relative to the overall state space. This exploits the belief—the knowledge, if the belief is correct—that a good point probably has good neighbors.

You can imagine splitting a hill-climbing algorithm into components that are “deterministic” (or rather, knowledge-exploiting) and “randomized” (the leftover ignorance).

A programmer writing a probabilistic hill-climber will use some formula to assign probabilities to each neighbor, as a function of the neighbor’s fitness. For example, a neighbor with a fitness of 60 might have probability 80% of being selected, while other neighbors with fitnesses of 55, 52, and 40 might have selection probabilities of 10%, 9%, and 1%. The programmer writes a deterministic algorithm, a fixed formula, that produces these numbers: 80, 10, 9, and 1.

What about the actual job of making a random selection at these probabilities? Usually the programmer will hand that job off to someone else’s pseudo-random algorithm—most languages’ standard libraries contain a standard pseudo-random algorithm; there’s no need to write your own.
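
For instance, the deterministic part might be a softmax over neighbor fitnesses (a common choice, with an illustrative temperature of my own; the 80/10/9/1 above are just example numbers), and the random part is a one-line call into the standard library:

    import math
    import random

    def selection_probabilities(fitnesses, temperature=2.5):
        # The deterministic, knowledge-exploiting part: a fixed formula mapping
        # fitness values to a probability distribution over neighbors.
        weights = [math.exp(f / temperature) for f in fitnesses]
        total = sum(weights)
        return [w / total for w in weights]

    neighbors = ["A", "B", "C", "D"]
    probs = selection_probabilities([60, 55, 52, 40])      # roughly [0.85, 0.11, 0.03, 0.0003]
    chosen = random.choices(neighbors, weights=probs)[0]   # the leftover-ignorance part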

If the hill-climber doesn’t seem to work well, the programmer tweaks the deterministic part of the algorithm, the part that assigns these fixed numbers 80, 10, 9, and 1. The programmer does not say—“I bet these probabilities are right, but I need a source that’s even more random, like a thermal noise generator, instead of this merely pseudo-random algorithm that is ultimately deterministic!” The programmer does not go in search of better noise.

It is theoretically possible for a poorly designed “pseudo-random algorithm” to be stupid relative to the search space; for example, it might always jump in the same direction. But the “pseudo-random algorithm” has to be really shoddy for that to happen. You’re only likely to get stuck with that problem if you reinvent the wheel instead of using a standard, off-the-shelf solution. A decent pseudo-random algorithm works just as well as a thermal noise source on optimization problems. It is possible (though difficult) for an exceptionally poor noise source to be exceptionally stupid on the problem, but you cannot do exceptionally well by finding a noise source that is exceptionally random. The power comes from the knowledge—the deterministic formula that assigns a fixed probability distribution. It does not reside in the remaining ignorance.

And that’s why I always say that the power of natural selection comes from the selection part, not the mutation part.

As a general principle, on any problem for which you know that a particular unrandomized algorithm is unusually stupid—so that a randomized algorithm seems wiser—you should be able to use the same knowledge to produce a superior derandomized algorithm. If nothing else seems obvious, just avoid outputs that look “unrandomized”! If you know something is stupid, deliberately avoid it! (There are exceptions to this rule, but they have to do with defeating cryptographic adversaries—that is, preventing someone else’s intelligence from working on you. Certainly entropy can act as an antidote to intelligence!) And of course there are very common practical exceptions whenever the computational cost of using all our knowledge exceeds the marginal benefit...

Still you should find, as a general principle, that randomness hath no power: there is no beauty in entropy, nor strength from noise.