Everyday Lessons from High-Dimensional Optimization

Suppose you're designing a bridge. There's a massive number of variables you can tweak: overall shape, relative positions and connectivity of components, even the dimensions and material of every beam and rivet. Even for a small footbridge, we're talking about at least thousands of variables. For a large project, millions if not billions. Every one of those is a dimension over which we could, in principle, optimize.

Suppose you have a website, and you want to increase sign-ups. There's a massive number of variables you can tweak: ad copy/photos/videos, spend distribution across ad channels, home page copy/photos/videos, button sizes and positions, page colors and styling… and every one of those is itself high-dimensional. Every word choice, every color, every position of every button, header, divider, sidebar, box, link… every one of those is a variable, adding up to thousands of dimensions over which to optimize.

Suppose you're a startup founder planning your day—just a normal workday. There's a massive number of variables you could tweak: dozens of people you could talk to, dozens of things you could talk to any of them about, and all the possible combinations of people and topics. There are emails, code, designs, and plans you could write. Every choice is a dimension over which you could optimize.

Point being: real-world optimization problems are usually pretty high-dimensional.

Unfortunately, many techniques and intuitions which work well for low-dimensional optimization do not scale up well to higher dimensions. This post talks about some of these problems, and what they look like in real life.

Try It and See

Let's start with a baseline: try something at random, see how well it works. If it doesn't work, or doesn't work better than your current best choice, then throw it out and try something else at random.

In low-dimensional problems, this isn't a bad approach. Want to decide which brand of soap to use? Try it and see. There just aren't that many possible choices, so you'll pretty quickly try all of them and settle on the best.

On the other hand, we probably don't want to design a bridge by creating a design completely at random, checking whether it works, then throwing it out and trying another design at random if it doesn't.

In general, the number of possible states/designs/configurations in a space increases exponentially with the number of dimensions. A problem in two or three dimensions, with k choices for each variable, will only have k^2 or k^3 possibilities. A problem in a hundred thousand dimensions will have k^100000 possibilities, well in excess of the number of electrons in the universe even if there are only two choices per variable. The number of possible bridge designs is exponentially huge; selecting designs at random will never find the best design, or anything close to it.
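
For concreteness, here's a minimal sketch of this try-it-and-see baseline, on a toy problem where each "design" is just a vector of binary choices. All names and numbers here are illustrative, not anything from a real bridge-design problem:

```python
import random

def try_it_and_see(score, random_design, num_tries):
    """Pure random search: generate designs at random, keep the best seen so far."""
    best_design, best_score = None, float("-inf")
    for _ in range(num_tries):
        design = random_design()
        s = score(design)
        if s > best_score:
            best_design, best_score = design, s
    return best_design, best_score

# Toy problem: a design is a vector of n binary choices, and the score is the
# number of choices matching some (unknown) target design.
n = 20  # even at 20 dimensions there are 2**20 (about a million) designs
target = [random.randint(0, 1) for _ in range(n)]
score = lambda design: sum(d == t for d, t in zip(design, target))
random_design = lambda: [random.randint(0, 1) for _ in range(n)]

best, best_s = try_it_and_see(score, random_design, num_tries=1000)
print(f"best score after 1000 random tries: {best_s}/{n}")
```

Even here, a thousand random tries typically lands near-but-not-on the target; bump n up to a hundred thousand and no feasible number of random tries gets anywhere close.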

Let's look at a less obvious example.

From Studies on Slack:

Imagine a distant planet full of eyeless animals. Evolving eyes is hard: they need to evolve Eye Part 1, then Eye Part 2, then Eye Part 3, in that order. Each of these requires a separate series of rare mutations.

Here on Earth, scientists believe each of these mutations must have had its own benefits – in the land of the blind, the man with only Eye Part 1 is king. But on this hypothetical alien planet, there is no such luck. You need all three Eye Parts or they're useless. Worse, each Eye Part is metabolically costly [...]

So these animals will only evolve eyes in conditions of relatively weak evolutionary pressure.

See the mistake?

When evolutionary pressure is low, we explore the space of organism-designs more-or-less at random. Now, if we imagine mutations just stepping left or right on a one-dimensional line, with the three eye parts as milestones along that line, then the picture above makes sense: as long as evolutionary pressure is low, we'll explore out along the line, and eventually stumble on the eye.

But the real world doesn't look like that. Evolution operates in a space with (at least) hundreds of thousands of dimensions—every codon in every gene can change, genes/chunks of genes can copy or delete, etc. The “No Eye” state doesn't have one outgoing arrow; it has hundreds of thousands of outgoing arrows, and “Eye Part 1” has hundreds of thousands of outgoing arrows, and so forth.

As we move further away from the starting state, the number of possible states increases exponentially. By the time we're as far away as Whole Eye (which involves a whole lot of mutations), the number of possible states will outnumber the atoms in the universe. If evolution is uniformly-randomly exploring that space, it will never stumble on the Whole Eye—no matter how weak evolutionary pressure is.

Point is: the hard part of getting from “No Eye” to “Whole Eye” is not the fitness cost in the middle; it's figuring out which direction to go in a space with hundreds of thousands of directions to choose from at every single step.

Conversely, the weak evolutionary benefits of partial-eyes matter, not because they grant a fitness gain in themselves, but because they bias evolution in the direction toward Whole Eyes. They tell us which way to go in a space with hundreds of thousands of directions.

This same reasoning applies to high-dimensional problem solving in general. Need to design a bridge? A webpage? A profitable company? A better mousetrap? The hard part isn't obtaining enough slack to build the thing; the hard part is figuring out which direction to go in a space with hundreds of thousands of possible directions to choose from at every single step.

The E. Coli Optimization Algorithm

Here's a simple but somewhat-less-brute-force optimization algorithm:

  • Pick a random direction

  • Move that way

  • Check whether the thing you're optimizing is improving

  • If so, keep going

  • If not, go some other direction

This algorithm is exactly how E. coli follow chemical gradients to find food. It's also how lots of real-world problem-solving proceeds: try something, if it helps then do more, if it doesn't help then try something else. It's babble and prune with weak heuristics for babble-generation.
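
Here's one way the loop above might look in code: a minimal sketch, assuming we're minimizing a black-box function f over a list of coordinates, with a placeholder step size and a toy objective:

```python
import random

def random_unit_vector(n):
    """A uniformly random direction in n dimensions."""
    v = [random.gauss(0, 1) for _ in range(n)]
    norm = sum(vi * vi for vi in v) ** 0.5
    return [vi / norm for vi in v]

def e_coli_minimize(f, x, step_size=0.1, num_steps=10_000):
    """Run-and-tumble: pick a random direction, keep going while f improves,
    otherwise tumble to a fresh random direction."""
    direction = random_unit_vector(len(x))
    best = f(x)
    for _ in range(num_steps):
        candidate = [xi + step_size * di for xi, di in zip(x, direction)]
        value = f(candidate)
        if value < best:                            # improvement: keep going this way
            x, best = candidate, value
        else:                                       # no improvement: try another direction
            direction = random_unit_vector(len(x))
    return x, best

# Toy example: find the bottom of a bowl. Works fine in 3 dimensions,
# but progress per step shrinks rapidly as the dimension grows.
bowl = lambda x: sum(xi * xi for xi in x)
x_final, f_final = e_coli_minimize(bowl, [5.0] * 3)
print(f_final)
```

The only thing the loop needs from the problem is the ability to evaluate f, which is exactly why this kind of black-box search is so widely applicable, and also why it wastes so much effort in high dimensions.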

This works great in low dimensions. Trying to find the source of a tasty smell or a loud noise? Walk in a random direction; if the smell/noise gets stronger then keep going, otherwise try another direction. There are only two or three dimensions along which to explore, so you'll often choose directions which point you close to the source. It works for E. coli, and it works for many other things too.

On the other hand, imagine designing a bridge by starting from a design, changing something at random, checking whether it helps, then keeping the change if it did help or throwing it out otherwise. You might eventually end up with a workable design, but even at best this would be a very slow way to design a bridge.

If you're a programmer, you've probably seen how well it works to write/debug a program by making random changes, keeping them if they help, and throwing them away if not.

In general, E. coli optimization can work as long as there's a path of local improvements to wherever we want to go. In principle, it's a lot like gradient descent, except slower: gradient descent always chooses the “locally-best” direction, whereas E. coli just chooses a random direction.

How much slower is E. coli optimization compared to gradient descent? What's the cost of experimenting with random directions, rather than going in the “best” direction? Well, imagine an inclined plane in n dimensions. There's exactly one “downhill” direction (the gradient). The n-1 directions perpendicular to the gradient don't go downhill at all; they're all just flat. If we take a one-unit step along each of these directions, one after another, then we'll take n steps, but only 1 step will be downhill. In other words, only ~O(1/n) of our travel-effort is useful; the rest is wasted.

In a two-dimensional space, that means ~50% of effort is wasted. In three dimensions, ~67%. In a thousand-dimensional space, ~99.9% of effort is wasted.
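
As a sanity check on that arithmetic, here's a toy simulation of the inclined-plane model, under the simplifying assumption that each random step goes along one of the n coordinate directions and only one of those directions goes downhill:

```python
import random

def wasted_fraction(n, num_steps=100_000):
    """Inclined plane in n dimensions: of the n coordinate directions, only one
    goes downhill. Take steps in uniformly random coordinate directions and
    measure the fraction of steps that made no progress."""
    downhill = sum(1 for _ in range(num_steps) if random.randrange(n) == 0)
    return 1 - downhill / num_steps

for n in (2, 3, 1000):
    print(f"{n:4d} dimensions: ~{100 * wasted_fraction(n):.1f}% of effort wasted")
# Prints roughly 50%, 67%, and 99.9% -- only ~1/n of the effort is useful.
```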

Obviously this model is fairly simplistic, but the qualitative conclusion makes at least some sense. When an E. coli swims around looking for food in a three-dimensional puddle of water, it will randomly end up pointed in approximately the right direction reasonably often—there are only three independent directions to pick from, and one of them is right. On the other hand, if we're designing a bridge and there's one particular strut which fails under load, then we'd randomly try changing hundreds or thousands of other struts before we tried changing the one which is failing.

This points toward a potentially more efficient approach: if we're designing a bridge and we can easily identify which strut fails first, then we should change that strut. Note, however, that this requires reasoning about the structure of the system—i.e. the struts—not just treating it as a black box. That's the big upside of E. coli optimization: we can apply it to black boxes.

Beyond Black Boxes

For low-dimensional systems, try-it-and-see or E. coli optimization work reasonably well, at least in a big-O sense. But in high-dimensional problems, we need to start reasoning about the internal structure of the system in order to solve problems efficiently. We break up the high-dimensional system into a bunch of low-level components, and reason about their interactions with each other. Mathematically, this involves things like:

  • Causality: if I change a line in a program, how do the effects of that change propagate to the rest?

  • Constraints: the load on a particular bolt must be below X; what constraints does this imply for the load on adjacent components?

  • Backpropagation: to which parameter is the system's overall performance most sensitive?

This sort of reasoning requires an expensive investment in understanding the internal gears of the system, but once we have that understanding, we can use it to achieve better big-O problem-solving performance. It allows us to identify which low-dimensional components matter most, and focus optimization effort on those. We can change the strut which is failing, or the line of code with a bug in it, rather than trying things at random. The more dimensions the system has, the larger the efficiency gain from a gearsy approach.
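
To make the "which component matters most" idea concrete, here's a minimal sketch using finite-difference sensitivities in place of true backpropagation; the toy "bridge", its performance function, and all the numbers are hypothetical:

```python
def most_sensitive_parameter(f, params, eps=1e-6):
    """Finite-difference sensitivity analysis: estimate how strongly the overall
    performance f(params) responds to a small change in each parameter, and
    return the index of the parameter with the largest effect."""
    base = f(params)
    sensitivities = [
        abs(f(params[:i] + [params[i] + eps] + params[i + 1:]) - base) / eps
        for i in range(len(params))
    ]
    return max(range(len(params)), key=lambda i: sensitivities[i])

# Toy "bridge": overall performance is limited by the weakest strut
# (hypothetical numbers, purely for illustration).
strut_strengths = [5.0, 7.0, 1.5, 6.0]
performance = lambda struts: min(struts)  # only as strong as the weakest strut
print(most_sensitive_parameter(performance, strut_strengths))  # -> 2, the failing strut
```

Once the failing strut (or buggy line) is identified, optimization effort can go there directly instead of being spread across thousands of irrelevant dimensions.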