Everyday Lessons from High-Dimensional Optimization
Suppose you’re designing a bridge. There’s a massive number of variables you can tweak: overall shape, relative positions and connectivity of components, even the dimensions and material of every beam and rivet. Even for a small footbridge, we’re talking about at least thousands of variables. For a large project, millions if not billions. Every one of those is a dimension over which we could, in principle, optimize.
Suppose you have a website, and you want to increase sign-ups. There’s a massive number of variables you can tweak: ad copy/photos/videos, spend distribution across ad channels, home page copy/photos/videos, button sizes and positions, page colors and styling… and every one of those is itself high-dimensional. Every word choice, every color, every position of every button, header, divider, sidebar, box, link… every one of those is a variable, adding up to thousands of dimensions over which to optimize.
Suppose you’re a startup founder planning your day—just a normal workday. There’s a massive number of variables you could tweak: dozens of people you could talk to, dozens of things you could talk to any of them about, and all the possible combinations of people and topics. There’s emails, code, designs and plans you could write. Every choice is a dimension over which you could optimize.
Point being: real-world optimization problems are usually pretty high dimensional.
Unfortunately, many techniques and intuitions which work well for low-dimensional optimization do not scale up well to higher dimensions. This post talks about some of these problems, and what they look like in real life.
Try It and See
Let’s start with a baseline: try something at random, see how well it works. If it doesn’t work, or doesn’t work better than your current best choice, then throw it out and try something else at random.
In low-dimensional problems, this isn’t a bad approach. Want to decide which brand of soap to use? Try it and see. There just aren’t that many possible choices, so you’ll pretty quickly try all of them and settle on the best.
On the other hand, we probably don’t want to design a bridge by creating a design completely at random, checking whether it works, then throwing it out and trying another design at random if it doesn’t.
In general, the number of possible states/designs/configurations in a space increases exponentially with the number of dimensions. A problem in two or three dimensions, with k choices for each variable, will only have k^2 or k^3 possibilities. A problem in a hundred thousand dimensions will have k^100000 - well in excess of the number of electrons in the universe, even if there’s only two choices per variable. The number of possible bridge designs is exponentially huge; selecting designs at random will not ever find the best design, or anything close to it.
Let’s look at a less obvious example.
From Studies on Slack:
Imagine a distant planet full of eyeless animals. Evolving eyes is hard: they need to evolve Eye Part 1, then Eye Part 2, then Eye Part 3, in that order. Each of these requires a separate series of rare mutations.
Here on Earth, scientists believe each of these mutations must have had its own benefits – in the land of the blind, the man with only Eye Part 1 is king. But on this hypothetical alien planet, there is no such luck. You need all three Eye Parts or they’re useless. Worse, each Eye Part is metabolically costly [...]
So these animals will only evolve eyes in conditions of relatively weak evolutionary pressure.
See the mistake?
When evolutionary pressure is low, we explore the space of organism-designs more-or-less at random. Now, if we imagine mutations just stepping left or right on a one-dimensional line, with the three eye parts as milestones along that line, then the picture above makes sense: as long as evolutionary pressure is low, we’ll explore out along the line, and eventually stumble on the eye.
But the real world doesn’t look like that. Evolution operates in a space with (at least) hundreds of thousands of dimensions—every codon in every gene can change, genes/chunks of genes can copy or delete, etc. The “No Eye” state doesn’t have one outgoing arrow, it has hundreds of thousands of outgoing arrows, and “Eye Part 1″ has hundreds of thousands of outgoing arrows, and so forth.
As we move further away from the starting state, the number of possible states increases exponentially. By the time we’re as-far-away as Whole Eye (which involves a whole lot of mutations), the number of possible states will outnumber the atoms in the universe. If evolution is uniformly-randomly exploring that space, it will not ever stumble on the Whole Eye—no matter how weak evolutionary pressure is.
Point is: the hard part of getting from “No Eye” to “Whole Eye” is not the fitness cost in the middle, it’s figuring out which direction to go in a space with hundreds of thousands of directions to choose from at every single step.
Conversely, the weak evolutionary benefits of partial-eyes matter, not because they grant a fitness gain in themselves, but because they bias evolution in the direction toward Whole Eyes. They tell us which way to go in a space with hundreds of thousands of directions.
This same reasoning applies to high-dimensional problem solving in general. Need to design a bridge? A webpage? A profitable company? A better mousetrap? The hard part isn’t obtaining enough slack to build the thing; the hard part is figuring out which direction to go in a space with hundreds of thousands of possible directions to choose from at every single step.
The E-Coli Optimization Algorithm
Here’s a simple but somewhat-less-brute-force optimization algorithm:
Pick a random direction
Move that way
Check whether the thing you’re optimizing is improving
If so, keep going
If not, go some other direction
This algorithm is exactly how e-coli follow chemical gradients to find food. It’s also how lots of real-world problem-solving proceeds: try something, if it helps then do more, if it doesn’t help then try something else. It’s babble and prune with weak heuristics for babble-generation.
This works great in low dimensions. Trying to find the source of a tasty smell or a loud noise? Walk in a random direction, if the smell/noise gets stronger then keep going, otherwise try another direction. There’s only two or three dimensions along which to explore, so you’ll often choose directions which point you close to the source. It works for e-coli, and it works for many other things too.
On the other hand, imagine designing a bridge by starting from a design, changing something at random, checking whether it helps, then keeping the change if it did help or throwing it out otherwise. You might eventually end up with a workable design, but even at best this would be a very slow way to design a bridge.
If you’re a programmer, you’ve probably seen how well it works to write/debug a program by making random changes, keeping them if they help, and throwing them away if not.
In general, e-coli optimization can work as long as there’s a path of local improvements to wherever we want to go. In principle, it’s a lot like gradient descent, except slower: gradient descent always chooses the “locally-best” direction, whereas e-coli just chooses a random direction.
How much slower is e-coli optimization compared to gradient descent? What’s the cost of experimenting with random directions, rather than going in the “best” direction? Well, imagine an inclined plane in n dimensions. There’s exactly one “downhill” direction (the gradient). The n-1 directions perpendicular to the gradient don’t go downhill at all; they’re all just flat. If we take a one-unit step along each of these directions, one after another, then we’ll take n steps, but only 1 step will be downhill. In other words, only ~O(1/n) of our travel-effort is useful; the rest is wasted.
In a two-dimensional space, that means ~50% of effort is wasted. In three dimensions, 70%. In a thousand-dimensional space, ~99.9% of effort is wasted.
Obviously this model is fairly simplistic, but the qualitative conclusion makes at least some sense. When an e-coli swims around looking for food in a three-dimensional puddle of water, it will randomly end up pointed in approximately the right direction reasonably often—there are only three independent directions to pick from, and one of them is right. On the other hand, if we’re designing a bridge and there’s one particular strut which fails under load, then we’d randomly try changing hundreds or thousands of other struts before we tried changing the one which is failing.
This points toward a potentially more efficient approach: if we’re designing a bridge and we can easily identify which strut fails first, then we should change that strut. Note, however, that this requires reasoning about the structure of the system—i.e. the struts—not just treating it as a black box. That’s the big upside of e-coli optimization: we can apply it to black boxes.
Beyond Black Boxes
For low-dimensional systems, try-it-and-see or e-coli optimization work reasonably well, at least in a big-O sense. But in high-dimensional problems, we need to start reasoning about the internal structure of the system in order to solve problems efficiently. We break up the high-dimensional system into a bunch of low-level components, and reason about their interactions with each other. Mathematically, this involves things like:
Causality: if I change a line in a program, how do the effects of that change propagate to the rest?
Constraints: the load on a particular bolt must be below X; what constraints does this imply for the load on adjacent components?
Backpropagation: to which parameter is the system’s overall performance most sensitive?
This sort of reasoning requires an expensive investment in understanding the internal gears of the system, but once we have that understanding, we can use it to achieve better big-O problem-solving performance. It allows us to identify which low-dimensional components matter most, and focus optimization effort on those. We can change the strut which is failing, or the line of code with a bug in it, rather than trying things at random. The more dimensions the system has, the larger the efficiency gain from a gearsy approach.