Theory and Data as Constraints

There’s a widely known legend about statistician Abraham Wald’s work on planes during WWII (the veracity of the legend is examined here). As the story goes, the military collected data on planes coming back from missions, marking the location of any bullet holes. They soon had statistics showing how many had been hit in the engine, the fuel system, etc. Based on this data, commanders wanted to add extra armor to reinforce the areas which were most often hit.

Wald, however, suggested adding extra armor to the places which were least often hit. His reasoning: planes hit in those areas were the planes which didn’t come back.
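To make the selection effect concrete, here’s a minimal simulation sketch; the section names and probabilities are made up for illustration, not taken from the legend. Every plane takes a hit in a random section, but hits to some sections are more likely to down the plane, so the returning planes systematically under-represent those sections.

```python
import random
from collections import Counter

random.seed(0)

# Made-up sections and loss probabilities, purely for illustration.
SECTIONS = ["engine", "fuel system", "fuselage", "tail"]
P_DOWNED = {"engine": 0.6, "fuel system": 0.3, "fuselage": 0.1, "tail": 0.15}

all_hits = Counter()        # where planes were actually hit
returning_hits = Counter()  # what gets recorded on returning planes

for _ in range(10_000):
    section = random.choice(SECTIONS)        # each plane takes one hit, uniformly at random
    all_hits[section] += 1
    if random.random() > P_DOWNED[section]:  # the plane only comes back if the hit wasn't fatal
        returning_hits[section] += 1

for s in SECTIONS:
    print(f"{s:12}  hits overall: {all_hits[s]:5}   holes on returning planes: {returning_hits[s]:5}")
# The engine is hit as often as anything else, yet shows the fewest holes on
# returning planes -- the raw data alone points at exactly the wrong place to armor.
```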

The moral of the tale: useful insights do not come from data alone. Background knowledge, interpretation, and model-building—all of which we’ll bundle into the word “theory”—are necessary elements as well. In the framework of this sequence: theory and data are both constraints on the production of useful insights. Immediate question: how taut is each of those constraints? What’s the limiting factor?

Let’s think about the tautness of each constraint in the Wald legend. How expensive was the data, and how expensive was the theory? In this case, the data was presumably far more expensive: it required people on the ground examining hundreds of returning airplanes per mission, collecting all of that hand-written data, adding it all up by hand, and relaying it via telephone or radio or mail to the Statistical Research Group in New York City (where Wald worked). The theory, on the other hand, probably took Wald a day at most. Of course there would also be some overhead on both sides—Wald needed to write up the theory in a manner convincing to military commanders, and military commanders needed to set up the whole data-collection project—but there again, the data would have been far more expensive than the theory.

Conclusion: the data constraint was much more taut than the theory constraint. Theory was abundant relative to the scarcity of data.

… at least during WWII.

The Internet

More recently, you may have heard about a fancy new technology called “the internet” which makes it really, really cheap for people to share data. Given such a technology shift, we’d expect the data constraint to become slack much more often, leaving the theory constraint taut.

What does that look like? In academia, biology is a great example. Biologists today have massive piles of data—genomics, transcriptomics, proteomics, metabolomics, lots of omics, on a wide variety of organisms, tissues, cell types, etc. Yet our ability to turn all that data into engineered organisms or cures for disease remains underwhelming. I would bet that all the data needed in principle to, say, find a cure for Alzheimer’s is already available online—if only we knew how to effectively leverage it. We have all this data, but people don’t really understand how to use it. That’s what it looks like when the data constraint is slack and the theory constraint is taut.

Of course, biology isn’t the only field which looks like this. Economics is another—we have huge piles of data on prices, consumption, trade, taxes, and so forth, yet we have limited ability to turn it all into useful economic insight. We don’t even know what useful things to do with structured data like databases of prices and consumption, let alone unstructured data like the whole database of US federal regulations. Economists run studies on a small handful of datasets, or on very coarse aggregates, or sometimes just throw everything into a giant neural network to see what happens. We don’t yet have quantitative, gearsy models capable of absorbing and using a wide variety of data all at once.

Then there’s the entire field of data science. A decade after the internet took off, data science appeared more or less spontaneously as a response to companies with giant piles of data and no idea what to do with it all. That’s the sort of thing you expect to happen when a technology shift suddenly relaxes a previously taut constraint: a complementary constraint becomes taut, and an industry appears to service it. There’s a reason the Wald legend is popular as an analogy for data science in general: it’s a perfect example of the value-add data scientists provide, the kind of “theory” which companies need in order to make their data useful.

Low-Hanging Fruit?

Our society has not had much time to adjust to the internet. We now live in a world where theory is scarce relative to data in far more places, but the world is still adjusting to this new reality. Where might we expect low-hanging fruit? What other constraints are likely to become taut/slack?

One general area to look for low-hanging fruit: comprehensive overviews of possibly-useful topics. Track down a majority of all the physical capital assets of public companies, or skim titles/abstracts from a few years of archives of scientific journals, or read the Wikipedia pages on every country on Earth. This sort of exercise would have been very expensive before the internet, and society hasn’t had a lot of time to experiment with it, so there’s likely to be low-hanging fruit in the area. Data is now very cheap, so consume a lot of it and see what happens.
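As one concrete (and cheap) version of that exercise, here’s a minimal sketch using Wikipedia’s public REST summary endpoint; the short country list is just a hypothetical starting point, and the real exercise would loop over every country and skim the results.

```python
import requests

# Hypothetical starter list; the real exercise would cover every country.
countries = ["Estonia", "Botswana", "Uruguay", "Laos"]

for name in countries:
    # Wikipedia's public REST endpoint returns a JSON summary of a page.
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{name}"
    resp = requests.get(url, headers={"User-Agent": "overview-skim/0.1"}, timeout=10)
    resp.raise_for_status()
    print(f"== {name} ==")
    print(resp.json().get("extract", ""), "\n")
```

The point isn’t this particular script; it’s that pulling a broad overview like this now costs minutes rather than a trip to the library.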

Another angle would be to practice building gears-level models—especially with an eye toward integrating a wide variety of data sources into the model-building process. (See here and here for why we want gears.) Paper-Reading for Gears has some general tips on leveraging scientific papers toward this end. In terms of relevant general background knowledge, Pearl’s Book of Why is probably a useful resource for building/testing (one type of) gearsy model statistically. But the most important tool here is the general habit of thinking in gears and asking the sort of questions which yield gears; I’d be interested to hear people’s suggestions for ways to learn those habits.
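As a toy illustration of the kind of model Pearl’s framework covers, here’s a minimal sketch of a structural causal model with a confounder; the gears and the numbers are made up for illustration. Knowing the gears (that Z drives both X and Y) lets you recover the true effect of X on Y via back-door adjustment, where the raw observational difference misleads.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy structural causal model (made-up gears, for illustration only):
#   confounder Z -> treatment X, confounder Z -> outcome Y, and X -> Y
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * z)            # Z makes X more likely
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)  # Y depends on both X and Z

# Naive "look at the data" answer: P(Y=1 | X=1) - P(Y=1 | X=0)
observational = y[x == 1].mean() - y[x == 0].mean()

# Gears-level answer via the back-door adjustment: average the X effect within
# each stratum of Z, weighted by P(Z). This recovers the true coefficient 0.3.
adjusted = sum(
    (y[(x == 1) & (z == v)].mean() - y[(x == 0) & (z == v)].mean()) * (z == v).mean()
    for v in (0, 1)
)

print(f"observational difference: {observational:.3f}")  # inflated by confounding (~0.54)
print(f"back-door adjusted effect: {adjusted:.3f}")      # close to the true 0.3
```

The gearsy part is not the arithmetic; it’s the assumed structure (what causes what), which is exactly the piece that no pile of observational data hands you for free.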

Finally, there’s one key constraint which I expect to become much more taut as theory becomes a more limiting factor: the inability to outsource expertise. In the Wald legend, military commanders wanted to add more armor to the places where returning planes had been shot most often. This seems so intuitively obvious that there isn’t any reason to consult an expert about it. The commanders didn’t have the background knowledge to realize they were making an important mistake, so they didn’t have any way to know that they needed an expert. Fortunately, in Wald’s case, consulting the expert was relatively cheap, so there wasn’t much reason not to do it. But what happens when consulting the expert is expensive? People won’t consult an expert unless there’s an obvious need for it—which means people will repeatedly be hit in the face by unknown unknowns.

This problem amplifies the value of comprehensive overviews and gears-level modeling skills. It’s not just that theory is scarce; it’s that theory is scarce and you can’t reliably outsource it. Money cannot reliably buy good theory, unless you already have some ability to recognize good theory. How can we recognize good theory, across a wide variety of applications, especially when there isn’t a clear objective metric for success? First, by studying how to build good theory in general—i.e. gears-level models. Second, by absorbing lots of background knowledge across lots of different areas, so we reduce unknown unknowns and gain more possible metrics by which to recognize experts.