Small Data

Probabilistic reasoning starts with priors and then updates them based on evidence. Artificial neural networks take this to the extreme. You start with deliberately weak priors, then update them with a tremendous quantity of data. I call this “big data”.

In this article, I use “big data” to mean the opposite of “small data”: situations with so much training data that you can get away with weak priors. Autonomous cars are an example of big data. Financial derivatives trading is an example of small data.

The most powerful recent advances in machine learning, such as neural networks, all use big data. Machine learning is good at fields where data is plentiful, such as identifying photos of cats, or where data can be cheaply manufactured, such as playing videogames. “Plentiful data” is a relative term. Specifically, it’s a measurement of the quantity of training data relative to the size (complexity) of the search space.

Do you see the problem?

Physical reality is an upper bound on data collection. Even if “data” is just a number stored momentarily in a CPU register, there is a hard physical limit to how much we can process. In particular, our data will never scale faster than $O(d^4)$, where $d$ is the diameter of our computer in its greatest spacetime dimension, because the data we can touch is bounded by the spacetime volume we occupy. $O(d^4)$ is polynomial.

Machine learning search spaces are often exponential or hyperexponential. If your search space is exponential and you collect data polynomially then your data is sparse. When you have sparse data, you must compensate with strong priors. Big data uses weak priors. Therefore big data approaches to machine learning cannot, in general, handle small data.
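
A toy calculation makes the mismatch concrete. The numbers below are purely illustrative (a generous data budget versus a search space over a few hundred binary features), not drawn from any real system:

```python
# Illustrative only: polynomial data collection vs. an exponential search space.
data_budget = 10 ** 18   # roughly one sample per nanosecond for 30 years
search_space = 2 ** 300  # every assignment of 300 binary features

print(f"search space ≈ 10^{len(str(search_space)) - 1} configurations")
print(f"fraction visited even once ≈ {data_budget / search_space:.1e}")
```

Even that absurdly generous data budget leaves the overwhelming majority of the space unvisited, which is what “sparse data” means here.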

Statistical Bias

Past performance is no guarantee of future results.

Suppose you want to estimate the variance of a Gaussian distribution. You could sample $n$ points and then compute the variance of the sample.

If you did, you’d be wrong. In particular, you’d underestimate the variance by a factor of $\frac{n-1}{n}$. The equation for the sample standard deviation corrects for this and uses $n-1$ in the denominator.

An estimate of the variance of a Gaussian distribution based solely on historical data, without adjusting for statistical bias, will underestimate the variance of the underlying distribution.

Underestimating the variance by a factor of $\frac{n-1}{n}$ can be solved by throwing training data at the problem, because the bias vanishes as $n$ approaches infinity. Other learning environments are not so kind.
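
For the skeptical reader, this is easy to check numerically. Below is a minimal Monte Carlo sketch (standard library only; the sample size and trial count are arbitrary choices of mine) comparing the divide-by-$n$ estimator with the divide-by-$(n-1)$ version on a standard normal:

```python
import random

# Monte Carlo sanity check (illustrative sketch): the naive variance
# estimator, which divides by n, underestimates the true variance by a
# factor of (n - 1) / n on average. Dividing by n - 1 (Bessel's
# correction) removes the bias.
random.seed(0)
n, trials = 5, 100_000

naive_total = corrected_total = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]  # true variance = 1
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    naive_total += sum_sq / n            # biased: divides by n
    corrected_total += sum_sq / (n - 1)  # unbiased: divides by n - 1

print(naive_total / trials)      # ≈ 0.8, i.e. (n - 1) / n of the truth
print(corrected_total / trials)  # ≈ 1.0
```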

Divergent Series

Big data uses weak priors. Correcting for bias is a prior. Big data approaches to machine learning therefore have no built-in method of correcting for bias[1]. Big data thus assumes that historical data is representative of future data.

To state this more precisely, suppose that we are dealing with a variable $X_i$ where $i \in \{1, 2, 3, \dots\}$. In order to predict $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i$ from past performance $X_1, X_2, \dots, X_n$, it must be true that such a limit exists.

Sometimes no such limit exists. Suppose $X_i$ equals 1 for all positive integers $i$ whose most significant digit (in decimal representation) is odd and 0 for all positive integers $i$ whose most significant digit is even.

Suppose we want to predict the probability that an integer’s first significant digit is odd.

The running average $\frac{1}{n} \sum_{i=1}^{n} X_i$ never converges. It oscillates from roughly ½ up to just over ¾ and back, forever. You cannot solve this problem by minimizing your error over historical data. Insofar as big data minimizes an algorithm’s error over historical results, domains like this will be forever out-of-bounds to it.
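
If you want to see the oscillation for yourself, here is a short numerical check (the checkpoint values and helper name are my own choices). The running average keeps swinging between about 0.51 and about 0.78 no matter how far out you go:

```python
# Illustrative sketch: track the running average of "most significant
# digit is odd" and sample it at the end of each leading-1 block
# (peaks) and each leading-8 block (troughs). The peaks hover near
# 7/9 ≈ 0.78 and the troughs near 41/81 ≈ 0.51, at every scale.

def most_significant_digit_is_odd(i: int) -> int:
    return int(str(i)[0]) % 2

peaks = {2 * 10**k - 1 for k in range(1, 6)}    # 19, 199, ..., 199999
troughs = {9 * 10**k - 1 for k in range(1, 6)}  # 89, 899, ..., 899999
checkpoints = peaks | troughs

odd_so_far = 0
for n in range(1, max(checkpoints) + 1):
    odd_so_far += most_significant_digit_is_odd(n)
    if n in checkpoints:
        label = "peak  " if n in peaks else "trough"
        print(f"{label}  n = {n:>6}  running average = {odd_so_far / n:.4f}")
```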

Big data compensates for weak priors by minimizing an algorithm’s error over historical results. Insofar as this is true, big data cannot reason about small data.

Small Data

Yet, human beings can predict once-per-century events. Few of us can do it, but it can be done. How?

Transfer learning. We human beings use a problem’s context to influence our priors.

So can we just feed all of the Internet into a big data system to create a general-purpose machine learning algorithm? No, because when you feed in arbitrary data it’s not just the data that increases in dimensionality. Your search space of relationships between the input data increases even faster. Whenever a human being decides what data to feed into an artificial neural network, we are implicitly passing on our own priors about what constitutes relevant context. This division of labor between human and machine has enabled recent developments in machine learning like self-driving cars.
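
Rough numbers (illustrative only, not measurements of any real pipeline) show how fast the relationship space outruns the inputs themselves:

```python
from math import comb

# Illustrative: with n input signals there are C(n, 2) possible
# pairwise relationships and 2^n possible subsets that might be
# jointly relevant. The relationship space explodes long before the
# raw input dimensionality does.
for n in (10, 100, 1000):
    pairs = comb(n, 2)
    subsets_magnitude = len(str(2 ** n)) - 1  # order of magnitude of 2^n
    print(f"n = {n:>4}: pairs = {pairs:>7,}, subsets ≈ 10^{subsets_magnitude}")
```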

To remove the human from the equation, we need a system that can accept arbitrary input data without human curation for relevance. The problem is that feeding “everything” into a machine is close to feeding “nothing” into a machine, like how a fully connected graph contains exactly as much information as a fully disconnected graph.
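
The graph analogy can be made quantitative with a back-of-the-envelope description-length argument (a sketch, not a formal proof): an arbitrary graph on $n$ vertices needs about one bit per potential edge, while the complete and empty graphs are pinned down by $n$ alone.

```python
from math import comb, log2

# Back-of-the-envelope description lengths for graphs on n vertices.
n = 1000
arbitrary_bits = comb(n, 2)   # one bit per potential edge
extreme_bits = log2(n) + 1    # roughly: encode n, plus "complete or empty"

print(f"arbitrary graph : ~{arbitrary_bits:,} bits")
print(f"complete / empty: ~{extreme_bits:.0f} bits")
```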

Similar, but not equal. Consider Einstein. He saw beauty in the universe and then created the most beautiful theory that fit a particular set of data.

Beauty

Consider the sequence . What comes next?

  • It could be

  • It could be

  • It could be

  • It could be

You could say the answer[2] depends on one’s priors. That wouldn’t be wrong per se. But the word “priors” gets fuzzy around the corners when we’re talking about transfer learning. It would be more precise to say this depends on your sense of “beauty”.

The “right” answer is whichever one has minimal Kolmogorov complexity, i.e. whichever sequence is described by the shortest computer program. But for sparse data, Kolmogorov complexity depends more on your choice of programming language than on the actual data. It depends on the sense of beauty of whoever designed your development environment.

The most important thing in a programming language is what libraries you have access to. If the Fibonacci sequence is a standard library function and the identity operator is not, then the Fibonacci sequence has lower Kolmogorov complexity than the identity operator.

The library doesn’t even have to be standard. Any scrap of code lying around will do. In this way, Kolmogorov complexity, as evaluated in your local environment, is a subjective definition of beauty.
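
Here is a toy model of that idea (everything in it, from the mini-“library” to the description strings, is invented for illustration): define complexity as the length of the shortest description available in your local environment, and watch the ranking of hypotheses flip when the library changes.

```python
# Hypothetical toy model, not real Kolmogorov complexity: the
# "complexity" of a rule is the length of its shortest available
# description, where a description is either a library primitive or a
# from-scratch definition.

def complexity(descriptions: list[str], library: set[str]) -> int:
    """Shortest usable description length, in characters (toy model)."""
    usable = [d for d in descriptions if d in library or "lambda" in d]
    return min(len(d) for d in usable)

# Two rules for continuing a sequence, each with a library-call
# description and a from-scratch description.
fibonacci = ["fib", "f=lambda n: n if n<2 else f(n-1)+f(n-2)"]
identity  = ["id",  "lambda n: n"]

for name, library in [("fib in stdlib", {"fib"}), ("bare language", set())]:
    print(f"{name:>14}:",
          f"fibonacci = {complexity(fibonacci, library)} chars,",
          f"identity = {complexity(identity, library)} chars")
# With fib built in, fibonacci is "simpler" than identity;
# without it, identity wins.
```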

This is a flexible definition of “beauty”, as opposed to big data where “beauty” is hard-coded as the minimization of an error function over historical data.

Programming languages like Lisp let you program the language itself. System-defined macros are stored in the same hash table as user-defined macros. A small data system needs the same capability.

No algorithm without the freedom to self-alter its own error function can operate unsupervised on small data.

―Lsusr’s Second Law of Artificial Intelligence

To transcend big data, a computer program must be able to alter its own definition of beauty.
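
What might that look like in code? The sketch below is speculative and every name in it is invented; it only illustrates the structural requirement: the error function lives in the same registry the program can already modify, so redefining “beauty” uses the same mechanism as defining a hypothesis.

```python
# Speculative sketch (all names invented; not a claim about any
# existing system): keep the scoring rule in the same registry as the
# hypotheses it scores, so the program can rebind its own definition
# of "beauty" the same way Lisp code can redefine the macros it runs on.

registry = {}  # one table for hypotheses *and* for scoring rules

def define(name, fn):
    registry[name] = fn

# "System-defined" beauty: negative squared error over historical data.
define("beauty", lambda history, hypothesis:
       -sum((hypothesis(i) - y) ** 2 for i, y in enumerate(history)))

# Two candidate hypotheses about the process generating the data.
define("constant", lambda i: 1.0)
define("linear", lambda i: float(i))

def best_hypothesis(history, names):
    score = registry["beauty"]  # looked up at call time, never hard-coded
    return max(names, key=lambda name: score(history, registry[name]))

history = [0.0, 1.0, 2.0, 3.0]
print(best_hypothesis(history, ["constant", "linear"]))  # -> linear

# Because "beauty" lives in the same table as everything else, the
# program can rebind it with the same `define` call a user would use:
# here, a rule that only trusts the most recent observation.
define("beauty", lambda history, hypothesis:
       -(hypothesis(len(history) - 1) - history[-1]) ** 2)
print(best_hypothesis(history, ["constant", "linear"]))  # scored by the new rule
```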


  1. ↩︎

    Cross-validation corrects for overfitting. Cross-validation cannot fully eliminate statistical bias because the train and test datasets both constitute “historical data”.

  2. ↩︎